Overview

In this guide, we will set up a Python environment to automate crawling and converting a documentation site into a consolidated PDF. This involves using chromedriver for automated browsing and wkhtmltopdf for converting HTML content to PDF.

Prerequisites

Make sure Homebrew is installed on your Mac. If it isn’t, you can install it by running:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Step 1: Set Up the Python Environment

  1. Create a Virtual Environment
     To isolate the required packages, start by creating and activating a virtual environment:
     python3 -m venv venv
     source venv/bin/activate
  2. Install Required Python Packages
     Inside your virtual environment, install the following Python modules:
     pip install pyyaml pdfkit beautifulsoup4 requests tqdm
     Here’s a breakdown of the packages:
    • pyyaml: To handle YAML files (the site structure).
    • pdfkit: For converting HTML content to PDF.
    • beautifulsoup4: To parse and manipulate HTML content.
    • requests: For fetching HTML from web pages.
    • tqdm: For showing progress bars in the terminal.
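To see how beautifulsoup4 fits into the workflow, here is a minimal sketch of the link-extraction step a crawler like this performs. The sample HTML and the extract_links function are illustrative only; they are not part of the repository's scripts:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


sample = '<a href="/docs/onboard">Onboard</a>'
print(extract_links(sample, "https://example.com/docs/"))
# → ['https://example.com/docs/onboard']
```

In the real crawler, the HTML would come from requests (or a chromedriver-driven browser for JavaScript-rendered pages), and the loop over pages is where tqdm's progress bar comes in.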

Step 2: Install Required System Tools

  1. Install chromedriver
     chromedriver is essential for automating Chrome browser actions. Install it with Homebrew:
     brew install --cask chromedriver
     Once installed, make sure macOS allows it to execute by removing the quarantine attribute:
     xattr -d com.apple.quarantine /opt/homebrew/bin/chromedriver
     Verify the installation by checking the version:
     chromedriver --version
  2. Install wkhtmltopdf
     wkhtmltopdf converts HTML content to PDF. Install it using Homebrew:
     brew install wkhtmltopdf
     To confirm it’s accessible, run:
     wkhtmltopdf --version

Step 3: Run the Crawler Script

The crawler script will automate Chrome to navigate through the documentation, retrieve links, and save the structure in a YAML file.

  1. Run the Crawler
     Execute the crawler with the following command, specifying the starting URL and output YAML file:
     python3 crawler.py https://example.com/docs/ops/ops-overview --output site_structure.yaml
    • Starting URL: https://example.com/docs/ops/ops-overview
    • Output File: site_structure.yaml (a YAML file listing the documentation structure).
    The crawler will generate a YAML file containing all links found. For example:
    Ops Overview:
      internal_links:
        - https://example.com/docs/onboard
        - https://example.com/docs/onboard/configure-cloud-project
        ...
  2. Edit the YAML File
     Open the generated site_structure.yaml file in a text editor. Review it and manually remove any sections or links that you don’t want to include in the final PDF. This step lets you customize the final document.
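Once edited, the structure effectively boils down to an ordered list of URLs to convert. Here is a small pyyaml sketch of how that flattening works, using the same section/internal_links layout shown above (the real converter script may organize this differently):

```python
import yaml

# Inline example mirroring the structure the crawler writes out.
doc = """
Ops Overview:
  internal_links:
    - https://example.com/docs/onboard
    - https://example.com/docs/onboard/configure-cloud-project
"""

structure = yaml.safe_load(doc)

# Flatten the edited structure into the ordered list of pages to convert.
urls = [url for section in structure.values() for url in section.get("internal_links", [])]
print(urls)
```

Removing a line from the YAML file simply removes that URL from the list, which is why editing the file is all the customization you need.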

Step 4: Convert the Documentation to PDF

Once you’ve finalized the YAML file, run the conversion script to create the PDF.

  1. Run the Converter Script
     Use the following command to convert the pages in the YAML file to a single PDF:
     python3 convert-working.py site_structure.yaml --output combined_output.pdf
    • site_structure.yaml: Input YAML file with the site structure.
    • --output combined_output.pdf: Name of the output PDF file.
    This script performs the following tasks:
    • Fetches each URL listed in the YAML file.
    • Cleans up the content by removing unnecessary headers and footers.
    • Saves each page as an individual PDF.
    • Combines all PDFs into a single file (combined_output.pdf).
  2. View Your PDF
     The final output, combined_output.pdf, contains all the selected documentation pages in a single, neatly formatted PDF.
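The cleanup step is worth a closer look. A common approach is to strip site chrome with beautifulsoup4 before handing the HTML to pdfkit; the sketch below assumes headers and footers live in semantic <header>/<footer>/<nav> tags, which may not match every site (and may not match exactly what convert-working.py does):

```python
from bs4 import BeautifulSoup


def strip_chrome(html):
    """Remove header/footer/nav elements before the page is rendered to PDF."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["header", "footer", "nav"]):
        tag.decompose()
    return str(soup)


page = "<header>Site nav</header><main>Doc content</main><footer>(c) 2024</footer>"
print(strip_chrome(page))
# → <main>Doc content</main>
```

The cleaned HTML for each page is then rendered with pdfkit (which drives wkhtmltopdf under the hood) before the per-page PDFs are merged.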

Summary

You’ve now set up a complete workflow for converting documentation into a consolidated PDF. Here’s a quick recap of the steps:

  1. Set up the environment: Create a virtual environment and install the required dependencies.
  2. Install system tools: Install chromedriver and wkhtmltopdf using Homebrew.
  3. Run the crawler: Generate a structured YAML file of documentation links.
  4. Edit the YAML file: Customize the content by removing any unwanted links.
  5. Run the converter: Convert the selected documentation into a single PDF.

Files: https://github.com/mattclemons/crawler
