Open Source Project

SQLiteCrawler

A web crawler for those comfortable with Python and SQLite. It supports URL discovery, redirect tracking, content extraction, and crawl comparison, including comparison across sites (e.g., production vs. staging).

Note: This is a stepping stone I developed on the way to my own still WIP PostgreSQL-based crawler. I'm sharing it in case it's useful, but it's not a polished production tool.

🚀 Quick Start

# Basic crawl
python main.py https://example.com

# Crawl with JavaScript rendering
python main.py https://example.com --js

# Crawl with custom limits
python main.py https://example.com --max-pages 100 --max-depth 3

# Crawl comparison (origin vs staging)
python main.py https://example.com --compare-domain https://staging.example.com

History

SQLiteCrawler began as a stepping stone while I worked on my still-in-progress PostgreSQL-based crawler. It's a working prototype rather than a polished production tool, but it solved some specific problems I had with existing crawlers.

It started as a collection of ad-hoc scripts under the equally cleverly named SeoToolz, but grew into a more fully formed crawler aimed at a couple of personal pain points I have with the otherwise exceptional Screaming Frog.

✨ Key Features

Core Crawling

  • Persistent Frontier: Resume crawls from where you left off
  • Redirect Tracking: Complete redirect chain capture and storage
  • Content Extraction: Titles, meta descriptions, H1/H2 tags, robots directives, canonicals
  • Sitemap Discovery: Automatic XML sitemap parsing and URL discovery
  • Robots.txt Compliance: Respects crawling policies and analyzes crawlability
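A persistent frontier is what makes resumable crawls possible: pending URLs live in the database rather than in memory, so a restarted process simply picks up the shallowest un-crawled URL. The sketch below illustrates the idea with stdlib `sqlite3`; the table and column names are hypothetical, not SQLiteCrawler's actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS frontier (
    url    TEXT PRIMARY KEY,
    depth  INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'  -- 'pending' or 'done'
)
"""

def enqueue(conn, url, depth):
    # INSERT OR IGNORE means a re-discovered URL is not queued twice
    conn.execute(
        "INSERT OR IGNORE INTO frontier (url, depth) VALUES (?, ?)", (url, depth)
    )

def next_pending(conn):
    # Shallowest pending URL first; None once the crawl is finished
    return conn.execute(
        "SELECT url, depth FROM frontier WHERE status = 'pending' "
        "ORDER BY depth LIMIT 1"
    ).fetchone()

def mark_done(conn, url):
    conn.execute("UPDATE frontier SET status = 'done' WHERE url = ?", (url,))

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path and the frontier survives restarts
    conn.execute(SCHEMA)
    enqueue(conn, "https://example.com/", 0)
    enqueue(conn, "https://example.com/about", 1)
    url, depth = next_pending(conn)
    mark_done(conn, url)
```

Because the frontier is an ordinary table, "resume" is just reopening the database file and calling `next_pending` again.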

Advanced Analysis

  • Link Analysis: Internal/external link tracking with anchor text, XPath, and metadata
  • Schema.org Extraction: Extracts and validates JSON-LD, microdata, and RDFa structured data
  • Hreflang Support: Extracts and normalizes hreflang data from sitemaps
  • CSV Crawl Support: Crawl from predefined URL lists with restricted or seed modes
  • Content Hashing: SHA256 and SimHash for duplicate detection and content comparison
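SHA256 only catches byte-identical pages; SimHash also catches near-duplicates, because similar token sets produce fingerprints that differ in only a few bits. Here is a minimal 64-bit SimHash sketch (the tokenization and bit width are illustrative, not necessarily what SQLiteCrawler does):

```python
import hashlib

def _token_hash(token: str) -> int:
    # Stable 64-bit hash per token (Python's built-in hash() is salted per process)
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")

def simhash(text: str) -> int:
    """64-bit SimHash fingerprint over whitespace-separated tokens."""
    votes = [0] * 64
    for token in text.lower().split():
        h = _token_hash(token)
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:  # majority vote per bit position
            fingerprint |= 1 << bit
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    # Number of differing bits between two fingerprints
    return bin(a ^ b).count("1")
```

Two pages that share most of their wording end up a handful of bits apart, while unrelated pages land around 32 bits apart, so a small Hamming-distance threshold flags near-duplicates.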

Crawl Comparison

  • Origin vs Staging: Compare production and staging environments
  • Content Analysis: Track title, H1, meta description, and word count changes
  • URL Move Detection: Identify content moved via 301 redirects
  • Comprehensive Views: Detailed analysis of differences and issues
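The core of origin-vs-staging comparison is keying both crawls by URL path and diffing the extracted fields. This simplified sketch works on plain dicts rather than the crawler's database views, and the field names are stand-ins:

```python
def compare_crawls(origin: dict, staging: dict) -> dict:
    """Compare two crawls keyed by URL path.

    `origin` and `staging` map path -> {"title": ..., "h1": ...}, a
    simplified stand-in for the fields the crawler extracts.
    """
    report = {
        "missing_on_staging": sorted(set(origin) - set(staging)),
        "new_on_staging": sorted(set(staging) - set(origin)),
        "changed": {},
    }
    for path in set(origin) & set(staging):
        # Record (origin_value, staging_value) for every field that differs
        diffs = {
            field: (origin[path][field], staging[path].get(field))
            for field in origin[path]
            if origin[path][field] != staging[path].get(field)
        }
        if diffs:
            report["changed"][path] = diffs
    return report
```

Paths missing on staging often turn out to be the ones moved via 301 redirects, which is what the URL move detection then resolves.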

Performance & Reliability

  • HTTP/2 & Brotli Support: Modern HTTP/2 client with Brotli compression
  • Intelligent Frontier Scoring: Prioritizes URLs by depth, sitemap priority, and inlinks
  • Database Normalization: Efficient storage with URL IDs and compressed content
  • Async Performance: Concurrent requests with configurable limits
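Frontier scoring can be pictured as a single priority function over the three signals listed above. The weights below are purely illustrative, not the ones SQLiteCrawler actually uses:

```python
def frontier_score(depth, sitemap_priority=None, inlinks=0):
    """Toy priority score: higher scores are crawled sooner.

    Shallower pages, higher sitemap <priority> values (0.0-1.0),
    and more inlinks all raise the score.
    """
    score = 1.0 / (1 + depth)       # depth 0 -> 1.0, depth 3 -> 0.25
    if sitemap_priority is not None:
        score += sitemap_priority   # trust the sitemap's own weighting
    score += min(inlinks, 50) / 50  # cap inlink influence at 50 links
    return score
```

With a scheme like this, the frontier query becomes an `ORDER BY score DESC` instead of the plain depth ordering, so a heavily linked deep page can still jump the queue.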

🛠️ Installation

# Clone the repository
git clone https://github.com/user256/SQLiteCrawler.git
cd SQLiteCrawler

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -e .

# Optional: Install JavaScript rendering support
pip install -e .[js]
playwright install

📖 Usage Examples

Basic Crawling

# Simple crawl
python main.py https://example.com

# Crawl with JavaScript rendering
python main.py https://example.com --js

# Crawl with custom limits
python main.py https://example.com --max-pages 100 --max-depth 3

# Crawl with custom concurrency
python main.py https://example.com --concurrency 20 --delay 0.5

Authentication

# Basic authentication
python main.py https://example.com --auth-username user --auth-password pass --auth-type basic

# Digest authentication
python main.py https://example.com --auth-username user --auth-password pass --auth-type digest

# Bearer token
python main.py https://example.com --auth-token your-token --auth-type bearer

Crawl Comparison

# Basic comparison
python main.py https://example.com --compare-domain https://staging.example.com

# With commercial pages analysis
python main.py https://example.com --compare-domain https://staging.example.com --commercial-csv commercial.csv

# With detailed link comparison
python main.py https://example.com --compare-domain https://staging.example.com --compare-links

🔍 Example Queries

Basic Analysis

-- View crawled pages
SELECT url, title, h1_1, word_count, status_code 
FROM view_crawl_overview 
WHERE status_code = 200 
ORDER BY word_count DESC;

-- Check redirects
SELECT source_url, redirect_destination_url, chain_length 
FROM redirects 
WHERE chain_length > 1;

Comparison Analysis

-- View content differences
SELECT path, origin_title, staging_title, title_match
FROM view_content_differences 
WHERE overall_content_status = 'Content differences detected';

-- Check URL moves
SELECT path, moved_from_path, moved_to_path 
FROM view_url_moves;
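The same queries can be run from Python with the stdlib `sqlite3` module. The snippet below demonstrates the redirects query against an in-memory table whose columns mirror the query above; to use it for real, point `connect()` at your crawl database file instead of `:memory:` and drop the setup rows.

```python
import sqlite3

# Demo setup only: a stand-in redirects table with the columns the query uses.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE redirects "
    "(source_url TEXT, redirect_destination_url TEXT, chain_length INTEGER)"
)
conn.executemany(
    "INSERT INTO redirects VALUES (?, ?, ?)",
    [
        ("https://example.com/old", "https://example.com/new", 2),
        ("https://example.com/a", "https://example.com/b", 1),
    ],
)

# The same query as above: only multi-hop redirect chains
rows = conn.execute(
    "SELECT source_url, redirect_destination_url, chain_length "
    "FROM redirects WHERE chain_length > 1"
).fetchall()
print(rows)
```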