Open Source Project

SQLiteCrawler

A web crawler for those comfortable with Python and SQLite. It supports URL discovery, redirect tracking, content extraction, and crawl comparison, including comparison across sites (e.g., production vs. staging).

Note: This is a stepping stone I developed on the way to my own still WIP PostgreSQL-based crawler. I'm sharing it in case it's useful, but it's not a polished production tool.

🚀 Quick Start

# Basic crawl
python main.py https://example.com

# Crawl with JavaScript rendering
python main.py https://example.com --js

# Crawl with custom limits
python main.py https://example.com --max-pages 100 --max-depth 3

# Crawl comparison (origin vs staging)
python main.py https://example.com --compare-domain https://staging.example.com

History

SQLiteCrawler began as a stepping stone while I worked on my still-in-progress PostgreSQL-based crawler. It's a working prototype rather than a polished production tool, but it solved some specific problems I had with existing crawlers.

It started as a collection of ad-hoc scripts under the equally cleverly named SeoToolz, but grew into a more fully formed crawler aimed at a couple of personal pain points I have with the otherwise exceptional Screaming Frog.

✨ Key Features

Core Crawling

  • Persistent Frontier: Resume crawls from where you left off
  • Redirect Tracking: Complete redirect chain capture and storage
  • Content Extraction: Titles, meta descriptions, H1/H2 tags, robots directives, canonicals
  • Sitemap Discovery: Automatic XML sitemap parsing and URL discovery
  • Robots.txt Compliance: Respects crawling policies and analyzes crawlability
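A persistent frontier is what makes resumable crawls possible: pending URLs live in the database rather than in memory, so a restarted process simply picks up the shallowest un-crawled URL. The sketch below illustrates the idea with stdlib `sqlite3`; the table and column names are hypothetical, not SQLiteCrawler's actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS frontier (
    url    TEXT PRIMARY KEY,
    depth  INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'  -- 'pending' or 'done'
)
"""

def enqueue(conn, url, depth):
    # INSERT OR IGNORE means a re-discovered URL is not queued twice
    conn.execute(
        "INSERT OR IGNORE INTO frontier (url, depth) VALUES (?, ?)", (url, depth)
    )

def next_pending(conn):
    # Shallowest pending URL first; None once the crawl is finished
    return conn.execute(
        "SELECT url, depth FROM frontier WHERE status = 'pending' "
        "ORDER BY depth LIMIT 1"
    ).fetchone()

def mark_done(conn, url):
    conn.execute("UPDATE frontier SET status = 'done' WHERE url = ?", (url,))

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path and the frontier survives restarts
    conn.execute(SCHEMA)
    enqueue(conn, "https://example.com/", 0)
    enqueue(conn, "https://example.com/about", 1)
    url, depth = next_pending(conn)
    mark_done(conn, url)
```

Because the frontier is an ordinary table, "resume" is just reopening the database file and calling `next_pending` again.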

Advanced Analysis

  • Link Analysis: Internal/external link tracking with anchor text, XPath, and metadata
  • Schema.org Extraction: Extracts and validates JSON-LD, microdata, and RDFa structured data
  • Hreflang Support: Extracts and normalizes hreflang data from sitemaps
  • CSV Crawl Support: Crawl from predefined URL lists with restricted or seed modes
  • Content Hashing: SHA256 and SimHash for duplicate detection and content comparison
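SHA256 only catches byte-identical pages; SimHash also catches near-duplicates, because similar token sets produce fingerprints that differ in only a few bits. Here is a minimal 64-bit SimHash sketch (the tokenization and bit width are illustrative, not necessarily what SQLiteCrawler does):

```python
import hashlib

def _token_hash(token: str) -> int:
    # Stable 64-bit hash per token (Python's built-in hash() is salted per process)
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")

def simhash(text: str) -> int:
    """64-bit SimHash fingerprint over whitespace-separated tokens."""
    votes = [0] * 64
    for token in text.lower().split():
        h = _token_hash(token)
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:  # majority vote per bit position
            fingerprint |= 1 << bit
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    # Number of differing bits between two fingerprints
    return bin(a ^ b).count("1")
```

Two pages that share most of their wording end up a handful of bits apart, while unrelated pages land around 32 bits apart, so a small Hamming-distance threshold flags near-duplicates.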

Crawl Comparison

  • Origin vs Staging: Compare production and staging environments
  • Content Analysis: Track title, H1, meta description, and word count changes
  • URL Move Detection: Identify content moved via 301 redirects
  • Comprehensive Views: Detailed analysis of differences and issues
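The core of origin-vs-staging comparison is keying both crawls by URL path and diffing the extracted fields. This simplified sketch works on plain dicts rather than the crawler's database views, and the field names are stand-ins:

```python
def compare_crawls(origin: dict, staging: dict) -> dict:
    """Compare two crawls keyed by URL path.

    `origin` and `staging` map path -> {"title": ..., "h1": ...}, a
    simplified stand-in for the fields the crawler extracts.
    """
    report = {
        "missing_on_staging": sorted(set(origin) - set(staging)),
        "new_on_staging": sorted(set(staging) - set(origin)),
        "changed": {},
    }
    for path in set(origin) & set(staging):
        # Record (origin_value, staging_value) for every field that differs
        diffs = {
            field: (origin[path][field], staging[path].get(field))
            for field in origin[path]
            if origin[path][field] != staging[path].get(field)
        }
        if diffs:
            report["changed"][path] = diffs
    return report
```

Paths missing on staging often turn out to be the ones moved via 301 redirects, which is what the URL move detection then resolves.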

Performance & Reliability

  • HTTP/2 & Brotli Support: Modern HTTP/2 client with Brotli compression
  • Intelligent Frontier Scoring: Prioritizes URLs by depth, sitemap priority, and inlinks
  • Database Normalization: Efficient storage with URL IDs and compressed content
  • Async Performance: Concurrent requests with configurable limits
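Frontier scoring can be pictured as a single priority function over the three signals listed above. The weights below are purely illustrative, not the ones SQLiteCrawler actually uses:

```python
def frontier_score(depth, sitemap_priority=None, inlinks=0):
    """Toy priority score: higher scores are crawled sooner.

    Shallower pages, higher sitemap <priority> values (0.0-1.0),
    and more inlinks all raise the score.
    """
    score = 1.0 / (1 + depth)       # depth 0 -> 1.0, depth 3 -> 0.25
    if sitemap_priority is not None:
        score += sitemap_priority   # trust the sitemap's own weighting
    score += min(inlinks, 50) / 50  # cap inlink influence at 50 links
    return score
```

With a scheme like this, the frontier query becomes an `ORDER BY score DESC` instead of the plain depth ordering, so a heavily linked deep page can still jump the queue.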

🛠️ Installation

# Clone the repository
git clone https://github.com/user256/SQLiteCrawler.git
cd SQLiteCrawler

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -e .

# Optional: Install JavaScript rendering support
pip install -e .[js]
playwright install

📖 Usage Examples

Basic Crawling

# Simple crawl
python main.py https://example.com

# Crawl with JavaScript rendering
python main.py https://example.com --js

# Crawl with custom limits
python main.py https://example.com --max-pages 100 --max-depth 3

# Crawl with custom concurrency
python main.py https://example.com --concurrency 20 --delay 0.5

Authentication

# Basic authentication
python main.py https://example.com --auth-username user --auth-password pass --auth-type basic

# Digest authentication
python main.py https://example.com --auth-username user --auth-password pass --auth-type digest

# Bearer token
python main.py https://example.com --auth-token your-token --auth-type bearer

Crawl Comparison

# Basic comparison
python main.py https://example.com --compare-domain https://staging.example.com

# With commercial pages analysis
python main.py https://example.com --compare-domain https://staging.example.com --commercial-csv commercial.csv

# With detailed link comparison
python main.py https://example.com --compare-domain https://staging.example.com --compare-links

🔍 Example Queries

Basic Analysis

-- View crawled pages
SELECT url, title, h1_1, word_count, status_code 
FROM view_crawl_overview 
WHERE status_code = 200 
ORDER BY word_count DESC;

-- Check redirects
SELECT source_url, redirect_destination_url, chain_length 
FROM redirects 
WHERE chain_length > 1;

Comparison Analysis

-- View content differences
SELECT path, origin_title, staging_title, title_match
FROM view_content_differences 
WHERE overall_content_status = 'Content differences detected';

-- Check URL moves
SELECT path, moved_from_path, moved_to_path 
FROM view_url_moves;
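The same queries can be run from Python with the stdlib `sqlite3` module. The snippet below demonstrates the redirects query against an in-memory table whose columns mirror the query above; to use it for real, point `connect()` at your crawl database file instead of `:memory:` and drop the setup rows.

```python
import sqlite3

# Demo setup only: a stand-in redirects table with the columns the query uses.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE redirects "
    "(source_url TEXT, redirect_destination_url TEXT, chain_length INTEGER)"
)
conn.executemany(
    "INSERT INTO redirects VALUES (?, ?, ?)",
    [
        ("https://example.com/old", "https://example.com/new", 2),
        ("https://example.com/a", "https://example.com/b", 1),
    ],
)

# The same query as above: only multi-hop redirect chains
rows = conn.execute(
    "SELECT source_url, redirect_destination_url, chain_length "
    "FROM redirects WHERE chain_length > 1"
).fetchall()
print(rows)
```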