A web crawler for those comfortable with Python and SQLite. It features URL discovery, redirect tracking, and content extraction and comparison, including across sites.
Note: This is a stepping stone I developed on the way to my own, still-WIP, PostgreSQL-based crawler. I'm sharing it in case it's useful, but it's not a polished production tool.
SQLiteCrawler is a working prototype that solved some specific problems I had with existing crawlers. It started as a collection of ad-hoc scripts under the equally cleverly named SeoToolz, but branched out into a more fully formed crawler aimed at a couple of personal pain points I have with the otherwise exceptional Screaming Frog.
# Clone the repository
git clone https://github.com/user256/SQLiteCrawler.git
cd SQLiteCrawler
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -e .
# Optional: Install JavaScript rendering support
pip install -e ".[js]"  # quotes stop the shell from globbing [js]
playwright install
# Simple crawl
python main.py https://example.com
# Crawl with JavaScript rendering
python main.py https://example.com --js
# Crawl with custom limits
python main.py https://example.com --max-pages 100 --max-depth 3
# Crawl with custom concurrency
python main.py https://example.com --concurrency 20 --delay 0.5
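The `--concurrency` and `--delay` flags bound how aggressively pages are fetched. As a rough sketch of that pattern (not the crawler's actual implementation), a semaphore caps in-flight requests while a per-request pause keeps the crawl polite; `fetch` here is a stand-in for a real HTTP request:

```python
import asyncio

async def fetch(url):
    # Stand-in for a real HTTP request.
    await asyncio.sleep(0)
    return url

async def crawl(urls, concurrency=20, delay=0.5):
    sem = asyncio.Semaphore(concurrency)  # at most `concurrency` requests in flight

    async def bounded(url):
        async with sem:
            result = await fetch(url)
            await asyncio.sleep(delay)  # politeness delay after each request
            return result

    # gather() preserves input order in its results.
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"https://example.com/page-{i}" for i in range(5)]
results = asyncio.run(crawl(urls, concurrency=2, delay=0.01))
print(results)
```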
# Basic authentication
python main.py https://example.com --auth-username user --auth-password pass --auth-type basic
# Digest authentication
python main.py https://example.com --auth-username user --auth-password pass --auth-type digest
# Bearer token
python main.py https://example.com --auth-token your-token --auth-type bearer
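For reference, this is roughly what the basic and bearer auth types put on the wire: a stdlib sketch of the header construction, not the crawler's own code. (Digest auth involves a server challenge/response round-trip and isn't shown here.)

```python
import base64

def basic_auth_header(username, password):
    # Basic auth sends base64("user:pass") in the Authorization header.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def bearer_auth_header(token):
    # Bearer auth sends the token as-is.
    return {"Authorization": f"Bearer {token}"}

print(basic_auth_header("user", "pass"))
# {'Authorization': 'Basic dXNlcjpwYXNz'}
print(bearer_auth_header("your-token"))
# {'Authorization': 'Bearer your-token'}
```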
# Basic comparison
python main.py https://example.com --compare-domain https://staging.example.com
# With commercial pages analysis
python main.py https://example.com --compare-domain https://staging.example.com --commercial-csv commercial.csv
# With detailed link comparison
python main.py https://example.com --compare-domain https://staging.example.com --compare-links
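Conceptually, the comparison joins the two crawls on URL path and flags field-level differences. A toy sketch of that idea, using hypothetical data and only the title field:

```python
def compare_titles(origin, staging):
    """Compare page titles per path across two crawls.

    origin/staging are {path: title} mappings; returns
    (path, origin_title, staging_title, match) rows for the
    union of paths, with None where a path is missing.
    """
    rows = []
    for path in sorted(set(origin) | set(staging)):
        o, s = origin.get(path), staging.get(path)
        rows.append((path, o, s, o == s))
    return rows

origin = {"/": "Home", "/about": "About Us"}
staging = {"/": "Home", "/about": "About"}
for row in compare_titles(origin, staging):
    print(row)
```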
-- View crawled pages
SELECT url, title, h1_1, word_count, status_code
FROM view_crawl_overview
WHERE status_code = 200
ORDER BY word_count DESC;
-- Check redirects
SELECT source_url, redirect_destination_url, chain_length
FROM redirects
WHERE chain_length > 1;
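As an illustration of what `chain_length` captures (hypothetical data, not the crawler's code), following a redirect mapping to its final destination looks like this; a direct A to B redirect has length 1, so the query above surfaces multi-hop chains:

```python
def resolve_chain(redirects, url, max_hops=10):
    """Follow url through a {source: destination} mapping.

    Returns (final_url, chain_length); max_hops guards
    against redirect loops.
    """
    chain_length = 0
    while url in redirects and chain_length < max_hops:
        url = redirects[url]
        chain_length += 1
    return url, chain_length

# Hypothetical example: /old -> /interim -> /new is a 2-hop chain.
hops = {"/old": "/interim", "/interim": "/new"}
print(resolve_chain(hops, "/old"))  # ('/new', 2)
```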
-- View content differences
SELECT path, origin_title, staging_title, title_match
FROM view_content_differences
WHERE overall_content_status = 'Content differences detected';
-- Check URL moves
SELECT path, moved_from_path, moved_to_path
FROM view_url_moves;
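Since everything lives in an ordinary SQLite file, these views can also be queried from Python with the stdlib `sqlite3` module. A sketch of the first query; the table here is an in-memory stand-in shaped like the `view_crawl_overview` columns above, whereas a real crawl database would be opened with `sqlite3.connect("your-crawl.db")`:

```python
import sqlite3

# In-memory stand-in for a crawl database, mimicking the
# view_crawl_overview columns used in the queries above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE view_crawl_overview "
    "(url TEXT, title TEXT, h1_1 TEXT, word_count INTEGER, status_code INTEGER)"
)
conn.executemany(
    "INSERT INTO view_crawl_overview VALUES (?, ?, ?, ?, ?)",
    [
        ("https://example.com/", "Home", "Welcome", 350, 200),
        ("https://example.com/missing", None, None, 0, 404),
    ],
)

# Same shape as the crawl-overview query above.
rows = conn.execute(
    "SELECT url, title, word_count FROM view_crawl_overview "
    "WHERE status_code = 200 ORDER BY word_count DESC"
).fetchall()
print(rows)
```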