The Definitive Guide to Web Crawling with Python

Web crawling is the process of programmatically visiting web pages to collect data and discover new pages by following links. It powers major services like search engines and online archives. Python is the most popular language for writing web crawlers due to its simplicity and extensive ecosystem. In this guide, you'll learn the fundamentals of web crawling, see how to build crawlers from first principles to an advanced framework, and understand the key challenges and best practices for crawling at scale.

The scale and growth of the web

The web is massive and still growing rapidly. There are over 1.9 billion websites as of 2022, containing over 6 billion individual pages^1. Every day, over 5 million new web pages are created^2. Major web properties update even more frequently – YouTube receives over 500 hours of new video content per minute^3.

Search engines like Google and Bing aim to crawl a significant portion of the web in order to provide up-to-date, comprehensive results to user queries. Google is estimated to have indexed over 50 billion web pages^4. Its crawlers process over 130 trillion pages per year^5.

This staggering scale introduces significant challenges for web crawling systems. Crawlers must:

  • Discover and prioritize new pages efficiently
  • Refresh existing content to ensure freshness
  • Distribute the crawling across many machines
  • Respect website owner preferences and constraints
  • Filter out spam and low quality content
  • Extract and process data in many formats

How web crawlers work

At their core, all web crawlers perform these basic steps:

  1. Download the HTML for a given URL
  2. Extract the links to other pages
  3. Add those links to a queue of pages to crawl
  4. Extract any relevant data and save it
  5. Repeat with the next URL in the queue

The crawler typically starts with a manually curated seed set of initial URLs. It then repeats the above steps until a stop condition is met, such as:

  • Reaching a maximum number of pages
  • Reaching a maximum crawl depth from the seeds
  • Exhausting the queue of discovered URLs
  • Hitting a time limit

Visualizing the flow:

graph LR
    A[Seed URLs] --> B(Download page)
    B --> C{Extract links}
    C -->|Add new links| D[URL queue]
    C -->|Extract data| E[Extracted data store]
    D --> B

Key decisions in the crawling algorithm include:

  • Prioritization and ordering of URLs in the queue (breadth-first vs depth-first; see the sketch after this list)
  • Re-visit policy for already crawled pages
  • Politeness settings like crawl delay and parallel requests per domain
  • Filter rules for allowed vs disallowed URLs
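
For example, the choice between breadth-first and depth-first largely comes down to which end of the queue you pop from. A minimal sketch (the URLs are placeholders for illustration):

from collections import deque

# A toy URL frontier; the entries are placeholders.
frontier = deque(['https://example.com/a', 'https://example.com/b', 'https://example.com/c'])

# Breadth-first ordering: pop the oldest discovered URL (FIFO).
bfs_next = frontier.popleft()   # 'https://example.com/a'

# Depth-first ordering: pop the most recently discovered URL (LIFO).
dfs_next = frontier.pop()       # 'https://example.com/c'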

Production crawlers introduce many more components for efficiency and robustness:

  • Distributed fetching across many worker nodes
  • URL de-duplication (typically via a Bloom filter; see the sketch after this list)
  • Persistent storage for the URL queue and results
  • Fault tolerance via checkpointing
  • Modularity via a plugin architecture
  • Monitoring and alerts
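
To make the de-duplication point concrete, here is a minimal Bloom filter sketch built on the standard library only. The sizes and hash count are illustrative; production crawlers typically use a tuned implementation rather than rolling their own:

import hashlib

class BloomFilter:
    """A tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive several bit positions from salted SHA-256 digests of the URL
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.size_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

seen = BloomFilter()
seen.add('https://example.com/')
print('https://example.com/' in seen)        # True
print('https://example.com/other' in seen)   # almost certainly False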

Example: Building a crawler from scratch

To make the basic concepts concrete, let's implement a simple crawler in Python using the requests and Beautiful Soup libraries plus the standard library's deque and urllib.parse.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

class Crawler:

    def __init__(self, seed_url):
        self.visited_urls = set()              # URLs already crawled
        self.url_queue = deque([seed_url])     # URLs waiting to be crawled

    def download_page(self, url):
        return requests.get(url).text

    def extract_links(self, html, base_url):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            link_url = link.get('href')
            if not link_url:
                continue
            # Resolve relative links against the current page's URL
            absolute_url = urljoin(base_url, link_url)
            if absolute_url.startswith(('http://', 'https://')):
                yield absolute_url

    def add_url_to_queue(self, url):
        if url and (url not in self.visited_urls) and (url not in self.url_queue):
            self.url_queue.append(url)

    def process_page(self, url, html):
        print('Crawling:', url)
        # TODO: Extract and save data
        for link_url in self.extract_links(html, url):
            self.add_url_to_queue(link_url)

    def crawl(self):
        while self.url_queue:
            current_url = self.url_queue.popleft()   # FIFO order -> breadth-first
            html = self.download_page(current_url)
            self.visited_urls.add(current_url)
            self.process_page(current_url, html)

Crawler('https://rickyspears.com/scraper/').crawl()

This hits the key points:

  • Uses a queue (via Python deque) for tracking pages to crawl with O(1) append and pop
  • Uses a set for tracking visited URLs for O(1) lookup
  • Extracts links and follows them via the queue
  • Avoids re-crawling already visited URLs
  • Normalizes relative URLs

However, it is missing major features needed for production:

  • Respecting robots.txt (see the sketch after this list)
  • Politeness / rate limiting
  • Error handling and retries
  • Saving results
  • Filtering URLs by patterns
  • Distributing across machines
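
To give a flavor of what closing the first two gaps involves, here is a hedged sketch that layers robots.txt checks and a fixed crawl delay onto the Crawler class above. The user agent string and one-second delay are illustrative only:

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class PoliteCrawler(Crawler):

    def __init__(self, seed_url, delay_seconds=1.0):
        super().__init__(seed_url)
        self.delay_seconds = delay_seconds
        self.robot_parsers = {}   # one cached robots.txt parser per host

    def allowed(self, url):
        host = urlparse(url).netloc
        if host not in self.robot_parsers:
            parser = RobotFileParser(f"https://{host}/robots.txt")
            try:
                parser.read()
            except OSError:
                parser = None    # robots.txt unreachable; decide your own policy here
            self.robot_parsers[host] = parser
        parser = self.robot_parsers[host]
        return parser is None or parser.can_fetch('ExampleBot/0.1', url)

    def download_page(self, url):
        if not self.allowed(url):
            return ''                       # skip disallowed URLs entirely
        time.sleep(self.delay_seconds)      # crude fixed politeness delay
        return super().download_page(url)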

It's OK for learning purposes, but for real-world projects, it's better to turn to a production-grade framework.

Advanced crawling with Scrapy

Scrapy is the most popular Python framework for building web crawlers. It powers crawlers for major companies like Yelp, Glovo, Zyte, Parse.ly and more. Scrapy provides:

  • Built-in concurrency via asynchronous requests
  • Middleware for respecting robots.txt, dropping duplicate requests, retrying on failure, logging, etc.
  • "Spider" classes for implementing the crawling logic
  • "Item Pipeline" classes for cleaning, saving and exporting the collected data
  • Detailed stats collection
  • CSS and XPath selectors for extracting data
  • Shell for interactively testing extraction
  • Extensibility via signals and plugins
  • Cloud-based platform Scrapy Cloud for easy deployment

The Scrapy architecture looks like:

[Scrapy Architecture Diagram]

The typical workflow is:

  1. URLs are submitted to the Engine to be scheduled
  2. The Engine takes the next URL from the Scheduler and sends a Request through the Downloader Middlewares to the Downloader
  3. The Downloader fetches the page and returns a Response
  4. The Response flows back through the Downloader and Spider Middlewares to the Spider
  5. The Spider parses the Response to extract links (Requests) and data (Items)
  6. Extracted Items flow through the Item Pipelines to be saved
  7. Extracted Requests flow back to the Engine to be scheduled

Here's what a Scrapy spider looks like:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        # Extract data from each article listed on the page
        for article in response.css('div.oxy-post'):
            yield {
                'title': article.css('a.oxy-post-title::text').get(),
                'author': article.css('div.oxy-post-meta-author::text').get(),
                'date': article.css('div.oxy-post-meta-date::text').get(),
            }

        # Follow the "older posts" link, if present, to crawl the next page
        older_posts = response.css('a.next::attr(href)').get()
        if older_posts:
            yield scrapy.Request(url=older_posts, callback=self.parse)

This defines a spider that:

  • Starts at https://www.zyte.com/blog/
  • Extracts article data using CSS selectors
  • Yields Python dicts for each article
  • Finds the "Next" link and follows it to crawl older posts
  • Leaves the spider-engine interactions, concurrent requests, stats, etc. to the framework

While this is a simple example, Scrapy spiders can implement arbitrarily complex crawling logic while leveraging the framework for the heavy lifting. See the Scrapy docs for details on all the features and best practices.
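
Items yielded by a spider like the one above are typically cleaned and validated in an Item Pipeline before being stored. Here is a minimal, hedged sketch; the class, module and field names are illustrative, while process_item and DropItem are the standard Scrapy hooks:

from scrapy.exceptions import DropItem

class RequireTitlePipeline:
    """Drop scraped articles that have no title; trim whitespace on the rest."""

    def process_item(self, item, spider):
        title = (item.get('title') or '').strip()
        if not title:
            raise DropItem(f"Missing title: {item!r}")
        item['title'] = title
        return item

# Enabled in the project's settings.py (the number sets pipeline order):
# ITEM_PIPELINES = {'myproject.pipelines.RequireTitlePipeline': 300}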

The challenges of production crawling

Scrapy makes it easy to get started with web crawling. But there are still challenges to address before you can crawl at commercial scale:

Avoiding crawler traps: Websites may contain infinite loops, calendar pages, faceted navigation, or other link patterns that cause a crawler to get stuck. Techniques to avoid traps include ignoring certain URL parameters, limiting crawl depth, and detecting repetitive URL patterns.
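
A simple version of two of those defenses, parameter stripping and a depth cap, might look like this; the parameter allow-list and the depth limit of 5 are purely illustrative:

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

ALLOWED_PARAMS = {'id', 'page'}   # illustrative allow-list of query parameters
MAX_DEPTH = 5                     # illustrative cap on link depth from the seeds

def normalize_url(url):
    # Drop fragments and any query parameters not on the allow-list
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept), fragment=''))

def should_enqueue(url, depth):
    return depth <= MAX_DEPTH

print(normalize_url('https://example.com/calendar?year=2024&month=1&id=7#day-3'))
# -> https://example.com/calendar?id=7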

Filtering content: Many pages contain duplicate or boilerplate content of little value, or content locked behind paywalls and login forms. Crawlers need to detect and skip this content to save resources. Techniques include rules based on URL patterns, content size, and the presence of certain HTML markers.
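
A rough sketch of such filtering based on text length and exact-duplicate hashing is below; the threshold is arbitrary, and real systems add near-duplicate detection (e.g. shingling or SimHash), which a hash of the raw text cannot catch:

import hashlib

MIN_TEXT_LENGTH = 500   # illustrative threshold for "too little content"
seen_fingerprints = set()

def is_worth_keeping(page_text):
    # Skip pages that are very short or exact duplicates of text seen before
    if len(page_text) < MIN_TEXT_LENGTH:
        return False
    fingerprint = hashlib.sha1(page_text.encode('utf-8')).hexdigest()
    if fingerprint in seen_fingerprints:
        return False
    seen_fingerprints.add(fingerprint)
    return True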

Anti-bot countermeasures: Websites increasingly try to block crawlers using CAPTCHAs, JavaScript challenges, browser fingerprinting and IP rate limits. Crawlers need to look like real users by rotating user agents, IP addresses, and potentially simulating full browser behavior.
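
The simplest layer of this, rotating the User-Agent header on each request, takes only a few lines; the header strings below are illustrative, proxy pools and headless browsers are well beyond a snippet, and whatever you do should stay within the target site's terms:

import random
import requests

# Illustrative User-Agent strings; a real pool would be larger and kept current
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0',
]

def fetch(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)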

Quality control: At scale, crawlers will inevitably hit some pages that fail to parse, get blocked, or return invalid data. Pipelines need to validate data, handle errors gracefully, and alert on quality issues.

Performance optimization: Large crawlers can bottleneck on CPU for HTML parsing, network for requests, memory for the URL queue, or disk for saving results. Optimizing performance requires profiling, tuning settings, and potentially distributing components across clusters.

Data consistency: Websites are constantly changing. Crawlers need to keep data fresh by recrawling on a schedule, while carefully prioritizing, since newly discovered URLs multiply faster than a finite crawl capacity can cover them.
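
One common way to balance freshness against finite capacity is a priority queue keyed by each URL's next due time; a minimal sketch, with an arbitrary one-day interval:

import heapq
import time

ONE_DAY = 24 * 60 * 60      # illustrative recrawl interval in seconds
recrawl_heap = []           # entries are (next_due_timestamp, url)

def schedule(url, interval=ONE_DAY):
    heapq.heappush(recrawl_heap, (time.time() + interval, url))

def due_urls(now=None):
    # Yield every URL whose recrawl time has arrived, soonest first
    now = now or time.time()
    while recrawl_heap and recrawl_heap[0][0] <= now:
        _, url = heapq.heappop(recrawl_heap)
        yield url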

Legal compliance: There's a lot of gray area around the legality of crawling. Crawlers should respect the letter and spirit of robots.txt to avoid complaints. Copyright law is complex when it comes to data reuse. And crawling any personal information introduces strict compliance requirements. Large crawlers need policies and sometimes legal counsel.

Use cases and case studies

Some of the applications for web crawlers include:

Search engines: Crawlers like GoogleBot, BingBot and BaiduSpider crawl the web to build search indexes. They use sophisticated priorities and recrawling strategies to ensure comprehensive, fresh results.

Price monitoring: Tools like PriceWars and PricePinx crawl retailer websites to track competitor prices and identify opportunities. They must account for sales and out-of-stock listings.

Job boards: Sites like Indeed and ZipRecruiter crawl company websites to aggregate job postings into a single interface. They normalize postings into a standard schema and remove duplicates.

Market research: VC funds like SignalFire and alternative-data providers like Thinknum and Yipit crawl the web to understand company and economic performance. They correlate data across sources to predict trends.

Web archives: The Internet Archive (archive.org) and Common Crawl aim to preserve the web for posterity and research. They perform broad crawls and make the raw data available to all.

Some case studies:

  • Indeed built a Scrapy-based crawler to power their job search engine. It crawls over 1M company websites with custom spiders and a multi-step Item Pipeline^7.
  • MDN Web Docs used Scrapy to migrate over 10k pages from their previous wiki platform. They leveraged Scrapy's extensibility to implement custom middleware and pipelines^8.
  • Sidecar, a marketing platform, replaced their legacy crawlers with Scrapy for improved flexibility and robustness, reducing failure rates from 20% to 0.23%^9.

Ethics and responsibility

With great crawling power comes great responsibility. Web crawlers can inadvertently cause harm to websites and people. Some ethical principles:

  • Always respect robots.txt. It's the universal mechanism for websites to signal their preferences to crawlers. Violating it is considered hostile.
  • Don't hit servers too hard. A flood of crawler requests can overload servers and block real users. Introduce politeness delays and monitor for signs of trouble.
  • Protect personal info. Be extremely careful about collecting any names, contact info, or other personally identifiable information. You likely need explicit consent.
  • Get permission for reuse. Republishing data collected via crawling can violate copyrights. The safe approach is to use it only for analysis. If you want to share data publicly, get permission first.
  • Consider the human impact. A crawler isn't just hitting a server, but potentially impacting people's livelihoods, privacy and well-being on the other side. Design crawlers with empathy for all.

Ethics in web crawling is still an emerging area as the technology pushes ahead of social norms and regulations. But upholding basic principles of respect, consent and care for fellow humans online is always the right thing to do.

Conclusion

Web crawling is a powerful tool for collecting data from the Internet. Python is the most popular language for writing crawlers due to strong libraries like Scrapy. But with that power comes the responsibility to crawl ethically, and crawling at scale requires addressing hard challenges.

If you're embarking on a web crawling project, here's a quick checklist:

  • Determine your crawling goals and metrics
  • Check for an API or structured data feeds first
  • Design your crawling strategy and data model
  • Evaluate open source and commercial crawling tools
  • Test extraction quality and throughput
  • Set up error monitoring and quality alerts
  • Ensure you're crawling legally and ethically
  • Have a plan to maintain the crawler over time

When you need data from the web, it's tempting to jump right in and start writing a crawler. But take the time upfront to plan your approach to save pain down the line. And remember, with great crawling power comes great responsibility. Now go forth and crawl for good!
