Web scraping is the process of automatically extracting data from websites. It's an incredibly useful technique for gathering information that would be tedious and time-consuming to collect by hand. Python is one of the most popular languages for web scraping thanks to its simplicity and the many libraries available.
While you can scrape websites using basic Python libraries like Requests and BeautifulSoup, the Scrapy framework takes things to the next level. Scrapy is a powerful and complete web scraping toolkit that handles many of the common tasks and challenges associated with harvesting data from the web.
In this beginner's guide, we'll walk through how to use Scrapy to easily scrape data from websites. By the end, you'll know how to extract information from a single page, crawl and scrape an entire website, and output the collected data. Let's get started!
What is Scrapy?
Scrapy is an open-source Python framework for writing web spiders to systematically crawl websites and extract structured data. It provides a complete ecosystem for web scraping, including:
- Built-in support for selecting and extracting data using CSS selectors and XPath expressions
- An interactive shell for testing your scraping code
- Robust encoding support and auto-detection
- Built-in support for generating feed exports in multiple formats (JSON, CSV, XML)
- A media pipeline for downloading and storing files (images, PDFs, etc.)
- Caching and cookie persistence
- Support for crawling sites that require authentication
- Customizable middleware components for filtering requests/responses
- Built-in extensions for handling common tasks like cookie handling, user-agent spoofing, restricting crawling depth, and more
- A telnet console for hooking into and inspecting a running crawler
In other words, Scrapy provides a complete and mature web scraping solution, saving you from having to reinvent the wheel. It powers the scraping infrastructure of many well-known companies.
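Many of these features are switched on and tuned through a project's settings.py file. As a quick taste, here are a few of Scrapy's built-in settings covering caching, cookies, depth limiting, politeness, and user-agent spoofing (the values shown are purely illustrative):

# settings.py (illustrative values)
HTTPCACHE_ENABLED = True   # cache responses on disk between runs
COOKIES_ENABLED = True     # persist cookies across requests
DEPTH_LIMIT = 3            # don't follow links more than 3 hops from the start URLs
DOWNLOAD_DELAY = 0.5       # be polite: wait half a second between requests
USER_AGENT = 'my_scraper (+https://www.example.com)'  # identify your crawler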
The major concepts in Scrapy are:
- Spiders – classes that define how to crawl and parse pages for a particular website
- Selectors – classes used to extract data from web pages using CSS selectors or XPath expressions
- Items – containers for the scraped data; define the data you want to scrape
- Item Loaders – classes that provide a convenient way to populate items
- Item Pipelines – classes that process the scraped items (e.g. cleansing, validation, deduplication, storing in a database)
- Middlewares – hook into Scrapy's request/response processing; allow you to modify requests and responses or take custom actions
While this may sound like a lot, you'll quickly see how all these pieces elegantly work together as we build some practical web scrapers. The beauty of Scrapy is that it provides an opinionated yet flexible framework for productively writing maintainable spiders.
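To make the Items concept a little more concrete, here is a minimal sketch of an item definition (the field names simply mirror the product data we'll scrape later in this guide):

import scrapy

class ProductItem(scrapy.Item):
    # Each Field() declares a piece of data we plan to scrape
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()

In the examples below we'll keep things simple and yield plain dicts instead, but switching to items later is straightforward.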
Installing Scrapy
Before we can start scraping, we need to set up our development environment. It's highly recommended to install Scrapy inside a dedicated virtual environment to isolate its dependencies from other Python projects.
First make sure you have Python and pip installed. Then open a terminal and run:
pip install scrapy
This will install Scrapy and its dependencies. You can confirm it's properly installed by running:
scrapy version
This should print the installed version. At the time of writing, the latest Scrapy release is 2.5.1.
Creating a Basic Spider
We'll start by creating a Scrapy spider to scrape data from a single web page. To keep things simple, we'll use Scrapy's built-in project templates to set up our spider.
In your terminal, navigate to the directory where you want to create your Scrapy project and run:
scrapy startproject product_scraper
This will generate a `product_scraper` directory with the following structure:
product_scraper/
    scrapy.cfg            # configuration file
    product_scraper/      # project Python module
        __init__.py
        items.py          # items definition file
        middlewares.py    # middlewares file
        pipelines.py      # pipelines file
        settings.py       # settings file
        spiders/          # directory to store spiders
            __init__.py
Next, go into the `product_scraper` directory:
cd product_scraper
And create a new spider using Scrapy's `genspider` command:
scrapy genspider product_spider example.com
The first argument is the name we're giving our spider, and the second is the domain of the website we want to scrape.
This will generate a basic spider template in `product_scraper/spiders/product_spider.py` that looks something like:
import scrapy


class ProductSpiderSpider(scrapy.Spider):
    name = 'product_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
The `name` attribute is how we'll refer to our spider. The `allowed_domains` list specifies the domains this spider is allowed to crawl, which helps make sure the spider doesn't wander off and start crawling the entire web. The `start_urls` list contains the initial URLs the spider will start crawling from. Finally, the `parse` method is where we'll write the code to extract data from each response.
Now let's modify this template for our product scraping task. For this example, we'll scrape a single product page from this dummy ecommerce site: https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/
Update the spider code to:
import scrapy


class ProductSpider(scrapy.Spider):
    name = 'product_spider'
    allowed_domains = ['clever-lichterman-044f16.netlify.app']
    start_urls = ['https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/']

    def parse(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('span.my-4::text').get(),
            'description': response.css('div.mb-5 p::text').get(),
        }
This minimal spider:
- Starts crawling at the given product URL
- Extracts the product name, price, and description using CSS selectors
- Yields the scraped data as a Python dict
To extract the data, we're using Scrapy's built-in CSS selectors, which work similarly to jQuery selections. For example, `response.css('h1::text').get()` selects the text content of the first `h1` element on the page.
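If you want to experiment with selectors before wiring them into a spider, you can open an interactive session with `scrapy shell <url>`, or feed a snippet of HTML to Scrapy's Selector class directly. Here's a small sketch using made-up markup:

from scrapy.selector import Selector

html = '<html><body><h1>Taba Cream</h1><span class="my-4">20.00$</span></body></html>'
sel = Selector(text=html)

print(sel.css('h1::text').get())         # Taba Cream
print(sel.css('span.my-4::text').get())  # 20.00$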
Let's run our spider and see it in action:
scrapy crawl product_spider -O product.json
This runs `product_spider` and, thanks to the `-O` flag, writes the scraped data to a `product.json` file, overwriting it if it already exists. When the crawl finishes, you should see the scraped product data in JSON format:
[
    {
        "name": "Taba Cream",
        "price": "20.00$",
        "description": "Taba Cream is a skin product used for treating spots, scars, dry skin..."
    }
]
And there you have it! Your first Scrapy spider. As you can see, Scrapy makes it very easy to quickly extract data from a page.
Of course, this is just a basic example. In a real project, you'll likely want to do more with the scraped data, such as cleaning it up, transforming it, validating it, and storing it in a database.
Scrapy supports all of this through its item and item pipeline abstractions. We won't cover them in depth here, but the basic idea is:
- Define your data model as a Scrapy `Item` class (similar to a Django model)
- Write an item pipeline to process and store `Item` instances (sketched below)
- Enable the pipeline in your project's `settings.py`
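As a rough sketch of how these pieces fit together, here is a hypothetical pipeline that strips the currency symbol from the scraped price, plus the settings entry that enables it (the class and module names assume the project layout generated earlier):

# product_scraper/pipelines.py
class PriceCleanerPipeline:
    def process_item(self, item, spider):
        # Normalize the price field, e.g. "20.00$" -> "20.00"
        price = item.get('price')
        if price:
            item['price'] = price.replace('$', '').strip()
        return item

# product_scraper/settings.py
ITEM_PIPELINES = {
    'product_scraper.pipelines.PriceCleanerPipeline': 300,  # lower numbers run first
}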
Scraping Multiple Pages
So far we've only scraped a single page. But usually you'll want to crawl and scrape many pages. There are a few ways to accomplish this with Scrapy.
The simplest approach is to just add multiple URLs to the spider's `start_urls` list. For example:
start_urls = [
    'https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/',
    'https://clever-lichterman-044f16.netlify.app/products/taba-facewash.1/',
    'https://clever-lichterman-044f16.netlify.app/products/taba-shampoo.1/',
]
However, this is tedious if you need to scrape a large number of pages. Instead, you'll usually want to crawl pages by following links.
Link Following
To recursively follow links, you can use Scrapy's `response.follow` method. For example, let's say you want to scrape all products listed on a category page. The spider would look something like:
import scrapy


class CategorySpider(scrapy.Spider):
    name = 'category_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/products/']

    def parse(self, response):
        # Extract every product link on the category listing page and follow it
        products = response.css('div.product-list a')
        for product_link in products:
            yield response.follow(product_link, callback=self.parse_product)

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('span.my-4::text').get(),
            'description': response.css('div.mb-5 p::text').get(),
        }
This spider:
- Starts at the category listing page
- Extracts all the product detail page links
- Follows each of those links and calls the `parse_product` method on the response
- Yields the scraped product data from `parse_product`
The real power of Scrapy is its ability to handle crawling for you through these kinds of spiders. You simply define your starting URLs and tell it which links to recursively follow.
The `CrawlSpider` class takes this a step further by allowing you to define crawling rules in a more declarative way using regular expressions. We won't cover it in depth here.
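To give you a flavor, though, here is a rough sketch of a CrawlSpider (the allow pattern and URLs are illustrative, not a real site):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ProductCrawlSpider(CrawlSpider):
    name = 'product_crawl_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/products/']

    # Follow every link whose URL matches /products/ and parse it as a product page
    rules = (
        Rule(LinkExtractor(allow=r'/products/'), callback='parse_product', follow=True),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('span.my-4::text').get(),
        }

Note that CrawlSpider reserves the `parse` method for its own link-following logic, so your callbacks need a different name (like `parse_product` here).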
Handling Pagination
Many listing pages implement pagination, so you‘ll need to follow the "next page" links to scrape all items.
Scrapy makes this straightforward by providing a shortcut for following pagination links:
def parse(self, response):
    products = response.css('div.product-list a')
    for product_link in products:
        yield response.follow(product_link, callback=self.parse_product)

    next_page = response.css('a.next-page::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
Here we:
- Extract and follow the product links as before
- Check for a "next page" link
- If there is one, follow it and call the `parse` method on the response to extract the products on that next page
Scrapy will continue following "next page" links until it doesn't find any more, at which point it will have scraped all the paginated items.
Handling Dynamic Content
One issue you may run into while scraping is content that is dynamically added to the page via JavaScript. Since Scrapy doesn't execute JavaScript by default, it won't see that dynamic content in the response HTML.
There are a few ways to handle this:
- See if you can get the data another way, such as looking for an underlying API endpoint that returns the data as JSON
- Use a headless browser like Puppeteer or Playwright to load the page, let the JavaScript execute, and then pass the rendered HTML to Scrapy
- Use the Splash JavaScript rendering service together with Scrapy-Splash to integrate JavaScript support into your spiders
Explaining all these approaches is beyond the scope of this article, but the key takeaway is that it's possible to scrape websites that use JavaScript; you just need to find a way to let that JavaScript execute, or fetch the underlying data directly, before handing the result to Scrapy.
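As an illustration of the first approach, suppose your browser's network tab reveals that a page loads its data from a JSON endpoint. The endpoint and response fields below are entirely hypothetical, but the spider pattern would look something like this:

import scrapy


class ApiSpider(scrapy.Spider):
    name = 'api_spider'
    # Hypothetical endpoint - use whatever URL you find in the network tab
    start_urls = ['https://example.com/api/products?page=1']

    def parse(self, response):
        data = response.json()  # parse the JSON body (Scrapy 2.2+)
        for product in data.get('products', []):
            yield {
                'name': product.get('name'),
                'price': product.get('price'),
            }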
Closing Thoughts
You should now have a good grasp of how Scrapy can help you productively build robust web scrapers. We've only just scratched the surface of what Scrapy can do, but the basic concepts of spiders, selectors, and response parsing will get you a long way.
Some additional topics to explore:
- Exporting to different formats like CSV, XML, and databases
- Using spider arguments to make your spiders reusable and customizable
- Leveraging Scrapy's built-in logging
- Feeding in data from files
- Pausing and resuming crawls
- Avoiding getting banned by websites
I encourage you to dive into the excellent official Scrapy documentation to continue your learning journey.
As you can see, Scrapy is an incredibly powerful tool for web scraping. While it has a bit of a learning curve compared to using plain Requests and BeautifulSoup, it's well worth the investment for any serious scraping project.
So what are you waiting for? Go forth and scrape (ethically)!