Web scraping is the process of automatically extracting data from websites. It's an incredibly useful technique for gathering information that would be tedious and time-consuming to collect by hand. Python is one of the most popular languages for web scraping thanks to its simplicity and the many libraries available.
While you can scrape websites using basic Python libraries like Requests and BeautifulSoup, the Scrapy framework takes things to the next level. Scrapy is a powerful and complete web scraping toolkit that handles many of the common tasks and challenges associated with harvesting data from the web.
In this beginner's guide, we'll walk through how to use Scrapy to easily scrape data from websites. By the end, you'll know how to extract information from a single page, crawl and scrape an entire website, and output the collected data. Let's get started!
What is Scrapy?
Scrapy is an open-source Python framework for writing web spiders to systematically crawl websites and extract structured data. It provides a complete ecosystem for web scraping, including:
- Built-in support for selecting and extracting data using CSS selectors and XPath expressions
- An interactive shell for testing your scraping code
- Robust encoding support and auto-detection
- Built-in support for generating feed exports in multiple formats (JSON, CSV, XML)
- A media pipeline for downloading and storing files (images, PDFs, etc.)
- Caching and cookie persistence
- Support for crawling sites that require authentication
- Customizable middleware components for filtering requests/responses
- Built-in extensions for handling common tasks like cookie handling, user-agent spoofing, restricting crawling depth, and more
- A telnet console for hooking into and inspecting a running crawler
In other words, Scrapy provides a complete and mature web scraping solution, saving you from having to reinvent the wheel. It powers the scraping infrastructure of many well-known companies.
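Many of these features are switched on and tuned through a project's settings.py file. As a quick taste, here are a few of Scrapy's built-in settings covering caching, cookies, depth limiting, politeness, and user-agent spoofing (the values shown are purely illustrative):

# settings.py (illustrative values)
HTTPCACHE_ENABLED = True   # cache responses on disk between runs
COOKIES_ENABLED = True     # persist cookies across requests
DEPTH_LIMIT = 3            # don't follow links more than 3 hops from the start URLs
DOWNLOAD_DELAY = 0.5       # be polite: wait half a second between requests
USER_AGENT = 'my_scraper (+https://www.example.com)'  # identify your crawler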
The major concepts in Scrapy are:
- Spiders – classes that define how to crawl and parse pages for a particular website
- Selectors – classes used to extract data from web pages using CSS selectors or XPath expressions
- Items – containers for the scraped data; define the data you want to scrape
- Item Loaders – classes that provide a convenient way to populate items
- Item Pipelines – classes that process the scraped items (e.g. cleansing, validation, deduplication, storing in a database)
- Middlewares – hook into Scrapy's request/response processing; allow you to modify requests and responses or take custom actions
While this may sound like a lot, you'll quickly see how all these pieces elegantly work together as we build some practical web scrapers. The beauty of Scrapy is that it provides an opinionated yet flexible framework for productively writing maintainable spiders.
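To make the Items concept a little more concrete, here is a minimal sketch of an item definition (the field names simply mirror the product data we'll scrape later in this guide):

import scrapy

class ProductItem(scrapy.Item):
    # Each Field() declares a piece of data we plan to scrape
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()

In the examples below we'll keep things simple and yield plain dicts instead, but switching to items later is straightforward.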
Installing Scrapy
Before we can start scraping, we need to set up our development environment. It's highly recommended to install Scrapy inside a dedicated virtual environment to isolate its dependencies from other Python projects.
First make sure you have Python and pip installed. Then open a terminal and run:
pip install scrapy
This will install Scrapy and its dependencies. You can confirm it's properly installed by running:
scrapy version
This should print the installed version. At the time of writing, the latest Scrapy release is 2.5.1.
Creating a Basic Spider
We'll start by creating a Scrapy spider to scrape data from a single web page. To keep things simple, we'll use Scrapy's built-in project templates to set up our spider.
In your terminal, navigate to the directory where you want to create your Scrapy project and run:
scrapy startproject product_scraper
This will generate a `product_scraper` directory with the following structure:
product_scraper/
    scrapy.cfg            # configuration file
    product_scraper/      # project Python module
        __init__.py
        items.py          # items definition file
        middlewares.py    # middlewares file
        pipelines.py      # pipelines file
        settings.py       # settings file
        spiders/          # directory to store spiders
            __init__.py
Next, go into the `product_scraper` directory:
cd product_scraper
And create a new spider using Scrapy's `genspider` command:
scrapy genspider product_spider example.com
The first argument is the name we're giving our spider, and the second is the domain of the website we want to scrape.
This will generate a basic spider template in `product_scraper/spiders/product_spider.py` that looks something like:
import scrapy


class ProductSpiderSpider(scrapy.Spider):
    name = 'product_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
The `name` attribute is how we'll refer to our spider. The `allowed_domains` list specifies the domains this spider is allowed to crawl, which helps make sure the spider doesn't wander off and start crawling the entire web. The `start_urls` list contains the initial URLs the spider will start crawling from. Finally, the `parse` method is where we'll write the code to extract data from each response.
Now let's modify this template for our product scraping task. For this example, we'll scrape a single product page from this dummy ecommerce site: https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/
Update the spider code to:
import scrapy


class ProductSpider(scrapy.Spider):
    name = 'product_spider'
    allowed_domains = ['clever-lichterman-044f16.netlify.app']
    start_urls = ['https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/']

    def parse(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('span.my-4::text').get(),
            'description': response.css('div.mb-5 p::text').get(),
        }
This minimal spider:
- Starts crawling at the given product URL
- Extracts the product name, price, and description using CSS selectors
- Yields the scraped data as a Python dict
To extract the data, we're using Scrapy's built-in CSS selectors, which work similarly to jQuery selections. For example, `response.css('h1::text').get()` selects the text content of the first `h1` element on the page.
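If you want to experiment with selectors before wiring them into a spider, you can open an interactive session with `scrapy shell <url>`, or feed a snippet of HTML to Scrapy's Selector class directly. Here's a small sketch using made-up markup:

from scrapy.selector import Selector

html = '<html><body><h1>Taba Cream</h1><span class="my-4">20.00$</span></body></html>'
sel = Selector(text=html)

print(sel.css('h1::text').get())         # Taba Cream
print(sel.css('span.my-4::text').get())  # 20.00$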
Let's run our spider and see it in action:
scrapy crawl product_spider -O product.json
This runs `product_spider` and, thanks to the `-O` flag, writes the scraped data to a `product.json` file, overwriting it if it already exists. When the crawl finishes, you should see the scraped product data in JSON format:
[
    {
        "name": "Taba Cream",
        "price": "20.00$",
        "description": "Taba Cream is a skin product used for treating spots, scars, dry skin..."
    }
]
And there you have it! Your first Scrapy spider. As you can see, Scrapy makes it very easy to quickly extract data from a page.
Of course, this is just a basic example. In a real project, you'll likely want to do more with the scraped data, such as cleaning it up, transforming it, validating it, and storing it in a database.
Scrapy supports all of this through its item and item pipeline abstractions. We won't cover them in depth here, but the basic idea is:
- Define your data model as a Scrapy `Item` class (similar to a Django model)
- Write an item pipeline to process and store `Item` instances (sketched below)
- Enable the pipeline in your project's `settings.py`
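As a rough sketch of how these pieces fit together, here is a hypothetical pipeline that strips the currency symbol from the scraped price, plus the settings entry that enables it (the class and module names assume the project layout generated earlier):

# product_scraper/pipelines.py
class PriceCleanerPipeline:
    def process_item(self, item, spider):
        # Normalize the price field, e.g. "20.00$" -> "20.00"
        price = item.get('price')
        if price:
            item['price'] = price.replace('$', '').strip()
        return item

# product_scraper/settings.py
ITEM_PIPELINES = {
    'product_scraper.pipelines.PriceCleanerPipeline': 300,  # lower numbers run first
}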
Scraping Multiple Pages
So far we've only scraped a single page. But usually you'll want to crawl and scrape many pages. There are a few ways to accomplish this with Scrapy.
The simplest approach is to just add multiple URLs to the spider's `start_urls` list. For example:
start_urls = [
    'https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/',
    'https://clever-lichterman-044f16.netlify.app/products/taba-facewash.1/',
    'https://clever-lichterman-044f16.netlify.app/products/taba-shampoo.1/',
]
However, this is tedious if you need to scrape a large number of pages. Instead, you'll usually want to crawl pages by following links.
Link Following
To recursively follow links, you can use Scrapy's `response.follow` method. For example, let's say you want to scrape all products listed on a category page. The spider would look something like:
import scrapy


class CategorySpider(scrapy.Spider):
    name = 'category_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/products/']

    def parse(self, response):
        # Extract every product link on the category listing page and follow it
        products = response.css('div.product-list a')
        for product_link in products:
            yield response.follow(product_link, callback=self.parse_product)

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('span.my-4::text').get(),
            'description': response.css('div.mb-5 p::text').get(),
        }
This spider:
- Starts at the category listing page
- Extracts all the product detail page links
- Follows each of those links and calls the `parse_product` method on the response
- Yields the scraped product data from `parse_product`
The real power of Scrapy is its ability to handle crawling for you through these kinds of spiders. You simply define your starting URLs and tell it which links to recursively follow.
The `CrawlSpider` class takes this a step further by allowing you to define crawling rules in a more declarative way using regular expressions. We won't cover it in depth here.
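To give you a flavor, though, here is a rough sketch of a CrawlSpider (the allow pattern and URLs are illustrative, not a real site):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ProductCrawlSpider(CrawlSpider):
    name = 'product_crawl_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/products/']

    # Follow every link whose URL matches /products/ and parse it as a product page
    rules = (
        Rule(LinkExtractor(allow=r'/products/'), callback='parse_product', follow=True),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('span.my-4::text').get(),
        }

Note that CrawlSpider reserves the `parse` method for its own link-following logic, so your callbacks need a different name (like `parse_product` here).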
Handling Pagination
Many listing pages implement pagination, so you‘ll need to follow the "next page" links to scrape all items.
Scrapy makes this straightforward by providing a shortcut for following pagination links:
def parse(self, response):
    products = response.css('div.product-list a')
    for product_link in products:
        yield response.follow(product_link, callback=self.parse_product)

    next_page = response.css('a.next-page::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
Here we:
- Extract and follow the product links as before
- Check for a "next page" link
- If there is one, follow it and call the `parse` method on the response to extract the products on that next page
Scrapy will continue following "next page" links until it doesn't find any more, at which point it will have scraped all the paginated items.
Handling Dynamic Content
One issue you may run into while scraping is content that is dynamically added to the page via JavaScript. Since Scrapy doesn't execute JavaScript by default, it won't see that dynamic content in the response HTML.
There are a few ways to handle this:
- See if you can get the data another way, such as looking for an underlying API endpoint that returns the data as JSON
- Use a headless browser like Puppeteer or Playwright to load the page, let the JavaScript execute, and then pass the rendered HTML to Scrapy
- Use the Splash JavaScript rendering service together with Scrapy-Splash to integrate JavaScript support into your spiders
Explaining all these approaches is beyond the scope of this article, but the key takeaway is that it's possible to scrape websites that use JavaScript; you just need to find a way to let that JavaScript execute, or fetch the underlying data directly, before handing the result to Scrapy.
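As an illustration of the first approach, suppose your browser's network tab reveals that a page loads its data from a JSON endpoint. The endpoint and response fields below are entirely hypothetical, but the spider pattern would look something like this:

import scrapy


class ApiSpider(scrapy.Spider):
    name = 'api_spider'
    # Hypothetical endpoint - use whatever URL you find in the network tab
    start_urls = ['https://example.com/api/products?page=1']

    def parse(self, response):
        data = response.json()  # parse the JSON body (Scrapy 2.2+)
        for product in data.get('products', []):
            yield {
                'name': product.get('name'),
                'price': product.get('price'),
            }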
Closing Thoughts
You should now have a good grasp of how Scrapy can help you productively build robust web scrapers. We've only just scratched the surface of what Scrapy can do, but the basic concepts of spiders, selectors, and response parsing will get you a long way.
Some additional topics to explore:
- Exporting to different formats like CSV, XML, and databases
- Using spider arguments to make your spiders reusable and customizable
- Leveraging Scrapy's built-in logging
- Feeding in data from files
- Pausing and resuming crawls
- Avoiding getting banned by websites
I encourage you to dive into the excellent official Scrapy documentation to continue your learning journey.
As you can see, Scrapy is an incredibly powerful tool for web scraping. While it has a bit of a learning curve compared to using plain Requests and BeautifulSoup, it's well worth the investment for any serious scraping project.
So what are you waiting for? Go forth and scrape (ethically)!