Web scraping is an incredibly useful technique for extracting data from websites. Whether you need to collect product info from an ecommerce store, grab the latest sports scores, or analyze sentiment about a topic across news sites – web scraping allows you to fetch and parse the info you need from webpages.
While there are many ways to scrape websites, using Node.js and the Cheerio NPM package is one of the simplest and most effective methods. In this guide, I'll dive into what Cheerio is, how it works, and provide a complete walkthrough of using it to scrape data from a site.
What is Cheerio?
Cheerio is an open source Node.js library that makes it easy to extract data from webpages using familiar jQuery-like syntax. It provides a simple way to parse the HTML of any webpage, select specific elements, and extract data from those elements.
Under the hood, Cheerio is built on top of the parse5 and htmlparser2 parsing engines. It doesn't actually render the HTML or run any JavaScript on the page. Instead, it parses the raw HTML string and builds its own in-memory DOM tree that you can traverse and manipulate. This makes Cheerio very fast and efficient compared to headless browser-based scraping.
Some key features of Cheerio include:
- Uses familiar jQuery syntax for selecting and manipulating elements
- Lightweight and fast since it doesn't render the HTML
- Easy to install from NPM and use in Node.js scripts
- Parses messy/invalid HTML
- Works with HTML from any source, since it only parses strings (fetching and decoding are handled by your HTTP client)
- Great for scraping data to save to databases, files, APIs, etc.
How to Use Cheerio for Web Scraping
Now that we know what Cheerio is, let's walk through a practical example of using it to scrape a webpage. We'll fetch the HTML of a site, select some elements, extract their data, and save it to a JSON file.
Step 1 – Install Cheerio
First, make sure you have Node.js and NPM installed. Then create a new directory for your project and initialize it:
mkdir cheerio-scraper
cd cheerio-scraper
npm init -y
Next, install Cheerio and Axios (which we'll use to fetch the webpage HTML):
npm install cheerio axios
Step 2 – Fetch the HTML
We'll use Axios to grab the HTML of the webpage we want to scrape. Create a file called `scraper.js` and add this code:
const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://en.wikipedia.org/wiki/List_of_tallest_buildings';

axios(url)
  .then(response => {
    const html = response.data;
    const $ = cheerio.load(html);
    // We'll extract data here later
  })
  .catch(console.error);
This sends a GET request to the Wikipedia page listing the world's tallest buildings and loads the returned HTML into a Cheerio instance so we can parse it.
Step 3 – Select Elements and Extract Data
Cheerio provides jQuery-like methods for selecting elements from the parsed HTML document. We can use element selectors, classes, IDs, attributes, and more to pinpoint the elements we want.
Let's find the table of tallest buildings and extract the rank, name, height, and year data from each row:
const buildingsData = [];

$('table.wikitable tbody tr').each((i, el) => {
  const row = $(el).text().split('\n');
  if (row.length < 8) return; // skip header rows and short rows
  const rank = row[1].trim();
  const name = row[2].trim();
  const height = row[4].trim();
  const year = row[7].trim();
  buildingsData.push({
    rank,
    name,
    height,
    year
  });
});
Here we're finding the table body rows, splitting each row's text into cell values, and collecting them into objects that we push to the buildingsData array.
Cheerio has many methods for selecting and filtering elements, such as:
- `$(selector)` – selects elements matching a CSS selector
- `find(selector)` – finds descendant elements matching a selector
- `parent()`, `siblings()`, `children()` – traverse the DOM tree
- `filter(selector)` – filters the current selection
- `first()`, `last()` – gets the first or last element in the selection
- `eq(index)` – gets the element at a specific index
It also has methods for getting and setting data:
- `text()` – gets the text contents of the selected elements
- `html()` – gets the inner HTML of the first selected element
- `attr(name)` – gets the value of an attribute
- `data(name)` – gets the value of a data attribute
- `val()` – gets the value of an input element
Refer to the Cheerio docs for the full list of methods available.
Step 4 – Save the Extracted Data
Finally, let's save our extracted data to a JSON file:
const fs = require('fs');

fs.writeFile('buildings.json', JSON.stringify(buildingsData, null, 2), (err) => {
  if (err) {
    console.error(err);
    return;
  }
  console.log('Successfully scraped the data and saved it to buildings.json');
});
We use the built-in `fs` module to write the JSON stringified `buildingsData` to a file called `buildings.json`.
To run our scraper, use the command:
node scraper.js
And that's it! You should see the JSON data saved in the buildings.json file.
Tips and Best Practices
Here are some tips to keep in mind when scraping with Cheerio:
- Websites can change their layout and class names often. Your scrapers may break and need updating occasionally.
- Respect website terms of service and robots.txt. Don't scrape too aggressively or you could get blocked.
- Handle errors and edge cases gracefully. Use try/catch and validate data before saving.
- Use concurrency and throttling when scraping multiple pages to avoid overloading servers.
- Cache pages when possible to reduce hits to websites and speed up your scraping.
- Rotate user agents and IP addresses if doing large scale scraping.
- Use sessions and cookies if you need to log in or maintain state across requests.
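As one way to apply the throttling tip above, here is a minimal sketch that fetches pages one at a time with a fixed pause between requests. The `fetchPage` parameter is injected (the names here are illustrative), so you can pass in an Axios-based fetcher or anything else:

```javascript
// Minimal throttling sketch: sequential requests with a polite delay.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function scrapeAll(urls, fetchPage, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url)); // one request at a time
    await sleep(delayMs);               // pause before the next hit
  }
  return results;
}
```

With Axios, `fetchPage` could be `url => axios(url).then(r => r.data)`; for higher throughput you could extend this into a small worker pool instead of a strictly sequential loop.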
Limitations of Cheerio
While Cheerio is great for many scraping needs, it has some limitations:
- It doesn't render JavaScript, so it won't work on Single Page Apps or pages that load content dynamically after the initial HTML loads.
- It doesn't handle navigation or user interactions. You can't click buttons, fill out forms, etc.
For scraping more complex websites that rely heavily on JavaScript, you'd need to use a headless browser-based solution like Puppeteer instead.
Conclusion
Cheerio is a powerful and easy-to-use library for scraping data from websites using Node.js. It lets you quickly fetch a webpage's HTML, parse it, and extract the data you need using familiar jQuery syntax.
I hope this guide has helped you understand how Cheerio works and how to put it to use in your own web scraping projects. Scrape on!