Scraping Websites Using the Cheerio NPM Package in Node.js

Web scraping is an incredibly useful technique for extracting data from websites. Whether you need to collect product info from an ecommerce store, grab the latest sports scores, or analyze sentiment about a topic across news sites – web scraping allows you to fetch and parse the info you need from webpages.

While there are many ways to scrape websites, using Node.js and the Cheerio NPM package is one of the simplest and most effective methods. In this guide, I'll dive into what Cheerio is, how it works, and provide a complete walkthrough of using it to scrape data from a site.

What is Cheerio?

Cheerio is an open source Node.js library that makes it easy to extract data from webpages using familiar jQuery-like syntax. It provides a simple way to fetch the HTML of any webpage, parse it, select specific elements, and extract data from those elements.

Under the hood, Cheerio is built on top of the parse5 and htmlparser2 parsing libraries. It doesn't actually render the HTML or run any JavaScript on the page. Instead, it parses the raw HTML string and builds its own in-memory DOM tree that you can traverse and manipulate. This makes Cheerio very fast and efficient compared to headless browser-based scraping.

Some key features of Cheerio include:

  • Uses familiar jQuery syntax for selecting and manipulating elements
  • Lightweight and fast since it doesn't render the page or execute scripts
  • Easy to install from NPM and use in Node.js scripts
  • Parses messy/invalid HTML gracefully
  • Implements a consistent subset of the jQuery API
  • Great for scraping data to save to databases, files, APIs, etc.

How to Use Cheerio for Web Scraping

Now that we know what Cheerio is, let's walk through a practical example of using it to scrape a webpage. We'll fetch the HTML of a site, select some elements, extract their data, and save it to a JSON file.

Step 1 – Install Cheerio

First, make sure you have Node.js and NPM installed. Then create a new directory for your project and initialize it:

mkdir cheerio-scraper
cd cheerio-scraper
npm init -y

Next, install Cheerio and Axios (which we'll use to fetch the webpage HTML):

npm install cheerio axios

Step 2 – Fetch the HTML

We'll use Axios to grab the HTML of the webpage we want to scrape. Create a file called `scraper.js` and add this code:

const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://en.wikipedia.org/wiki/List_of_tallest_buildings';

axios.get(url)
  .then(response => {
    const html = response.data;
    const $ = cheerio.load(html);
    // We'll extract data here later
  })
  .catch(console.error);

This sends a GET request to the Wikipedia page listing the world's tallest buildings and loads the returned HTML into a Cheerio instance so we can parse it.

Step 3 – Select Elements and Extract Data

Cheerio provides jQuery-like methods for selecting elements from the parsed HTML document. We can use element selectors, classes, IDs, attributes, and more to pinpoint the elements we want.

Let's find the table of tallest buildings and extract the rank, name, height, and year data from each row:

const buildingsData = [];

$('table.wikitable tbody tr').each((i, el) => {
  const cells = $(el).find('td');
  if (cells.length < 7) return; // skip the header row and any malformed rows

  // Note: column positions depend on the current layout of the Wikipedia table
  buildingsData.push({
    rank: cells.eq(0).text().trim(),
    name: cells.eq(1).text().trim(),
    height: cells.eq(3).text().trim(),
    year: cells.eq(6).text().trim(),
  });
});

Here we loop over the table's body rows, pull the text out of the cells we need, and push an object with each building's data onto the buildingsData array.

Cheerio has many methods for selecting and filtering elements, such as:

  • $(selector) – selects elements matching a CSS selector
  • find(selector) – finds descendant elements matching a selector
  • parent(), siblings(), children() – traverses the DOM tree
  • filter(selector) – filters the current selection
  • first(), last() – gets the first or last element in the selection
  • eq(index) – gets the element at a specific index

It also has methods for getting and setting data:

  • text() – gets the text contents of the selected elements
  • html() – gets the inner HTML of the first selected element
  • attr(name) – gets the value of an attribute
  • data(name) – gets the value of a data attribute
  • val() – gets the value of an input element

Refer to the Cheerio docs for the full list of methods available.

Step 4 – Save the Extracted Data

Finally, let's save our extracted data to a JSON file:

const fs = require('fs');

fs.writeFile('buildings.json', JSON.stringify(buildingsData, null, 2), (err) => {
  if (err) {
    console.error(err);
    return;
  }
  console.log('Successfully scraped the data and saved it to buildings.json');
});

We use the built-in fs module to write the JSON stringified buildingsData to a file called buildings.json.

To run our scraper, use the command:

node scraper.js

And that's it! You should see the JSON data saved in the buildings.json file.

Tips and Best Practices

Here are some tips to keep in mind when scraping with Cheerio:

  • Websites can change their layout and class names often. Your scrapers may break and need updating occasionally.
  • Respect website terms of service and robots.txt. Don't scrape too aggressively or you could get blocked.
  • Handle errors and edge cases gracefully. Use try/catch and validate data before saving.
  • Use concurrency and throttling when scraping multiple pages to avoid overloading servers.
  • Cache pages when possible to reduce hits to websites and speed up your scraping.
  • Rotate user agents and IP addresses if doing large scale scraping.
  • Use sessions and cookies if you need to log in or maintain state across requests.
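As one example of throttling, you can simply pause between sequential requests (a sketch; `urls`, the delay, and the per-page work are placeholders you would adapt):

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeAll(urls, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    // ...fetch and parse each page here (e.g. with axios + cheerio)...
    results.push(url);
    await sleep(delayMs); // pause between requests to be polite
  }
  return results;
}
```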

Limitations of Cheerio

While Cheerio is great for many scraping needs, it has some limitations:

  • It doesn't render JavaScript, so it won't work on Single Page Apps or pages that load content dynamically after the initial HTML loads.
  • It doesn't handle navigation or user interactions. You can't click buttons, fill out forms, etc.

For scraping more complex websites that rely heavily on JavaScript, you'd need to use a headless browser-based solution like Puppeteer instead.

Conclusion

Cheerio is a powerful and easy-to-use library for scraping data from websites using Node.js. It lets you quickly fetch a webpage's HTML, parse it, and extract the data you need using familiar jQuery syntax.

I hope this guide has helped you understand how Cheerio works and how to put it to use in your own web scraping projects. Scrape on!
