Web scraping is an increasingly important skill for developers to extract valuable data from websites. According to a recent survey, 57% of data scientists and analysts use web scraping regularly in their work, and the web scraping software market is projected to reach $5.6 billion by 2027.
While it's possible to build scrapers from scratch using NodeJS libraries like Puppeteer or Cheerio, this can quickly become complex and time-consuming. Using an SDK like ScrapingBee's can greatly streamline the scraping process and provide powerful features out of the box.
In this tutorial, we'll walk through how to get started with ScrapingBee's NodeJS SDK and share tips to take your scraping projects to the next level.
Why Use an SDK for Web Scraping?
Scraping modern websites involves many potential roadblocks that can be tricky to handle on your own:
- Many sites render content dynamically with JavaScript, which requires running a headless browser
- Servers may block requests from suspicious IP addresses or rate limit requests
- Sites frequently change their HTML structure, breaking brittle scrapers
Using an SDK like ScrapingBee can solve these issues by providing:
- A simple, high-level API to manage scraping jobs
- Built-in JavaScript rendering using a headless Chrome browser
- Access to a huge pool of datacenter and residential proxies to avoid IP blocking
- Smart routing and request management to prevent overloading sites with requests
- Additional features like geotargeting, screenshot capture, and scheduled recurring jobs
Installing the ScrapingBee SDK
Getting started with ScrapingBee in NodeJS is a breeze. First, make sure you have a recent version of Node and npm installed. Then, from your project directory, install the SDK with:
npm install scrapingbee
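A global install is also possible, although code that calls require('scrapingbee') normally needs the package installed in the project itself: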
npm install -g scrapingbee
Making Your First Request
Once you have the SDK installed, you'll need an API key to authenticate your requests. Sign up for a free ScrapingBee account to get an API key. The free plan includes 1,000 credits per month, and paid plans start at just $29/month for 100k credits.
Here's a minimal example of how to use the SDK to scrape a webpage:
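Hard-coding the key in a source file makes it easy to leak by accident. One common pattern, sketched here, reads it from an environment variable instead. The variable name SCRAPINGBEE_API_KEY and the helper getApiKey are assumptions for illustration; the SDK itself only needs the key passed to the client constructor.

```javascript
// Hypothetical helper: read the API key from the environment instead of source code.
// SCRAPINGBEE_API_KEY is an assumed variable name, not one the SDK looks up itself.
function getApiKey(env = process.env) {
  const key = (env.SCRAPINGBEE_API_KEY || '').trim();
  if (!key) {
    throw new Error('Set the SCRAPINGBEE_API_KEY environment variable first');
  }
  return key;
}
```

With the variable exported in your shell, the client can then be created with new scrapingbee.ScrapingBeeClient(getApiKey()).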
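const scrapingbee = require('scrapingbee');

async function scrapeWebPage(url) {
  // Replace the placeholder with your own API key.
  const client = new scrapingbee.ScrapingBeeClient('YOUR_API_KEY');
  const response = await client.get({
    url: url,
    params: {
      render_js: true // render the page in a headless browser first
    }
  });
  // response.data holds raw bytes, so decode it to get the HTML as text.
  const html = new TextDecoder().decode(response.data);
  console.log(html);
}

scrapeWebPage('https://example.com').catch((err) => console.error(err.message));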
This will retrieve the HTML content of the page at https://example.com, rendering any dynamic content.
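Requests can fail, for instance with an invalid key or a target site that times out. The sketch below turns a failure into a readable message. It assumes the SDK rejects with an axios-style error, where err.response carries the HTTP status and body when the server replied; describeError and bodyToText are illustrative names, not part of the SDK.

```javascript
// Convert a response body to text. The body may be a string, raw bytes, or parsed JSON.
function bodyToText(data) {
  if (typeof data === 'string') return data;
  if (data instanceof ArrayBuffer || ArrayBuffer.isView(data)) {
    return new TextDecoder().decode(data);
  }
  return JSON.stringify(data);
}

// Summarise a failed request. Assumes an axios-style error object, where
// err.response is present only when the server actually replied.
function describeError(err) {
  if (err && err.response) {
    const { status, data } = err.response;
    return `Request failed with HTTP ${status}: ${bodyToText(data)}`;
  }
  // No response at all: DNS failure, timeout, malformed URL, and so on.
  return `Request failed before a response arrived: ${err && err.message}`;
}
```

A call such as scrapeWebPage(url).catch((err) => console.error(describeError(err))) would then log either the HTTP status and body or the underlying network error.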
Parsing and Extracting Data
In most cases, you'll want to extract structured data from the scraped HTML. ScrapingBee provides an extract_rules parameter that allows you to specify CSS selectors to pull out pieces of content:
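const response = await client.get({
  url: 'https://news.ycombinator.com',
  params: {
    extract_rules: {
      articles: {
        selector: '.athing',
        type: 'list', // return every match, not just the first
        output: {
          // Selectors depend on the site's current markup; verify them in the inspector.
          title: '.titleline > a',
          url: { selector: '.titleline > a', output: '@href' },
          rank: '.rank'
        }
      }
    }
  }
});
// The extracted data arrives as JSON bytes: decode, then parse.
const data = JSON.parse(new TextDecoder().decode(response.data));
console.log(data.articles);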
This will return an array of articles from the Hacker News homepage, with the title, URL, and rank of each one. You can use tools like Chrome's inspector to figure out the correct selectors for the data you want to scrape.
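Extracted values come back as text, so a little clean-up usually follows. The sketch below trims titles, resolves relative links, and turns a rank such as "1." into a number. The helper name normalizeArticle, the default base URL, and the sample records are made up for illustration; they are not real scraped output.

```javascript
// Clean up one extracted record. Field names follow the extract rules above;
// the default base URL is an assumption used to resolve relative links.
function normalizeArticle(raw, baseUrl = 'https://news.ycombinator.com/') {
  return {
    title: (raw.title || '').trim(),
    // new URL() resolves relative hrefs such as "item?id=1" against baseUrl.
    url: raw.url ? new URL(raw.url, baseUrl).href : null,
    // Ranks are scraped as text like "1."; keep only the digits.
    rank: Number.parseInt(String(raw.rank || '').replace(/\D/g, ''), 10) || null,
  };
}

// Made-up sample in the shape the rules are expected to produce.
const sample = [
  { title: ' Example story ', url: 'item?id=1', rank: '1.' },
  { title: 'Another story', url: 'https://example.com/post', rank: '2.' },
];

// Wrap the call so map's index argument is not mistaken for baseUrl.
console.log(sample.map((article) => normalizeArticle(article)));
```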
Best Practices for Responsible Scraping
When scraping websites, it's important to be a good citizen and follow best practices:
- Respect robots.txt files that specify pages that should not be scraped
- Limit your request rate to avoid overloading servers
- Consider scraping during off-peak hours for the website
- Don't republish content without permission or try to pass off scraped content as your own
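The robots.txt check can be automated before a crawl. The sketch below is deliberately simplified: it only reads Disallow lines in groups addressed to all crawlers (User-agent: *) and treats each value as a path prefix. It ignores Allow rules, wildcard patterns, and crawler-specific groups, all of which a full parser handles. The function names are ours, for illustration.

```javascript
// Collect Disallow path prefixes from groups that apply to all crawlers ("*").
function disallowedPrefixes(robotsTxt) {
  const prefixes = [];
  let applies = false;      // does the current group apply to "*"?
  let prevWasAgent = false; // consecutive User-agent lines share one group
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const colon = line.indexOf(':');
    if (colon === -1) continue; // blank or malformed line
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === 'user-agent') {
      const matches = value === '*';
      applies = prevWasAgent ? applies || matches : matches;
      prevWasAgent = true;
    } else {
      prevWasAgent = false;
      // An empty Disallow value means "nothing is disallowed".
      if (applies && field === 'disallow' && value) prefixes.push(value);
    }
  }
  return prefixes;
}

// True when no collected prefix matches the start of the path.
function isPathAllowed(robotsTxt, path) {
  return !disallowedPrefixes(robotsTxt).some((prefix) => path.startsWith(prefix));
}
```

For production crawls, a maintained robots.txt parser is a safer choice than this prefix check.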
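Note that ScrapingBee's wait parameter pauses the headless browser for a fixed number of milliseconds before the page is returned. That helps with slow-loading content, but it does not space out your requests. To limit overall throughput, throttle on the client side, for example by adding a delay between calls, and check the API reference before relying on a parameter such as max_jobs_per_minute.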
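One way to space requests out on the client side is to fetch pages one at a time with a pause in between. This is a minimal sketch: sleep and scrapeSequentially are illustrative names, the one-second default is arbitrary, and client is assumed to be any object with a get({ url, params }) method that returns a promise, such as a ScrapingBeeClient.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch URLs one at a time, pausing between calls so the target site
// never sees a burst of requests from this script.
async function scrapeSequentially(client, urls, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no pause before the first request
    results.push(await client.get({ url: urls[i], params: {} }));
  }
  return results;
}
```

Because the client is passed in, the function can be exercised with a stub object during development without spending API credits.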
Real-World Usage and Case Studies
Web scraping has a wide variety of business and research applications. Here are a few real examples of companies using ScrapingBee:
- PriceBeam uses ScrapingBee to monitor their clients' products across multiple e-commerce sites and track pricing changes and inventory levels.
- LeadBoxer scrapes contact and social media data to enrich their B2B lead generation service.
- Zeotap, a data management platform, uses ScrapingBee to gather identity data from public sources to build marketing user profiles.
Learn More
This tutorial covers the basics of web scraping with ScrapingBee's NodeJS SDK, but there's much more you can do! Some additional things to check out:
- The ScrapingBee API Reference for the full range of parameters and options
- The official NodeJS SDK GitHub repo
- This in-depth guide to web scraping with NodeJS
- The ScrapingBee Blog for tutorials, case studies, and the latest news in the web scraping world
Happy scraping!