How to Use Proxies with Node-Fetch for Web Scraping in 2023

Web scraping, the process of programmatically extracting data from websites, has become an essential tool for many businesses and researchers. It enables gathering large amounts of publicly available data to gain insights, monitor competitors, or build new applications.

However, many websites employ measures to detect and block web scrapers. According to a 2020 study by Imperva, 18% of all website traffic comes from "bad bots" like scrapers and crawlers. As a result, over a quarter of websites now use some form of bot detection.

One of the most common techniques for avoiding detection is to use proxy servers. A proxy acts as an intermediary between your scraper and the target website, forwarding requests through an alternate IP address. This helps conceal your scraper's true identity.

Types of Proxies for Web Scraping

Not all proxies are suitable for web scraping. Here are the main types of proxies and their characteristics:

Proxy Type   | IP Ownership   | IP Diversity | Success Rate | Cost
Data Center  | Proxy provider | Low          | Low          | $
Residential  | Real users     | High         | High         | $$$
Mobile       | 4G/5G users    | Very high    | Very high    | $$$$

Data center proxies are the most common and cheapest type. They use IP addresses owned by the proxy provider, usually from cloud hosting data centers. The main drawback is these IPs are easily identified and blocked by websites.

Residential proxies use IP addresses assigned to real home users by their ISP. These are much harder to detect as scrapers, but are more expensive and raise some ethical concerns around consent.

Mobile proxies route requests through real 4G and 5G mobile devices. They have the highest success rates, as mobile IPs are rarely blocked. However, they are very expensive and have limited bandwidth.

In a 2022 benchmark of residential proxy providers, the top performers had success rates over 90% with average response times under 3 seconds. Choosing a reputable provider is crucial for successful scraping.

Using Proxies with Node-Fetch

Node-fetch is a popular library for making HTTP requests in Node.js, with over 20 million weekly downloads. By default, it does not support proxies. However, we can add proxy support using the https-proxy-agent package.

Here's a basic example of using a proxy with node-fetch:

// node-fetch v2 (CommonJS); node-fetch v3 is ESM-only and must be imported
const fetch = require('node-fetch');
// https-proxy-agent v5 style; in v7+ use the named export instead:
// const { HttpsProxyAgent } = require('https-proxy-agent');
const HttpsProxyAgent = require('https-proxy-agent');

// The proxy URL can include username:password credentials
const proxyAgent = new HttpsProxyAgent('http://user:pass@123.45.67.89:8080');

// api.ipify.org returns the IP the request arrived from,
// so this prints the proxy's IP rather than your own
fetch('https://api.ipify.org/?format=json', { agent: proxyAgent })
    .then(res => res.json())
    .then(data => console.log(data.ip))
    .catch(err => console.error('Proxy request failed:', err));

In this example, we create an HttpsProxyAgent with the proxy URL, which may include authentication. We then pass the agent to the fetch options. The printed IP address will be that of the proxy.

For rotating proxies, which provide a different IP on each request, you would typically connect to a rotating gateway endpoint supplied by your proxy provider rather than a single fixed IP address. For example:

const proxyAgent = new HttpsProxyAgent('http://user:pass@proxy.example.com:8080');
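If your provider instead gives you a fixed list of proxy endpoints, you can rotate them yourself on the client side. The sketch below picks a random proxy per request; the proxy URLs are placeholders, and random selection is just one possible strategy (round-robin works too):

const fetch = require('node-fetch');
const HttpsProxyAgent = require('https-proxy-agent'); // use { HttpsProxyAgent } in v7+

// Placeholder proxy URLs -- substitute the endpoints from your provider
const proxyUrls = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
];

// Pick a random proxy agent for each request
const randomAgent = () =>
    new HttpsProxyAgent(proxyUrls[Math.floor(Math.random() * proxyUrls.length)]);

async function fetchThroughPool(url) {
    const res = await fetch(url, { agent: randomAgent() });
    return res.json();
}

fetchThroughPool('https://api.ipify.org/?format=json')
    .then(data => console.log(data.ip))
    .catch(err => console.error('Request failed:', err));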

Ethical and Legal Considerations

While web scraping itself is legal in most jurisdictions, some methods used to circumvent blocking may fall into a legal gray area. Before scraping any website, carefully review their terms of service and robots.txt file.
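As a starting point, the short sketch below fetches a site's robots.txt so you can review it before scraping. The example.com URL is a placeholder, and in a real project you would parse the directives (for instance with a library such as robots-parser) rather than just printing them:

const fetch = require('node-fetch');

// Placeholder target site -- review its robots.txt before scraping
const site = 'https://example.com';

fetch(new URL('/robots.txt', site).href)
    .then(res => (res.ok ? res.text() : ''))
    .then(rules => {
        // Check the Disallow rules for the paths you intend to scrape
        console.log(rules || 'No robots.txt found');
    })
    .catch(err => console.error('Could not fetch robots.txt:', err));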

Using residential proxies sourced without user consent is generally considered unethical. Some providers may engage in deceptive practices to obtain residential IPs. Always research your proxy provider and favor those with transparent sourcing.

Avoid scraping websites with sensitive personal data, copyrighted material, or critical infrastructure. Respect user privacy and intellectual property rights.

Beyond Proxies: Avoiding Detection

While proxies are an important tool, they are often not sufficient on their own to avoid blocking. Websites use various techniques to detect scrapers, such as:

  • Checking request headers and user agent
  • Tracking request patterns and rate limits
  • Serving special content to suspected bots
  • Using browser fingerprinting to detect headless browsers

To improve your scraper's success rate, consider the following tips:

  1. Rotate user agents and headers – Use a pool of real user agents and vary headers like Accept-Language to mimic real user requests (see the sketch after this list).

  2. Add random delays – Inserting random pauses between requests makes your scraper traffic look more human and avoids exceeding rate limits.

  3. Use headless browsers – For complex, JavaScript-heavy sites, using a real browser environment like Puppeteer or Playwright can help render dynamic content and pass browser checks.

  4. Avoid honeypot traps – Some websites create hidden links to detect and block bots that interact with them. Inspect the page source and avoid interacting with suspicious elements.
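
To make tips 1 and 2 concrete, here is a minimal sketch that rotates user agents and inserts a random delay between requests. The user agent strings, target URLs, and 2–5 second delay range are illustrative placeholders:

const fetch = require('node-fetch');

// Small pool of example user agent strings (use real, current ones in practice)
const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
];

const randomItem = arr => arr[Math.floor(Math.random() * arr.length)];
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function scrape(urls) {
    for (const url of urls) {
        const res = await fetch(url, {
            headers: {
                'User-Agent': randomItem(userAgents),
                'Accept-Language': 'en-US,en;q=0.9',
            },
        });
        console.log(url, res.status);

        // Random pause of 2-5 seconds between requests
        await sleep(2000 + Math.random() * 3000);
    }
}

scrape(['https://example.com/page1', 'https://example.com/page2']);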

The Growing Web Scraping Market

The web scraping services market is expected to grow from $2.5 billion in 2022 to $6.1 billion by 2027, at a CAGR of 19.5%. This growth is driven by the increasing demand for data-driven insights across industries.

The rise of low-code and no-code scraping tools, as well as fully managed scraping APIs, is making web scraping more accessible to non-technical users. For example, ScrapingBee provides a simple API to scrape any web page, handling proxy rotation, CAPTCHAs, and JavaScript rendering behind the scenes.
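
For illustration, a request to a managed scraping API of this kind typically looks like the sketch below. The endpoint and parameters follow ScrapingBee's publicly documented API at the time of writing, but check the provider's current documentation before relying on them; YOUR_API_KEY is a placeholder:

const fetch = require('node-fetch');

// Placeholder API key -- supplied by the provider after signing up
const apiKey = 'YOUR_API_KEY';
const targetUrl = 'https://example.com';

// The provider handles proxy rotation and JavaScript rendering server-side
const apiUrl = 'https://app.scrapingbee.com/api/v1/'
    + '?api_key=' + encodeURIComponent(apiKey)
    + '&url=' + encodeURIComponent(targetUrl)
    + '&render_js=true';

fetch(apiUrl)
    .then(res => res.text())
    .then(html => console.log(html.slice(0, 200)))
    .catch(err => console.error('API request failed:', err));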

Conclusion

Proxies are an essential tool for web scraping at scale, allowing you to distribute requests and avoid IP-based blocking. When using node-fetch for scraping, the https-proxy-agent package provides an easy way to route requests through HTTP and HTTPS proxies.

However, proxy usage alone is not a silver bullet for avoiding detection. Scrapers must also mimic human behavior through techniques like user agent rotation, request pattern variation, and using headless browsers.

As the web scraping market continues to grow, we can expect to see more advanced scraping tools and services that handle the complexities of proxy management and bot detection avoidance. For businesses and researchers, investing in reliable proxy infrastructure and staying up-to-date with scraping best practices will be key to gathering valuable web data at scale.
