How to Use Proxies with Ruby and Faraday for Web Scraping

If you do any kind of web scraping, you know how important proxies are. They allow you to mask your real IP address, avoid rate limiting and IP bans, and collect data at scale. With the popular Faraday HTTP client gem, you can integrate proxies into your Ruby web scraping pipeline in a few lines of code.

In this tutorial, we'll walk through the main ways to configure proxies with Faraday and share best practices for using them effectively. We'll cover:

  • Using basic HTTP/HTTPS proxies
  • Authenticating with username/password
  • Setting proxies via environment variables
  • Implementing proxy rotation
  • Handling errors and retrying requests
  • Proxy services that work well with Faraday

By the end, you'll be able to supercharge your web scraping projects with proxies. Let's get started!

Why Faraday?

Faraday is a lightweight, flexible HTTP client library for Ruby. It uses a modular middleware architecture, allowing you to easily customize requests and swap out adapters. Some key benefits:

  • Simple, intuitive API
  • Supports multiple HTTP adapters (Net::HTTP, Excon, Typhoeus, etc.)
  • Extensible with custom middleware
  • Handles request/response encoding, headers, params seamlessly
  • Well-maintained and battle-tested

While Ruby has other great HTTP libraries like HTTParty and Rest-Client, Faraday's adaptability makes it ideal for advanced usage with proxies.

Setting up a basic HTTP/HTTPS proxy

The quickest way to add a proxy to Faraday is passing in the :proxy option when initializing a connection:

require 'faraday'

proxy = 'http://12.34.56.78:8080'
conn = Faraday.new('http://httpbin.org', proxy: proxy)

resp = conn.get 'ip'
puts resp.body
# prints the IP address httpbin sees, which should be the proxy's rather than yours

This forwards requests through the specified proxy server. It couldn't get much easier!

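Faraday applies a single proxy to each connection: the :proxy option takes a URL string, a URI, or a hash with uri, user, and password keys, and to our knowledge it does not accept separate http and https keys. To send HTTP and HTTPS traffic through different proxies, pick the proxy by the scheme of the target URL:

proxies = {
  'http'  => 'http://12.34.56.78:8080',
  'https' => 'http://12.34.56.79:8080'
}

url = 'https://httpbin.org'
conn = Faraday.new(url, proxy: proxies[URI(url).scheme])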

Authenticating with username/password

Many proxy servers require authentication to prevent unauthorized access. With Faraday, you can embed the credentials directly in the proxy URL:

proxy_url = 'http://username:password@12.34.56.78:8080'
conn = Faraday.new('http://httpbin.org', proxy: proxy_url)

Alternatively, use the user and password keys:

proxy = {
  uri: 'http://12.34.56.78:8080',
  user: 'username',
  password: 'password'
}

conn = Faraday.new('http://httpbin.org', proxy: proxy)

Make sure to keep your proxy credentials secure and never commit them to version control.
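
One way to keep credentials out of the codebase is to read them from the environment at runtime. The sketch below assumes three variable names of our own choosing (PROXY_URL, PROXY_USER, PROXY_PASSWORD), which are not anything Faraday defines, and builds the hash form of the :proxy option shown above.

```ruby
# Builds a hash for Faraday's :proxy option from environment variables.
# PROXY_URL, PROXY_USER and PROXY_PASSWORD are hypothetical names chosen for
# this example; Faraday itself does not read them.
def proxy_options_from_env(env = ENV)
  uri = env['PROXY_URL']
  return nil if uri.nil? || uri.empty? # no proxy configured

  options = { uri: uri }
  user = env['PROXY_USER']
  unless user.nil? || user.empty?
    options[:user] = user
    options[:password] = env['PROXY_PASSWORD']
  end
  options
end

# Example with an explicit hash standing in for ENV:
sample_env = {
  'PROXY_URL' => 'http://12.34.56.78:8080',
  'PROXY_USER' => 'username',
  'PROXY_PASSWORD' => 'password'
}
options = proxy_options_from_env(sample_env)
# options can then be passed along as Faraday.new('http://httpbin.org', proxy: options)
```

Calling proxy_options_from_env with no argument reads the real environment, so the credentials live in your shell or deployment configuration rather than in the repository.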

Setting proxy via environment variables

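Faraday respects the standard proxy environment variables http_proxy, https_proxy, and no_proxy. The uppercase HTTPS_PROXY and NO_PROXY are honored too, but lowercase is the safer choice for http_proxy, since Ruby's URI library discourages the uppercase HTTP_PROXY. If set, these are used as the default proxy configuration.

export http_proxy=http://12.34.56.78:8080
export https_proxy=http://12.34.56.79:8080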

Then in your Ruby code, initialize Faraday without any proxy options:

conn = Faraday.new('http://httpbin.org')

The connection will automatically use the proxy specified in the environment variable. This is very handy for specifying different proxies per environment.

You can also disable proxying for certain hosts using NO_PROXY:

export NO_PROXY=localhost,127.0.0.1,.example.com
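
To check which proxy a given URL would end up with, Ruby's standard library offers URI#find_proxy, which applies these environment-variable rules (and which, to our understanding, Faraday also relies on when reading them). The sketch below passes a plain hash in place of ENV so the result does not depend on your shell. effective_proxy is a helper name made up for this example, and the targets are IP literals from documentation ranges so that no DNS lookup is involved.

```ruby
require 'uri'

# Returns the proxy URL (as a string) that the environment rules select for
# target_url, or nil when the request would go direct.
# effective_proxy is a hypothetical helper written for this example.
def effective_proxy(target_url, env)
  proxy = URI(target_url).find_proxy(env)
  proxy && proxy.to_s
end

sample_env = {
  'http_proxy'  => 'http://12.34.56.78:8080',
  'https_proxy' => 'http://12.34.56.79:8080',
  'no_proxy'    => '192.0.2.10'
}

proxied_http  = effective_proxy('http://198.51.100.7/', sample_env)  # uses http_proxy
proxied_https = effective_proxy('https://198.51.100.7/', sample_env) # uses https_proxy
bypassed      = effective_proxy('http://192.0.2.10/', sample_env)    # nil: listed in no_proxy
```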

Rotating proxies

To minimize the risk of getting blocked while scraping, it's a good idea to spread out your requests across multiple proxy servers. You can implement simple round-robin proxy rotation in Faraday like this:

proxies = [
  'http://12.34.56.78:8080',
  'http://23.45.67.89:8080',
  'http://34.56.78.90:8080'
]

# The proxy is fixed per connection, so build one connection per proxy
connections = proxies.map do |proxy|
  Faraday.new('http://httpbin.org', proxy: proxy)
end

connections.cycle do |conn|
  resp = conn.get('ip')
  puts resp.body
  sleep(1) # wait between requests
end

Each request will use the next proxy in the list. Once it reaches the end, it will loop back to the first one. Because cycle never terminates on its own, break out of the loop once your work is done.
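
If requests are issued from several places (or threads) rather than from one loop, it can help to wrap the rotation in a small object. The sketch below is one possible shape, not a Faraday feature: ProxyRotator is a name invented here, and it only hands out proxy URLs, leaving connection setup to the caller.

```ruby
# A minimal round-robin rotator. ProxyRotator is a hypothetical helper class,
# not part of Faraday.
class ProxyRotator
  def initialize(proxies)
    raise ArgumentError, 'proxy list is empty' if proxies.empty?

    @proxies = proxies.dup.freeze
    @index = 0
    @lock = Mutex.new # lets several threads share one rotator
  end

  # Returns the next proxy URL, wrapping around at the end of the list.
  def next_proxy
    @lock.synchronize do
      proxy = @proxies[@index]
      @index = (@index + 1) % @proxies.size
      proxy
    end
  end
end

rotator = ProxyRotator.new([
  'http://12.34.56.78:8080',
  'http://23.45.67.89:8080',
  'http://34.56.78.90:8080'
])
picks = Array.new(4) { rotator.next_proxy }
# picks holds the three proxies in order, followed by the first one again
```

Each value returned by rotator.next_proxy can be passed as the proxy: option to Faraday.new to build the connection for the next request.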

You can also weight proxies based on their reliability or performance:

proxies = [
  { url: 'http://fast_proxy.com', weight: 3 },
  { url: 'http://slow_proxy.com', weight: 1 }
]

# One connection per proxy, repeated in the list according to its weight
weighted_conns = proxies.flat_map do |p|
  [Faraday.new('http://httpbin.org', proxy: p[:url])] * p[:weight]
end

weighted_conns.cycle do |conn|
  resp = conn.get('ip')
  sleep(1)
end

This will use fast_proxy 3 times as often as slow_proxy. Fiddle with the weightings to find the optimal ratio for your needs.
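
Cycling through an expanded list sends several consecutive requests through the same proxy. An alternative is to pick a proxy at random on each request, with probability proportional to its weight. The following is a sketch of that idea: pick_weighted is a made-up helper, and the rng parameter exists only so that the choice can be made reproducible.

```ruby
# Picks one proxy URL at random, with probability proportional to :weight.
# pick_weighted is a hypothetical helper; rng can be seeded for repeatable runs.
def pick_weighted(proxies, rng: Random.new)
  total = proxies.sum { |p| p[:weight] }
  raise ArgumentError, 'total weight must be positive' unless total.positive?

  target = rng.rand(total) # random value in 0...total
  proxies.each do |p|
    return p[:url] if target < p[:weight]

    target -= p[:weight] # skip past this proxy's share of the range
  end
end

proxies = [
  { url: 'http://fast_proxy.com', weight: 3 },
  { url: 'http://slow_proxy.com', weight: 1 }
]

rng = Random.new(42)
counts = Hash.new(0)
10_000.times { counts[pick_weighted(proxies, rng: rng)] += 1 }
# counts should come out close to a 3:1 split between the two URLs
```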

Handling errors and retrying

Proxies can be unreliable – they go down, get banned, or become unresponsive. To keep your scraper humming along, you need to gracefully handle errors and retry with a different proxy.

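Here's a robust pattern for retrying failed requests:

MAX_RETRIES = 3

# Runs the block, retrying on Faraday errors until MAX_RETRIES attempts have been made
def with_retry
  tries = 0
  begin
    yield
  rescue Faraday::Error
    tries += 1
    raise if tries >= MAX_RETRIES # out of attempts: re-raise the last error

    sleep(1)
    retry
  end
end

proxies = ['http://12.34.56.78:8080'] # ...

proxies.shuffle.each do |proxy|
  # The proxy is fixed per connection, so build a connection for each proxy
  conn = Faraday.new('http://example.com', proxy: proxy)
  begin
    resp = with_retry { conn.get('/') }
    break if resp.success?
  rescue Faraday::Error
    next # try next proxy
  end
end

This code tries each proxy up to MAX_RETRIES times before moving on to the next one, and shuffles the proxy list so that the order differs from run to run. Note that Faraday raises only on connection-level failures by default; an HTTP error status such as 403 or 503 comes back as a normal response, so resp.success? is false and the loop moves straight on to the next proxy.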

Adjust the number of retries and the sleep duration to match your proxy quality and target website's rate limits.
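
A fixed one-second pause is a reasonable default, but a delay that grows with each failed attempt is gentler on a struggling proxy or target site. The sketch below combines exponential backoff with proxy failover. It is an illustration under our own assumptions: fetch_with_failover, its keyword arguments, and the injectable sleeper are names invented for this example, and it rescues StandardError so that it runs without any gems (with Faraday you would likely narrow that to Faraday::Error).

```ruby
# Tries the block with each proxy in turn, retrying a failing proxy with
# exponentially growing pauses before moving on to the next one.
# All names here are hypothetical; nothing below is part of Faraday.
def fetch_with_failover(proxies, max_attempts: 3, base_delay: 1.0,
                        sleeper: ->(seconds) { sleep(seconds) })
  last_error = nil
  proxies.each do |proxy|
    max_attempts.times do |attempt|
      begin
        return yield(proxy)
      rescue StandardError => e # narrow to Faraday::Error in real use
        last_error = e
        # Pause base_delay, then 2x, 4x, ... but not after the final attempt
        sleeper.call(base_delay * (2**attempt)) if attempt < max_attempts - 1
      end
    end
  end
  raise last_error || ArgumentError.new('no proxies given')
end

# Demo with a stand-in for the HTTP call: the first proxy always fails.
delays = []
attempts = []
result = fetch_with_failover(
  ['http://12.34.56.78:8080', 'http://23.45.67.89:8080'],
  sleeper: ->(seconds) { delays << seconds } # record the pauses instead of sleeping
) do |proxy|
  attempts << proxy
  raise IOError, 'proxy down' if proxy.include?('12.34.56.78')

  "fetched via #{proxy}"
end
```

With Faraday, the block body would be the real request, for example building a connection with the given proxy and calling get on it. Passing a custom sleeper, as the demo does, also makes the retry logic easy to test without waiting.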

Proxy services

While you can source free proxies from various online lists, they tend to be slow and unreliable. For serious scraping, you're better off paying for dedicated proxies or a proxy management service.

Some reputable proxy providers that work with Faraday:

  • Smartproxy – backconnect and rotating datacenter proxies
  • Oxylabs – large pool of datacenter and residential proxies
  • Bright Data (formerly Luminati) – peer-to-peer residential proxy network with a simple REST API
  • ProxyMesh – worldwide network of anonymous proxies

Most of these services work out of the box with Faraday's basic proxy configuration. Some also provide gems that streamline proxy auth and rotation.

Before committing to a provider, test their proxies against your target sites to ensure they aren't blacklisted. Also consider the proxy location, bandwidth limits, and subnet diversity.

Conclusion

With its flexibility and strong proxy support, Faraday is one of the best tools for web scraping in Ruby. In this tutorial, we covered everything you need to know about configuring proxies with Faraday:

  • Setting a basic proxy URL
  • Authenticating with username/password
  • Using environment variables
  • Rotating proxies
  • Handling errors and retrying requests
  • Choosing a proxy service

By applying these techniques and best practices, you can keep your scrapers running smoothly and avoid detection.

Just remember – be respectful and limit your request rate. Only collect data that is publicly available and allowed by the site's terms of service. With great scraping power comes great responsibility!
