If you do any kind of web scraping, you know how important proxies are. They allow you to mask your real IP address, avoid rate limiting and IP bans, and collect data at scale. Paired with the popular Faraday HTTP client gem, you can easily integrate proxies into your Ruby web scraping pipeline.
In this tutorial, we'll walk through all the ways to configure proxies with Faraday and share best practices for using them effectively. We'll cover:
- Using basic HTTP/HTTPS proxies
- Authenticating with username/password
- Setting proxies via environment variables
- Implementing proxy rotation
- Handling errors and retrying requests
- Proxy services that work well with Faraday
By the end, you'll be able to supercharge your web scraping projects with proxies. Let's get started!
Why Faraday?
Faraday is a lightweight, flexible HTTP client library for Ruby. It uses a modular middleware architecture, allowing you to easily customize requests and swap out adapters. Some key benefits:
- Simple, intuitive API
- Supports multiple HTTP adapters (Net::HTTP, Excon, Typhoeus, etc.)
- Extensible with custom middleware
- Handles request/response encoding, headers, params seamlessly
- Well-maintained and battle-tested
While Ruby has other great HTTP libraries like HTTParty and Rest-Client, Faraday's adaptability makes it ideal for advanced usage with proxies.
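Before adding proxies, here is what a plain Faraday request looks like, so the later examples have something to build on. This is just a minimal sketch; httpbin.org is used only as a convenient test endpoint:

require 'faraday'

conn = Faraday.new('http://httpbin.org') do |f|
  f.response :raise_error            # raise Faraday::Error on 4xx/5xx responses
  f.adapter Faraday.default_adapter
end

# GET http://httpbin.org/get?q=ruby with a custom header
resp = conn.get('get', { q: 'ruby' }, { 'User-Agent' => 'my-scraper/1.0' })
puts resp.status
puts resp.body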
Setting up a basic HTTP/HTTPS proxy
The quickest way to add a proxy to Faraday is passing in the :proxy option when initializing a connection:
proxy = 'http://12.34.56.78:8080'

conn = Faraday.new('http://httpbin.org', proxy: proxy)

resp = conn.get('ip')
puts resp.body
# prints the proxy's IP address
This forwards requests through the specified proxy server. It couldn't get much easier!
The :proxy option also accepts a hash. Use the uri key for the proxy address (the user and password keys are covered in the next section):

proxy = {
  uri: 'http://12.34.56.78:8080'
}

conn = Faraday.new('https://httpbin.org', proxy: proxy)

Note that Faraday takes a single proxy per connection rather than separate entries for HTTP and HTTPS. If you need different proxies for plain HTTP and HTTPS traffic, use the HTTP_PROXY and HTTPS_PROXY environment variables covered later in this tutorial.
Authenticating with username/password
Many proxy servers require authentication to prevent unauthorized access. With Faraday, you can embed the credentials directly in the proxy URL:
proxy_url = 'http://username:password@12.34.56.78:8080'

conn = Faraday.new('http://httpbin.org', proxy: proxy_url)
Alternatively, add the user and password keys to the proxy hash:
proxy = {
  uri: 'http://12.34.56.78:8080',
  user: 'username',
  password: 'password'
}

conn = Faraday.new('http://httpbin.org', proxy: proxy)
Make sure to keep your proxy credentials secure and never commit them to version control.
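For example, you could read the credentials from environment variables at runtime instead of hard-coding them. This is just a sketch; the PROXY_HOST, PROXY_USER, and PROXY_PASS variable names are placeholders rather than anything Faraday itself recognizes:

# Build the proxy options from environment variables so no secrets live in the code.
# PROXY_HOST, PROXY_USER, and PROXY_PASS are hypothetical names for this example.
proxy = {
  uri: ENV.fetch('PROXY_HOST'),      # e.g. 'http://12.34.56.78:8080'
  user: ENV.fetch('PROXY_USER'),
  password: ENV.fetch('PROXY_PASS')
}

conn = Faraday.new('http://httpbin.org', proxy: proxy)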
Setting proxies via environment variables
Faraday respects the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables. If set, these will be used as the default proxy configuration:
export HTTP_PROXY=http://12.34.56.78:8080
export HTTPS_PROXY=http://12.34.56.79:8080
Then in your Ruby code, initialize Faraday without any proxy options:
conn = Faraday.new('http://httpbin.org')
The connection will automatically use the proxies specified in the environment variables. This is very handy for specifying different proxies per environment.
You can also disable proxying for certain hosts using NO_PROXY:
export NO_PROXY=localhost,127.0.0.1,.example.com
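If you can't export the variables in the shell, you can also set them from within the process before the connection is created. A minimal sketch; the addresses are placeholders, and the Faraday.ignore_env_proxy switch mentioned in the comment is only available in newer Faraday versions:

# Set the proxy variables from Ruby before building the connection.
ENV['HTTP_PROXY']  = 'http://12.34.56.78:8080'
ENV['HTTPS_PROXY'] = 'http://12.34.56.79:8080'
ENV['NO_PROXY']    = 'localhost,127.0.0.1'

conn = Faraday.new('http://httpbin.org')

# To ignore environment proxies entirely (newer Faraday versions):
# Faraday.ignore_env_proxy = true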
Rotating proxies
To minimize the risk of getting blocked while scraping, it's a good idea to spread out your requests across multiple proxy servers. You can implement simple round-robin proxy rotation in Faraday like this:
proxies = [
  'http://12.34.56.78:8080',
  'http://23.45.67.89:8080',
  'http://34.56.78.90:8080'
]
conn = Faraday.new('http://httpbin.org')

proxies.cycle do |proxy|
  conn.proxy = proxy   # point the connection at the next proxy
  resp = conn.get('ip')
  puts resp.body
  sleep(1) # wait between requests
end
Each request will use the next proxy in the list, looping back to the first one when it reaches the end. (Note that cycle with a block loops forever, so in a real scraper you would break out once you run out of URLs to fetch.)
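If you would rather pull a proxy on demand inside an existing scraping loop instead of structuring the whole loop around proxies.cycle, you can keep the cycle as an Enumerator and call next on it. A sketch, with urls standing in for whatever pages you are actually scraping:

# An Enumerator that hands out the next proxy on each call to #next,
# wrapping around to the start when the list is exhausted.
proxy_pool = proxies.cycle

urls = ['http://httpbin.org/ip', 'http://httpbin.org/headers'] # placeholder targets

conn = Faraday.new
urls.each do |url|
  conn.proxy = proxy_pool.next
  resp = conn.get(url)
  puts resp.body
  sleep(1)
end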
You can also weight proxies based on their reliability or performance:
proxies = [
  { url: 'http://fast_proxy.com', weight: 3 },
  { url: 'http://slow_proxy.com', weight: 1 }
]

weighted_proxies = proxies.flat_map do |p|
  [p[:url]] * p[:weight]
end
conn = Faraday.new('http://httpbin.org')

weighted_proxies.cycle do |proxy|
  conn.proxy = proxy
  resp = conn.get('ip')
  sleep(1)
end
This will use fast_proxy 3 times as often as slow_proxy. Fiddle with the weightings to find the optimal ratio for your needs.
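Because weighted_proxies is just a flat list with repeated entries, you can also pick from it at random instead of cycling in order, which makes the request pattern a little less predictable. A small sketch (the 10.times loop is only a stand-in for your real scraping loop):

conn = Faraday.new('http://httpbin.org')

10.times do
  # Random weighted choice: proxies that appear more often in the list get picked more often.
  conn.proxy = weighted_proxies.sample
  puts conn.get('ip').body
  sleep(1)
end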
Handling errors and retrying
Proxies can be unreliable – they go down, get banned, or become unresponsive. To keep your scraper humming along, you need to gracefully handle errors and retry with a different proxy.
Here's a robust pattern for retrying failed requests:
MAX_RETRIES = 3

# Runs the given block, retrying up to MAX_RETRIES times on Faraday errors.
def with_retry
  tries = 0
  begin
    yield
  rescue Faraday::Error => e
    tries += 1
    if tries <= MAX_RETRIES
      sleep(1)
      retry
    else
      raise e
    end
  end
end
proxies = ['http://12.34.56.78:8080', ...]
conn = Faraday.new('http://example.com')

proxies.shuffle.each do |proxy|
  conn.proxy = proxy
  begin
    resp = with_retry { conn.get('/') }
    break if resp.success?
  rescue Faraday::Error
    next # try the next proxy
  end
end
This code retries each proxy up to MAX_RETRIES times before moving on to the next one, and it shuffles the proxy list to avoid always hitting the proxies in the same order.
Adjust the number of retries and the sleep duration to match your proxy quality and target website's rate limits.
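If you would rather not roll your own retry loop, Faraday also ships a retry middleware that re-issues failed requests for you. It is built into Faraday 1.x and packaged as the separate faraday-retry gem for Faraday 2.x (where the extra require below is needed). A minimal sketch:

require 'faraday'
require 'faraday/retry' # faraday-retry gem; not needed on Faraday 1.x

conn = Faraday.new('http://httpbin.org') do |f|
  # Retry failed requests up to 3 times, backing off between attempts.
  f.request :retry,
            max: 3,
            interval: 1,
            backoff_factor: 2,
            exceptions: [Faraday::ConnectionFailed, Faraday::TimeoutError]
  f.adapter Faraday.default_adapter
end

Note that the middleware retries through the same proxy; combine it with the rotation pattern above if you want failures to move on to a different proxy.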
Proxy services
While you can source free proxies from various online lists, they tend to be slow and unreliable. For serious scraping, you're better off paying for dedicated proxies or a proxy management service.
Some reputable proxy providers with Faraday support:
- Smartproxy – backconnect and rotating datacenter proxies
- Oxylabs – large pool of datacenter and residential proxies
- Bright Data (formerly Luminati) – peer-to-peer residential proxy network with a simple REST API
- ProxyMesh – worldwide network of anonymous proxies
Most of these services work out of the box with Faraday's basic proxy configuration. Some also provide gems that streamline proxy auth and rotation.
Before committing to a provider, test their proxies against your target sites to ensure they aren't blacklisted. Also consider the proxy location, bandwidth limits, and subnet diversity.
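Many of the rotating/backconnect services listed above expose a single gateway endpoint and swap the exit IP for you, so the Faraday side stays simple. A sketch with placeholder host, port, and credential names, since every provider documents its own values:

# A single rotating gateway: the provider changes the exit IP behind this one host.
# The host, port, and environment variable names below are placeholders.
gateway = {
  uri: 'http://gate.example-provider.com:7000',
  user: ENV.fetch('PROXY_USER'),
  password: ENV.fetch('PROXY_PASS')
}

conn = Faraday.new('http://httpbin.org', proxy: gateway)

3.times { puts conn.get('ip').body } # the origin IP may change per request, depending on the provider's rotation settings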
Conclusion
With its flexibility and strong proxy support, Faraday is one of the best tools for web scraping in Ruby. In this tutorial, we covered everything you need to know about configuring proxies with Faraday:
- Setting a basic proxy URL
- Authenticating with username/password
- Using environment variables
- Rotating proxies
- Handling errors and retrying requests
- Choosing a proxy service
By applying these techniques and best practices, you can keep your scrapers running smoothly and avoid detection.
Just remember – be respectful and limit your request rate. Only collect data that is publicly available and allowed by the site's terms of service. With great scraping power comes great responsibility!
Further reading:
- Faraday documentation
- Scraping tips from TOSCRAPE
- Compare proxy services on ProxyRack
- ScrapingBee's guide to web scraping without getting blocked