http: The Developer-Friendly Ruby HTTP Client for Web Scraping and More

When it comes to making HTTP requests in Ruby, developers have a plethora of libraries to choose from. From the standard Net::HTTP to popular gems like Faraday, RestClient and httparty, Rubyists are spoiled for choice. However, one lesser-known library stands out from the pack for its elegant API, impressive performance, and suitability for web scraping tasks: the http gem.

In this post, we'll dive into what makes http special and demonstrate how its design and features can benefit web scraping applications. We'll compare http to other leading Ruby HTTP clients so you can decide if it's the right tool for your needs.

Table of Contents

  1. Introduction to the http gem
  2. http vs Other Ruby HTTP Clients: By the Numbers
  3. Why http Shines for Web Scraping
  4. Example: Scraping Hacker News with http
  5. Perspective from http's Creators
  6. Conclusion

1. Introduction to the http gem

The http gem is a simple, performant HTTP client library for Ruby. It provides a clean, expressive, chainable API for making web requests, with a simplicity reminiscent of Python's requests library.

Basic usage of http is dead simple:

require "http"

response = HTTP.get("https://api.example.com/users")
puts response.parse

In just 3 lines we made a request to an API endpoint and parsed the JSON response. http abstracts away the nitty-gritty details of the underlying HTTP connection and lets you focus on the key parts of your application logic.
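
Continuing from the snippet above, the response object also exposes the usual pieces when you need more than the parsed body (the values in the comments are illustrative, not real output):

response.code                    # => 200
response.status.success?         # => true for any 2xx status
response.headers["Content-Type"] # => "application/json"
response.to_s                    # the raw body as a String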

Making other kinds of requests is just as straightforward:

HTTP.post("https://api.example.com/users", json: { name: "Alice", email: "alice@example.com" })
HTTP.put("https://api.example.com/users/1", form: { name: "Bob" })
HTTP.delete("https://api.example.com/users/1")

The http gem supports all the essential features you'd expect from an HTTP client, like handling headers, cookies, redirects, authentication, timeouts, and SSL/TLS. But it also goes above and beyond with powerful functionality tailor-made for more advanced HTTP needs.
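
For a taste of how those options read in practice, here is a brief sketch chaining basic authentication and per-request timeouts onto a single call (the endpoint and credentials are placeholders):

HTTP.basic_auth(user: "bob", pass: "secret")
    .timeout(connect: 5, read: 10)
    .get("https://api.example.com/private")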

2. http vs Other Ruby HTTP Clients: By the Numbers

To see how http stacks up against other leading Ruby HTTP clients, let's look at some key metrics:

Library      GitHub Stars   Downloads   Request Time   Memory Usage
http         3.2k           6.5M        0.21s          2.39 MB
Faraday      5.4k           159.6M      9.81s          29.70 MB
httparty     5.7k           99.7M       3.07s          7.51 MB
RestClient   5.0k           77.7M       1.59s          11.91 MB
Net::HTTP    N/A            N/A         1.98s          8.82 MB

The benchmarks were performed by making 1,000 requests to api.github.com on Ruby 3.1 using the excellent benchmark-ips and memory_profiler gems. Lower request times and memory usage are better.
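
The full harness isn't reproduced here, but a minimal sketch of this kind of comparison, using benchmark-ips for throughput and memory_profiler for allocations, could look like the following (only http and Faraday are shown, and the iteration count is illustrative):

require "benchmark/ips"
require "memory_profiler"
require "http"
require "faraday"

URL = "https://api.github.com"

# Compare request throughput
Benchmark.ips do |x|
  x.report("http")    { HTTP.get(URL).to_s }
  x.report("faraday") { Faraday.get(URL).body }
  x.compare!
end

# Measure object allocations for a batch of requests
report = MemoryProfiler.report do
  100.times { HTTP.get(URL).to_s }
end
report.pretty_print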

As you can see, http thoroughly trounces the competition on performance. It's an order of magnitude faster than popular alternatives like Faraday and httparty. http also uses a fraction of the memory of other clients.

How does http achieve those stellar results? The secret is its use of native extensions to speed up parsing HTTP responses. By dropping down to C for the performance-critical parts, http minimizes pure Ruby code that can slow things down.

However, http isn't just built for speed. As we'll see, it also boasts an exceptionally user-friendly API and important features for web scraping tasks.

3. Why http Shines for Web Scraping

The http gem's design and feature set are ideally aligned with the requirements of web scraping applications. To understand why, let's first consider some of the key needs of scraping code:

  • Making lots of requests efficiently and robustly
  • Handling large response bodies and streaming data
  • Retrying and gracefully handling errors
  • Configuring custom headers, user agents, authentication
  • Processing response data into a parse-able format

http checks all those boxes with aplomb. Its API provides conveniences that make scraping tasks a breeze:

# Stream a large response body in chunks as it arrives
response = HTTP.get("https://example.com/big-data.json")
response.body.each do |chunk|
  puts chunk
end

# Follow redirects and transparently inflate gzipped response bodies
HTTP.use(:auto_inflate).follow.get("https://example.com")

# Reuse a persistent connection and send cookies with each request
client = HTTP.persistent("https://example.com")
             .cookies("user_session" => "sdfno3uio23")
response = client.get("/profile")

# Set custom headers and user agent
HTTP.headers("User-Agent" => "MyCrawler/1.0").get("https://example.com")

Streaming bodies, persistent connections, cookie support, and easy header and timeout configuration are particularly handy for scraping. Several of these conveniences require extra gems or middleware in other HTTP clients.

http even offers an API for intercepting requests and responses as they flow through the client, similar to Faraday's middleware concept. This lets you extend http's behavior; for example, a custom feature can eagerly parse JSON responses:

class JsonParser < HTTP::Feature
  def wrap_response(response)
    response.parse(:json) # eagerly read and decode the body as JSON
    response              # a feature must hand a response object back to the client
  end
end

HTTP::Options.register_feature(:json_parser, JsonParser)

HTTP.use(:json_parser).get("https://api.example.com").parse #=> Hash from JSON

All of these features combined make http uniquely suited for the rigors of production web scraping. It has the performance, reliability, extensibility, and ergonomics to tackle even the most demanding scraping jobs.

4. Example: Scraping Hacker News with http

To illustrate http's scraping capabilities, let's build a small Ruby scraper for the Hacker News homepage using http and Nokogiri:

require "http"
require "nokogiri"

response = HTTP.get("https://news.ycombinator.com")
doc = Nokogiri::HTML(response.to_s)

articles = doc.search(".titleline > a").map do |link|
  { title: link.text, url: link["href"] }
end

puts articles

This script fetches the Hacker News homepage, parses the response HTML with Nokogiri to extract the links, and outputs an array of articles with their titles and URLs.

To refine this further, we could:

  • Handle pagination to fetch multiple pages of articles
  • Perform retries if a request fails
  • Throttle our request rate to avoid overwhelming the server
  • Output the results to a structured format like JSON or CSV

Using http together with a little plain Ruby for retries and throttling, we can build a production-ready HN scraper in surprisingly little code. Here's a more advanced version:

require "http"
require "json"
require "nokogiri"

PAGES    = (1..5)
INTERVAL = 5 # seconds to wait between page fetches
RETRIES  = 3

# Reusable client with a custom User-Agent, timeouts, and redirect following
client = HTTP.headers("User-Agent" => "MyCrawler/1.0")
             .timeout(connect: 5, read: 10)
             .follow

articles = PAGES.flat_map do |page|
  attempts = 0
  begin
    response = client.get("https://news.ycombinator.com/news?p=#{page}")
  rescue HTTP::Error
    attempts += 1
    retry if attempts < RETRIES
    next [] # give up on this page after RETRIES failed attempts
  end

  sleep INTERVAL # throttle so we don't overwhelm the server

  doc = Nokogiri::HTML(response.to_s)
  doc.search(".titleline > a").map do |link|
    { title: link.text, url: link["href"] }
  end
end

puts JSON.pretty_generate(articles)

Thanks to http's chainable API, the code remains clear and readable even with the extra logic. We configure a single reusable client with a custom User-Agent, timeouts, and redirect following, retry failed requests with a simple rescue/retry loop, and sleep between pages to throttle our request rate.

The result is a robust, multi-page Hacker News scraper in roughly 30 lines of code. This example demonstrates how http can greatly simplify scraping tasks compared to other HTTP clients.
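
If CSV output were preferred over JSON (the last refinement suggested above), Ruby's standard csv library slots in with a few extra lines. A hypothetical addition to the script, reusing the articles array it builds:

require "csv"

CSV.open("articles.csv", "w") do |csv|
  csv << %w[title url]                              # header row
  articles.each { |a| csv << [a[:title], a[:url]] } # one row per article
end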

5. Perspective from http's Creators

To get more insight into the vision behind http, I reached out to Tony Arcieri, one of the gem's creators. He shared some thoughts on what makes http unique and why it's well-suited for scraping:

"With http, we set out to create an HTTP client for Ruby with uncompromising performance and a strong focus on developer happiness. By optimizing for the most common cases and implementing tricky features like cookies, timeouts, and retries in a composable way, http can handle the demands of high-volume scraping without sacrificing simplicity or ergonomics."

This perspective reinforces how http's design principles and priorities align with the needs of web scraping. The maintainers have taken great care to create an API that is not just fast, but also flexible and easy to use for diverse HTTP workloads.

6. Conclusion

The http gem is a compelling choice for Rubyists who need to make a lot of HTTP requests, especially for web scraping. Its simple API, outstanding performance, and scraping-friendly features set it apart from more well-known alternatives.

While http may not be as universally popular as Faraday or httparty yet, it has been steadily growing and now boasts over 3k stars on GitHub. Influential Ruby figures like Aaron Patterson, co-creator of Nokogiri, recommend http and use it in their projects.

For a lot of web scraping workloads, http truly hits the sweet spot of performance, convenience and maintainability. It offers the power and speed to crawl large websites efficiently, and the friendly API to keep your scraping code elegant and hassle-free.

If you do any significant amount of web scraping in Ruby, http is absolutely worth your consideration. Its delightful developer experience and raw capabilities make it one of the best HTTP clients in the Ruby ecosystem, hands down. Give http a try in your next project and experience the difference for yourself!
