Web scraping, the automated extraction of data from websites, is an increasingly important technique powering many business use cases. Companies rely on web scraping for market research, lead generation, competitor monitoring, and aggregating data for machine learning models.
According to a 2020 survey by Oxylabs, 52% of companies use web scraping for market research, 49% for competition monitoring, and 40% for lead generation. Web scraping has become a mainstream data gathering method.
As a web scraping expert and consultant, I've seen firsthand how organizations in e-commerce, finance, travel, and other industries use web scraping to gain valuable business insights. One of the best programming languages for scraping the web is Ruby.
Ruby has long been a popular choice for web scraping due to its simple syntax, powerful collection of open source libraries, and strong community. In this guide, I'll share the tools, techniques and best practices I've learned over the years for web scraping with Ruby.
Why Use Ruby for Web Scraping?
Ruby has several advantages that make it well-suited for web scraping:
Simple and expressive syntax – Ruby code is concise and readable, making it easy to write and maintain web scraping scripts.
Mature ecosystem of libraries – Ruby has a large collection of open source libraries for making HTTP requests (RestClient, Faraday, Typhoeus), parsing HTML and XML (Nokogiri), and automating web browsers (Capybara, Ferrum).
Strong community – The Ruby community is known for being welcoming and helpful. There are many tutorials, blog posts, and Stack Overflow answers available for learning web scraping with Ruby.
Interactive shell – Ruby's Interactive Ruby Shell (IRB) allows you to quickly test code snippets and debug issues, which is very helpful when developing web scrapers.
Full-stack web framework – Ruby on Rails provides a complete web development framework, making it easy to integrate web scraping into larger applications.
Scraping Static Websites with Ruby
Many websites serve static HTML content that can be easily scraped with a simple HTTP request library and an HTML parsing library. In Ruby, the most popular choices are RestClient or Faraday for making requests and Nokogiri for parsing HTML.
Here's an example of scraping a static website with RestClient and Nokogiri:
require 'rest-client'
require 'nokogiri'

url = 'https://www.example.com/products'
response = RestClient.get(url)
doc = Nokogiri::HTML(response.body)

products = []
doc.css('.product').each do |product|
  name = product.at_css('.product-name').text
  price = product.at_css('.product-price').text
  products << { name: name, price: price }
end

puts products
This simple script fetches the HTML of a products listing page, parses out the name and price of each product using CSS selectors, and outputs the data as an array of hashes.
Scaling Up with Parallel Requests
One way to speed up web scraping is to make multiple HTTP requests in parallel instead of sequentially. The Typhoeus library makes it easy to do this in Ruby.
require 'nokogiri'
require 'typhoeus'

urls = [
  'https://www.example.com/products?page=1',
  'https://www.example.com/products?page=2',
  'https://www.example.com/products?page=3'
]

hydra = Typhoeus::Hydra.new
requests = urls.map do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  hydra.queue(request)
  request
end
hydra.run

products = []
requests.each do |request|
  doc = Nokogiri::HTML(request.response.body)
  doc.css('.product').each do |product|
    name = product.at_css('.product-name').text
    price = product.at_css('.product-price').text
    products << { name: name, price: price }
  end
end
puts "Found #{products.size} products"
In this example, we first define an array of URLs to scrape. Then we create a Typhoeus::Hydra object, which allows us to make parallel requests. We map over the URLs array to create a Typhoeus::Request object for each URL, queue up each request, and store the requests in an array. Calling hydra.run fires all the requests in parallel and waits for them to complete. Then we can iterate over our requests array, parse the HTML from each response, and extract the data we want.
Making parallel requests like this can significantly speed up scraping large websites. In my experience, I've seen parallel requests reduce scraping time from hours to minutes for large e-commerce websites with thousands of product pages.
Scraping Dynamic Websites with Ruby
Scraping websites that render content dynamically with JavaScript is more challenging than scraping static sites, but still very possible with Ruby. To scrape dynamic sites, you need to use a headless browser that can execute JavaScript.
The most popular tool for this is Capybara, which provides a high-level API for automating web browsers. Capybara supports multiple browser drivers, but I prefer to use Apparition, a pure Ruby driver that talks directly to Chrome via the Chrome DevTools Protocol, with no separate WebDriver binary required.
Here's an example of using Capybara and Apparition to scrape JavaScript-rendered content:
require 'capybara'
require 'capybara/dsl'
require 'capybara/apparition'

Capybara.default_driver = :apparition
Capybara.run_server = false
Capybara.app_host = "https://www.example.com"

class DynamicScraper
  include Capybara::DSL

  def get_listings
    visit "/listings"
    sleep(2)
    listings = []
    all(".listing").each do |listing|
      name = listing.find(".listing-name").text
      address = listing.find(".listing-address").text
      listings << { name: name, address: address }
    end
    listings
  end
end

puts DynamicScraper.new.get_listings
In this example, we configure Capybara to use Apparition as the default driver and set the app_host to the base URL we want to scrape.
Inside the DynamicScraper class, we visit the listings path and use sleep to wait a couple of seconds for the content to render. Then we use Capybara's all and find methods to locate elements using CSS selectors, extract the data, and return it.
Using a headless browser like this allows the scraper to see the fully-rendered DOM after JavaScript execution, just like a real user would.
Tips for Scraping Dynamic Websites
Scraping dynamic websites can be tricky. Here are a few tips I've learned:
Wait for content to load – Dynamic websites often fetch data asynchronously, so you need to wait for the content to be present before trying to scrape it. Use explicit waits like find (which retries until the element appears) instead of implicit waits like sleep when possible.
Monitor requests in the network panel – Open the Network panel in your browser's developer tools to see what requests are being made as you interact with the page. You might be able to extract data directly from XHR or fetch requests instead of rendering the full page.
Reverse-engineer the API – Many dynamic sites load data from internal JSON APIs. If you can figure out the schema of the API by inspecting network requests, you can directly make requests to the API endpoint to get structured data instead of scraping the rendered HTML.
Handle infinite scroll – Some sites use infinite scroll to load more content as the user scrolls down the page. To scrape these sites, identify the element that triggers the next page load and use Capybara's execute_script method to simulate scrolling, as shown in the sketch after this list.
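To make the waiting and scrolling tips concrete, here is a minimal sketch using Capybara with the Apparition driver from the earlier example. The /listings URL, the .listing selector, and the assumption that the page simply stops adding items once everything has loaded are illustrative placeholders, not details of any particular site:

require 'capybara'
require 'capybara/apparition'

session = Capybara::Session.new(:apparition)
session.visit("https://www.example.com/listings")

# Explicit wait: find retries until the first listing appears (up to the
# given number of seconds) instead of sleeping for a fixed time.
session.find(".listing", match: :first, wait: 10)

# Infinite scroll: keep scrolling to the bottom until no new listings load.
loop do
  count_before = session.all(".listing").size
  session.execute_script("window.scrollTo(0, document.body.scrollHeight)")
  sleep(1) # brief pause so newly fetched items can be appended
  break if session.all(".listing").size == count_before
end

puts "Loaded #{session.all('.listing').size} listings"

On a real site you might instead wait for a loading spinner to disappear or reuse the XHR requests you spotted in the network panel, but the count-based loop keeps the idea simple.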
Storing Scraped Data
Once you've extracted data from a website, you'll likely want to store it in a structured format for later analysis. Some common options are saving to a CSV file, inserting into a SQL database like PostgreSQL, or storing in a NoSQL database like MongoDB.
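For quick jobs, Ruby's built-in csv library is often all you need. Here's a small sketch that writes the products array built in the earlier examples to a file; the products.csv filename and the name/price keys are simply carried over from those examples:

require 'csv'

# products is the array of { name:, price: } hashes built by the scraper
CSV.open("products.csv", "w") do |csv|
  csv << ["name", "price"] # header row
  products.each do |product|
    csv << [product[:name], product[:price]]
  end
end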
Here's an example of storing scraped data in a PostgreSQL database using the Ruby Sequel gem:

require 'sequel'

DB = Sequel.connect('postgres://user:password@host:port/database_name')

# Define a products table
DB.create_table? :products do
  primary_key :id
  String :name
  Float :price # scraped price strings like "$19.99" need converting to numbers first
end

# Insert all of the scraped products in a single statement
DB[:products].multi_insert(products)
Storing scraped data in a database makes it easier to aggregate, search, and analyze the data later. It also allows you to run scrapers on a schedule and incrementally update the data over time.
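As a sketch of that incremental pattern on PostgreSQL (assuming the products table has a unique constraint on name, which the table above does not define), Sequel's insert_conflict lets a scheduled run update existing rows instead of inserting duplicates:

# Assumes the table was created with `String :name, unique: true`
products.each do |product|
  DB[:products]
    .insert_conflict(target: :name, update: { price: Sequel[:excluded][:price] })
    .insert(product)
end

You could then run the scraper script from cron or a background job on whatever schedule suits your data.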
Challenges and Best Practices
Web scraping isn't without its challenges. As a web scraping expert, I've learned several best practices for dealing with them:
Be respectful – Don't overload servers with too many requests too quickly. Add delays and limit concurrent requests. Respect robots.txt files and only scrape content that is publicly available and allowed.
Handle errors gracefully – Websites change frequently, so expect your scrapers to break. Use retries and exponential backoff to handle network errors and rate limiting (see the sketch after this list). Log errors and set up alerts to notify you when a scraper fails.
Rotate IP addresses and user agents – Many websites try to block scrapers by detecting patterns like a high number of requests from a single IP address or a suspicious user agent string. Use a pool of proxy IPs and rotate user agents to avoid getting blocked.
Render JavaScript efficiently – Executing JavaScript in a headless browser is slower than making normal HTTP requests. Only use a headless browser when absolutely necessary, and limit the amount of content rendered to only what you need to scrape.
Avoid honeypot traps – Some websites set up honeypot links that are hidden from users but detectable by scrapers. If you follow these links, the site will know you are a scraper. Inspect the page's DOM to avoid accidentally triggering honeypots.
Monitor and maintain scrapers – Web scrapers require ongoing maintenance as websites change. Set up monitoring and alerts to detect when a scraper fails or returns unexpected results. Schedule regular maintenance to update selectors and fix broken scrapers.
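Here is a minimal sketch of the retry, backoff, and user-agent rotation ideas using RestClient. The user agent strings, retry counts, and delays are arbitrary placeholders, and any proxy you plug in would come from your own provider:

require 'rest-client'

USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
]

# Fetch a URL with retries, exponential backoff, and a rotating user agent.
def fetch_with_retries(url, max_retries: 3)
  attempts = 0
  begin
    RestClient.get(url, user_agent: USER_AGENTS.sample)
  rescue RestClient::ExceptionWithResponse, Errno::ECONNRESET => e
    attempts += 1
    raise if attempts > max_retries
    warn "Attempt #{attempts} failed (#{e.class}), backing off..."
    sleep(2**attempts) # back off: 2, 4, 8 seconds
    retry
  end
end

response = fetch_with_retries("https://www.example.com/products")
puts response.code

To rotate IPs as well, you could issue requests through RestClient::Request.execute with its proxy option and pick a different proxy URL from your pool on each attempt.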
Legality and Ethics
Web scraping is a contentious topic and the laws around it are complex. In general, US courts have been reluctant to treat the scraping of publicly accessible data as unlawful; rulings in hiQ Labs v. LinkedIn and Sandvig v. Barr support this position.
However, scraping can potentially run afoul of laws like the Computer Fraud and Abuse Act (CFAA) in the US and the General Data Protection Regulation (GDPR) in Europe if sensitive personal data is scraped or if scrapers bypass technical restrictions to access non-public content.
Ultimately, whether scraping is legal depends on what is being scraped, how it is being scraped, and what the scraped data is used for. When in doubt, consult a lawyer.
From an ethical perspective, I believe scrapers have a responsibility to be good internet citizens. Don't steal proprietary content, don't overload servers with an unreasonable number of requests, and don't try to circumvent a site's attempts to restrict scraping. Scrape respectfully and use scraped data responsibly.
Conclusion
Web scraping is a powerful technique for gathering data from the internet, and Ruby is an excellent language for building scrapers. Whether you are scraping static or dynamic content, Ruby's simple syntax and wealth of libraries make it easy to extract, transform, and store data.
Some key points to remember:
- Use HTTP libraries like RestClient and Faraday to fetch static content
- Use Nokogiri to parse HTML and CSS selectors to extract data
- Use a headless browser like Capybara and Apparition to scrape dynamic JavaScript-rendered content
- Store scraped data in a structured format like a CSV file or database for later analysis
- Follow best practices like respecting robots.txt, handling errors, and rotating IPs to avoid getting blocked
- Be aware of the legal and ethical implications of scraping and only scrape what is allowed and necessary
With the right approach and tools, web scraping with Ruby can provide immense value for businesses and individuals alike. I hope this guide has helped you learn more about scraping the web with Ruby!