Web Scraping with Scala: Insights from the Experts

Web scraping, the automated extraction of data from websites, is an increasingly valuable skill in our data-driven world. As the web has grown to over 1.8 billion websites[1], the amount of data available to be scraped is staggering. The global web scraping services market is expected to grow from $1.6B in 2022 to $6.5B by 2027, a 32.6% annual growth rate[2].

Scala, with its blend of object-oriented and functional programming paradigms, makes an excellent choice for writing robust and maintainable web scrapers. Its concise syntax, static typing, and seamless Java interoperability allow for writing expressive and performant scraping code. In this comprehensive guide, we'll dive deep into the world of web scraping using Scala.

Parsing HTML with jsoup

jsoup is a popular Java library for working with HTML that can be easily used in Scala projects. After adding the dependency "org.jsoup" % "jsoup" % "1.15.4", we can fetch and parse the HTML of a webpage:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

val doc: Document = Jsoup.connect("https://en.wikipedia.org/").get()
println(doc.title())

jsoup provides methods to traverse and extract data from the parsed HTML using CSS selectors:

val newsLinks = doc.select("#mp-itn b a")
newsLinks.forEach(link => {
  println(s"${link.attr("title")}: ${link.attr("abs:href")}")
})

Here we select all <a> elements under <b> elements within the "In the news" section (id="mp-itn") and print their title and absolute href attributes.
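
Because jsoup returns Java collections, it is often convenient to convert them into Scala collections and use the usual combinators. A small sketch assuming Scala 2.13+ (where scala.jdk.CollectionConverters is available):

import scala.jdk.CollectionConverters._

// Convert the Java Elements collection to a Scala sequence, then map and filter as usual
val headlines: Seq[(String, String)] = newsLinks.asScala.toSeq
  .map(link => (link.attr("title"), link.attr("abs:href")))
  .filter { case (title, _) => title.nonEmpty }

headlines.foreach { case (title, href) => println(s"$title: $href") }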

One neat jsoup feature is the ability to parse HTML fragments, useful for scraping web APIs that return partial HTML:

val html = """
  <ul>
    <li><a href="/item1">Item 1</a></li>
    <li><a href="/item2">Item 2</a></li>
  </ul>
"""

val fragment: Document = Jsoup.parseBodyFragment(html)
fragment.select("a").forEach(link => {
  println(s"Text: ${link.text}, Href: ${link.attr("href")}")  
})

While jsoup makes basic HTML extraction quite straightforward, more complex scraping tasks can benefit from a higher-level library built on top of jsoup, like Scala Scraper.

Declarative Scraping with Scala Scraper

Scala Scraper is a library that provides a concise and type-safe DSL for web scraping, built on top of jsoup. To use it, add the dependency "net.ruippeixotog" %% "scala-scraper" % "3.0.0", then import the DSL and browser:

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

We can now fetch and parse webpages using the JsoupBrowser:

val doc = JsoupBrowser().get("https://en.wikipedia.org/")

Extracting content becomes more declarative using the Scala Scraper DSL:

val featuredArticleText: String = doc >> text("#mp-tfa p")
val featuredArticleLink: String = doc >> attr("href")("#mp-tfa a")

The >> operator drills down into the document, text extracts the combined text of selected elements, attr("href") extracts the href attribute, and #mp-tfa p and #mp-tfa a are the CSS selectors targeting the desired elements.
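
Selectors that match nothing cause >> to throw an exception. Scala Scraper also provides a >?> operator that wraps the result in an Option, which is handy when an element may be absent; a small sketch reusing the featured-article selector from above:

// >?> yields None instead of throwing when the selector matches nothing
val maybeFeatured: Option[String] = doc >?> text("#mp-tfa p")

maybeFeatured match {
  case Some(summary) => println(s"Featured article summary: $summary")
  case None          => println("No featured article found on this page")
}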

Oftentimes we want to extract multiple related pieces of data from each element, like the title and URL of links. Scala Scraper keeps this concise: extract an element list and map over it:

val items = doc >> elementList("#mp-dyk ul li a") map { item =>
  (item >> attr("href"), item >> text)
}

This extracts a list of tuples containing the href and text of each "Did you know" link. Scala Scraper also supports recursively extracting deeply nested elements.
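
The same extraction can also be written as a for-comprehension over the element list, which some find more readable; a brief sketch:

// Equivalent extraction expressed as a for-comprehension
val itemsAlt: List[(String, String)] =
  for (item <- doc >> elementList("#mp-dyk ul li a"))
    yield (item >> attr("href"), item >> text)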

While jsoup and Scala Scraper are excellent for scraping static webpages, many modern sites rely heavily on JavaScript to load content dynamically, which calls for a browser automation tool like Selenium.

Dynamic Scraping with Selenium

Selenium is a tool for automating web browsers, commonly used for testing but also applicable to web scraping. It can execute JavaScript, fill out and submit forms, take screenshots, and more.

Selenium requires a browser-specific driver executable (e.g. chromedriver or geckodriver) to be available on the system PATH. After adding the dependency "org.seleniumhq.selenium" % "selenium-java" % "4.8.3", we can start a browser instance:

import org.openqa.selenium.firefox.FirefoxDriver
val driver = new FirefoxDriver()
driver.get("https://en.wikipedia.org/")

Elements can be located using the findElement and findElements methods:

import org.openqa.selenium.By

val featuredArticleHeading = driver.findElement(By.cssSelector("#mp-tfa .mp-h2")).getText
val inTheNewsLinks = driver.findElements(By.cssSelector("#mp-itn b a"))

Where Selenium really shines is interacting with dynamic page elements. For example, to search Wikipedia and extract the result links:

driver.findElement(By.id("searchInput")).sendKeys("functional programming")
driver.findElement(By.id("searchButton")).click()

val searchResults = driver.findElements(By.cssSelector("#mw-content-text .mw-search-results .mw-search-result-heading a"))
searchResults.forEach(result => println(s"${result.getText} - ${result.getAttribute("href")}"))

This enters a search term, clicks the search button to load and navigate to the results page, then extracts the result titles and URLs.
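
Because the results page loads via navigation and may take a moment to render, it is usually safer to wait explicitly for the result elements instead of querying them immediately. A small sketch using Selenium's WebDriverWait (the 10-second timeout is an arbitrary choice):

import java.time.Duration
import org.openqa.selenium.support.ui.{ExpectedConditions, WebDriverWait}

// Wait up to 10 seconds for at least one search result heading to be present in the DOM
val explicitWait = new WebDriverWait(driver, Duration.ofSeconds(10))
explicitWait.until(ExpectedConditions.presenceOfElementLocated(
  By.cssSelector(".mw-search-results .mw-search-result-heading a")))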

Pagination is another common obstacle when scraping that Selenium can handle. Here's an example of scraping search results across multiple pages:

import org.openqa.selenium.NoSuchElementException

var hasNextPage = true
while (hasNextPage) {
  val pageResults = driver.findElements(By.cssSelector(".mw-search-results .mw-search-result-heading a"))
  pageResults.forEach(result => println(s"${result.getText} - ${result.getAttribute("href")}"))

  try {
    // The "next" link is absent on the last results page, so this throws once we're done
    driver.findElement(By.cssSelector(".mw-nextlink")).click()
  } catch {
    case _: NoSuchElementException => hasNextPage = false
  }
}

This scrapes each page of search results, clicking the next page link until reaching the last page.

One drawback of scraping with an automated browser is that it's slower and more resource-intensive than using an HTML parsing library directly. For efficiency, we can run the browser in headless mode:

import org.openqa.selenium.firefox.FirefoxOptions

val options = new FirefoxOptions()
options.addArguments("--headless")
val driver = new FirefoxDriver(options)

Headless mode runs the browser without a visible UI, improving speed and allowing scrapers to run on servers without a display.
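
Whichever mode you use, it's good practice to shut the driver down when scraping finishes so the browser process and WebDriver session are released; a minimal pattern:

try {
  driver.get("https://en.wikipedia.org/")
  // ... locate elements and extract data here ...
} finally {
  driver.quit()  // closes all browser windows and ends the WebDriver session
}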

Another challenge with scraping is anti-bot measures like CAPTCHAs. While Selenium can't automatically solve CAPTCHAs, there are services like 2captcha and DeathByCaptcha that provide APIs to programmatically solve them.

Ethical Scraping and Best Practices

With great scraping power comes great responsibility. When scraping, it's crucial to respect website owners and abide by ethical principles:

  • Check the robots.txt file and respect any disallowed pages. The Robots Exclusion Standard specifies which parts of a site web crawlers are permitted to access. From Scala, you can parse robots.txt using a Java library such as crawler-commons.

  • Follow the site's terms of service and scrape only content you have permission to use. Avoid scraping copyrighted data or personal information.

  • Limit your crawl rate and introduce delays between requests to avoid overwhelming servers, and cache scraped content so the same data isn't fetched repeatedly (a small sketch follows this list).

  • Use a descriptive user agent string identifying your scraper so site owners can contact you if needed.

  • Handle errors gracefully as site layouts can unexpectedly change and break your parsers. Scrapers require ongoing maintenance.
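
To illustrate the rate-limiting and user-agent points above, here is a minimal politeness sketch built on jsoup; the delay value, user agent string, and contact URL are illustrative assumptions:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Fetch a page politely: identify the scraper and pause before each request.
// The 2-second delay and the user agent string below are placeholders, not recommendations.
def politeFetch(url: String, delayMillis: Long = 2000): Document = {
  Thread.sleep(delayMillis)
  Jsoup.connect(url)
    .userAgent("MyScalaScraper/1.0 (+https://example.com/contact)")
    .get()
}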

Many sites are now rendered predominantly with JavaScript frameworks like React, Angular, or Vue, making them more challenging targets for scraping. For such single-page applications, newer browser automation tools such as Puppeteer or Playwright may be preferable to Selenium.

The Future of Web Scraping

As the web evolves, so does web scraping. With the rise of AI-generated content and dynamic, JavaScript-heavy web applications, scraping is becoming more challenging and complex.

Headless browsers are increasingly crucial for scraping modern sites. Tools like Puppeteer and Playwright provide higher-level APIs for automating browsers and handling SPAs. We‘re likely to see further development of libraries built on top of headless browsers, offering declarative scraping APIs.

Machine learning techniques like natural language processing and computer vision are being applied to scraping for tasks like entity recognition and image classification. As AI models improve, we may see the emergence of "intelligent scrapers" that can automatically identify and extract relevant data from pages.

While web scraping is a powerful technique, it's important to consider the broader societal implications. As more data is scraped and aggregated, issues around data privacy, ownership, and fair use will become increasingly pressing. Policymakers and the tech community will need to work together to develop ethical frameworks for scraping.

Conclusion

Web scraping with Scala offers a wealth of possibilities for gathering data from across the internet. Whether you're a data scientist, business analyst, or software engineer, being able to programmatically extract web data is an immensely valuable skill.

In this guide, we've explored several libraries for scraping in Scala:

  • jsoup for basic HTML parsing and extraction
  • Scala Scraper for more declarative and type-safe scraping
  • Selenium for automating browsers and scraping dynamic sites

We've also discussed best practices for writing respectful and robust scrapers, and touched on the future of scraping in an increasingly AI-driven web.

Ultimately, the best scraping approach depends on the specific site and data you're targeting. By understanding the range of tools and techniques available, you can mix and match to build scrapers well-suited to your use case.

As you venture out into the world of web scraping, remember to scrape ethically, respect website owners, and abide by legal and privacy considerations. Happy scraping!

References

[1] https://siteefy.com/how-many-websites-are-there/
[2] https://www.marketwatch.com/press-release/at-326-cagr-web-scraping-services-market-size-worth-usd-6482-million-by-2027-2022-10-06
