Web Scraping in Golang with Colly: The Comprehensive Guide

Web scraping, the automated extraction of data from websites, is an increasingly essential skill for developers and data professionals. As the volume of data on the web continues to grow exponentially, the ability to efficiently collect and parse that data has become critical for a wide range of applications, from price monitoring to lead generation to sentiment analysis.

Consider these statistics:

  • The amount of data created each day is expected to reach 463 exabytes globally by 2025 [Source]
  • Web scraping is used by 55% of business leaders for gathering competitor intelligence and 52% for lead generation [Source]
  • The web scraping market is projected to grow from $5.6 billion in 2022 to $9.7 billion by 2027, at a CAGR of 11.6% [Source]

Clearly, web scraping is a high-value skill, and choosing the right tools and techniques is key to scraping efficiently and effectively. While there are many programming languages that support web scraping, Golang has emerged as a top choice in recent years for several reasons:

  1. Performance: Go is designed for speed and efficiency, with a lightweight runtime and native concurrency support. This makes it well-suited for I/O bound tasks like web scraping.

  2. Simplicity: Go has a clean and expressive syntax that is easy to read and write. This simplicity reduces development time and maintenance costs.

  3. Robustness: Go has strong typing and built-in error handling, which helps catch bugs early and keeps long-running scraping jobs stable.

  4. Concurrency: Go's goroutines and channels make it easy to write concurrent scrapers that can process multiple pages in parallel, dramatically improving performance (see the sketch after this list).
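To make the concurrency point concrete, here is a minimal sketch (plain net/http, no Colly yet) of fetching several pages in parallel with goroutines and a WaitGroup; the URLs are placeholders chosen purely for illustration:

package main

import (
    "fmt"
    "net/http"
    "sync"
)

func main() {
    // Placeholder URLs used purely for illustration.
    urls := []string{
        "https://example.com/",
        "https://example.org/",
        "https://example.net/",
    }

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        // Fetch each page in its own goroutine.
        go func(u string) {
            defer wg.Done()
            resp, err := http.Get(u)
            if err != nil {
                fmt.Println("error fetching", u, ":", err)
                return
            }
            defer resp.Body.Close()
            fmt.Println(u, "->", resp.Status)
        }(url)
    }
    // Wait for all fetches to complete before exiting.
    wg.Wait()
}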

To make web scraping in Go even more powerful and productive, the open source community has created a number of libraries and frameworks. One of the most popular is Colly.

Introducing Colly

Colly is a lightweight, idiomatic web scraping and crawling framework for Golang. It provides a clean and expressive API for traversing web pages, extracting structured data, and controlling the flow of requests.

Some key features of Colly include (several of them appear in the configuration sketch after this list):

  • Automatic cookie and session handling
  • Parallel scraping with configurable limits
  • Caching and request delays to avoid overloading servers
  • Automatic handling of character encoding, redirection, and compression
  • Built-in handling of robots.txt policies
  • Integration with popular storage backends like MongoDB and Redis
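To give a flavour of how several of these features are switched on, here is a minimal sketch of a collector configured for parallel scraping with limits and on-disk caching. The domain, cache directory, and delay values are placeholder choices, not Colly defaults:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        // Restrict the crawl to a placeholder domain.
        colly.AllowedDomains("example.com"),
        // Cache responses on disk so repeated runs don't re-fetch pages.
        colly.CacheDir("./colly_cache"),
        // Process requests in parallel goroutines.
        colly.Async(true),
    )

    // Cap parallelism and space out requests for all domains.
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       500 * time.Millisecond,
    }); err != nil {
        log.Fatal(err)
    }

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("https://example.com/")
    // With Async enabled, wait for all in-flight requests to finish.
    c.Wait()
}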

Colly has quickly gained adoption and now has over 18,000 stars on GitHub. It's used by companies like Algolia, Tencent, and Instagram to power their web scraping projects.

Getting Started with Colly

To start using Colly, you'll first need to install Go and set up a new project. Refer to the official Golang installation guide for detailed instructions.

Once you have Go set up, create a new directory for your scraping project and initialize a new module:

$ mkdir goscrapers && cd goscrapers
$ go mod init github.com/yourusername/goscrapers

Next, install Colly using Go's built-in dependency manager:

$ go get -u github.com/gocolly/colly

You're now ready to start building scrapers!

Basic Example: Crawling Links

As a first example, let's write a program that crawls hackerspaces.org and prints every link it finds. Create a new file called link_scraper.go with the following code:

package main

import (
    "fmt"
    "github.com/gocolly/colly"
)

func main() {
    // Create a new collector
    c := colly.NewCollector(
        // Visit only domains: hackerspaces.org, wiki.hackerspaces.org
        colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
    )

    // On every a element which has href attribute call callback
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        // Print link
        fmt.Printf("Link found: %q\n", link)
        // Visit link found on page
        // Only those links are visited which are in AllowedDomains
        c.Visit(e.Request.AbsoluteURL(link))
    })

    // Before making a request print "Visiting ..."
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    // Start scraping on https://hackerspaces.org
    c.Visit("https://hackerspaces.org/")
}

This code does the following:

  1. Creates a new Collector instance with colly.NewCollector(), specifying which domains are allowed to be scraped.

  2. Registers a callback on the OnHTML event to be triggered whenever the collector encounters an <a> element with an href attribute. The callback function extracts the URL from the href attribute and instructs the collector to visit that URL if it's within the allowed domains.

  3. Registers a callback on the OnRequest event to print a message to the console each time the collector is about to make an HTTP request.

  4. Starts the scraping process by calling c.Visit() with the initial URL.

When you run this program, you should see output similar to:

Visiting https://hackerspaces.org/
Link found: "/about/"
Visiting https://hackerspaces.org/about/
Link found: "/"
Link found: "/spaces/"
Visiting https://hackerspaces.org/spaces/
Link found: "/events/"
Visiting https://hackerspaces.org/events/
Link found: "/news/"
Visiting https://hackerspaces.org/news/
...

This demonstrates the basic flow of using Colly – create a collector, register callbacks to handle different events, and start the collector on an initial URL. The collector will then follow links and invoke your callbacks as it crawls the target site.
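Beyond OnHTML and OnRequest, Colly offers callbacks for other points in the request lifecycle. The following sketch registers OnResponse, OnError, and OnScraped on a fresh collector pointed at the same hackerspaces.org start URL; the printed messages are just illustrative:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Called before each request is sent.
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    // Called after a response arrives, before the HTML is parsed.
    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Received", len(r.Body), "bytes from", r.Request.URL)
    })

    // Called when a request fails or returns an error status.
    c.OnError(func(r *colly.Response, err error) {
        fmt.Println("Request to", r.Request.URL, "failed:", err)
    })

    // Called after all OnHTML callbacks for a page have finished.
    c.OnScraped(func(r *colly.Response) {
        fmt.Println("Finished scraping", r.Request.URL)
    })

    c.Visit("https://hackerspaces.org/")
}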

Advanced Example: Scraping and Storing Articles

Of course, just printing out URLs isn't very useful. In most cases, you'll want to extract structured data from the pages you scrape and store it for later analysis.

Let's look at a more advanced example that scrapes news articles from a site, extracts their titles, summaries, and publication dates, and stores the results in a PostgreSQL database.

First, install the necessary PostgreSQL driver:

$ go get -u github.com/lib/pq

Then create a new file called article_scraper.go with the following code:

package main

import (
    "database/sql"
    "fmt"
    "log"
    "strings"
    "time"

    "github.com/gocolly/colly"
    _ "github.com/lib/pq"
)

type Article struct {
    Title       string    `json:"title"`
    Summary     string    `json:"summary"`
    PublishedAt time.Time `json:"publishedAt"`
}

func main() {
    // Connect to PostgreSQL DB
    db, err := sql.Open("postgres", "postgres://username:password@host/database")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Create a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.example.com", "example.com"),
    )

    articles := make([]Article, 0)

    // Extract article details
    c.OnHTML("div.article", func(e *colly.HTMLElement) {
        article := Article{}
        article.Title = e.ChildText("h2.title")
        article.Summary = e.ChildText("p.summary")

        publishedAtStr := e.ChildText("span.published-at")
        publishedAt, err := time.Parse("2006-01-02", publishedAtStr)
        if err == nil {
            article.PublishedAt = publishedAt
        }

        articles = append(articles, article)
    })

    // Scrape next page
    c.OnHTML("a.next", func(e *colly.HTMLElement) {
        nextURL := e.Request.AbsoluteURL(e.Attr("href"))
        c.Visit(nextURL)
    })

    // Insert articles into DB
    c.OnScraped(func(r *colly.Response) {
        for _, article := range articles {
            _, err := db.Exec(`
                INSERT INTO articles(title, summary, published_at)
                VALUES($1, $2, $3)`,
                article.Title,
                article.Summary,
                article.PublishedAt,
            )
            if err != nil {
                log.Println(err)
            }
        }

        // Reset articles slice
        articles = make([]Article, 0)
    })

    // Start scraping
    startURL := "https://www.example.com/articles?page=1"
    c.Visit(startURL)
}

This program introduces a few new concepts:

  1. We define a struct Article to represent the data we want to extract for each article.

  2. In the OnHTML callback for "div.article", we extract the title, summary, and published date for each article and append a new Article instance to the articles slice.

  3. We add a new OnHTML callback for "a.next" to handle pagination. Whenever a "next page" link is encountered, we extract its URL and visit it to continue scraping.

  4. We register an OnScraped callback to insert the scraped articles into the database after each page is processed. This ensures we save our progress incrementally in case of failures.

  5. Instead of hardcoding a single start URL, we set startURL to the first page of the article listing. The OnHTML("a.next") callback will take care of visiting subsequent pages.

This example demonstrates how Colly can be used to scrape data across multiple pages and store the results in a structured format.
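If the target site tolerates it, the same pagination pattern can also be run concurrently. The sketch below is a stripped-down, hypothetical variant of the scraper above: it enables Async, caps parallelism with a LimitRule, guards the shared slice with a mutex (callbacks may now run from multiple goroutines), and calls Wait() before exiting. The selectors and URLs are the same placeholders used earlier:

package main

import (
    "fmt"
    "log"
    "sync"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("www.example.com", "example.com"),
        // Process pages in parallel goroutines.
        colly.Async(true),
    )

    // Keep concurrency polite: at most two requests in flight.
    if err := c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2}); err != nil {
        log.Fatal(err)
    }

    var (
        mu     sync.Mutex
        titles []string
    )

    c.OnHTML("div.article h2.title", func(e *colly.HTMLElement) {
        // Callbacks can run concurrently, so guard shared state.
        mu.Lock()
        titles = append(titles, e.Text)
        mu.Unlock()
    })

    c.OnHTML("a.next", func(e *colly.HTMLElement) {
        c.Visit(e.Request.AbsoluteURL(e.Attr("href")))
    })

    c.Visit("https://www.example.com/articles?page=1")

    // Block until every queued request has completed.
    c.Wait()

    fmt.Println("Scraped", len(titles), "article titles")
}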

Note: The SQL insertion logic should ideally be refactored into a separate function or method for better code organization and reusability.
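As a rough sketch of that refactoring, the insertion loop could move into a small helper that drops straight into article_scraper.go (it reuses the Article struct and the articles table assumed above):

// insertArticles writes a batch of scraped articles to the database.
// It assumes the articles(title, summary, published_at) table used above.
func insertArticles(db *sql.DB, articles []Article) error {
    for _, a := range articles {
        _, err := db.Exec(`
            INSERT INTO articles(title, summary, published_at)
            VALUES($1, $2, $3)`,
            a.Title, a.Summary, a.PublishedAt,
        )
        if err != nil {
            return fmt.Errorf("inserting %q: %w", a.Title, err)
        }
    }
    return nil
}

The OnScraped callback then shrinks to a call to insertArticles(db, articles), a log statement on error, and the slice reset.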

Scraping Responsibly: Best Practices and Ethics

While web scraping is a powerful tool, it's important to use it responsibly and ethically. Here are some best practices to keep in mind:

  1. Respect robots.txt: Always check a site's robots.txt file and obey its rules. Colly can enforce this for you: set the collector's IgnoreRobotsTxt field to false and it will refuse to fetch disallowed URLs.

  2. Limit concurrency: Avoid making too many concurrent requests to a single server, which can look like a denial-of-service attack. Colly's Limit method takes a LimitRule whose Parallelism field caps the number of concurrent requests per domain.

  3. Introduce delays: Add a delay between requests to avoid overloading servers. You can do this with the Delay and RandomDelay fields of a LimitRule.

  4. Set a user agent: Use a descriptive user agent string that includes your contact information. This allows site owners to reach out if there's an issue. Set this with Colly's UserAgent option.

  5. Cache results: Store scraped data locally to avoid repeatedly hitting servers for the same information. Colly's CacheDir option caches responses on disk so repeated requests are served from the cache (the sketch after this list pulls several of these settings together).

  6. Don't scrape sensitive info: Avoid scraping personal data, copyrighted material, or other sensitive information without permission.
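Several of these practices map directly onto collector settings. Here is a minimal sketch that pulls them together; the user agent string, cache directory, and delay values are placeholders you would adapt to your own project:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        // A descriptive user agent with contact details (placeholder values).
        colly.UserAgent("goscrapers/1.0 (+https://github.com/yourusername/goscrapers)"),
        // Serve repeated requests for the same URL from an on-disk cache.
        colly.CacheDir("./colly_cache"),
    )

    // Honor robots.txt: requests for disallowed URLs will return an error.
    c.IgnoreRobotsTxt = false

    // One request at a time, with a fixed plus randomized delay between them.
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 1,
        Delay:       2 * time.Second,
        RandomDelay: 1 * time.Second,
    }); err != nil {
        log.Fatal(err)
    }

    if err := c.Visit("https://example.com/"); err != nil {
        fmt.Println("visit failed:", err)
    }
}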

By following these guidelines, you can ensure your scrapers are efficient, effective, and ethical.

Comparing Colly to Other Web Scraping Tools

While Colly is a great choice for web scraping in Go, it's certainly not the only option. Here are a few other popular tools and how they compare:

  • Scrapy (Python): Scrapy is a well-established scraping framework for Python. It has a larger ecosystem and more features than Colly, but can be overkill for simple projects. Choose Colly over Scrapy if you prefer Go's simplicity and performance.

  • Puppeteer (Node.js): Puppeteer is a Node.js library that provides a high-level API to control headless Chrome. It's useful for scraping JavaScript-heavy sites, but requires a full browser environment. Use Colly for simpler, more lightweight scraping that doesn't require JS rendering.

  • BeautifulSoup (Python): BeautifulSoup is a Python library for parsing HTML and XML documents. It's simple to use but lacks the broader features of a full scraping framework. Colly provides a similar selector-based API through its OnHTML callbacks, while also supporting the full scraping lifecycle (see the sketch after this list).
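For a taste of that similarity, here is a small sketch of BeautifulSoup-style element extraction with Colly's ChildText, ChildAttr, and ForEach helpers; the table markup and URL are hypothetical:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Walk a hypothetical listing table row by row.
    c.OnHTML("table#listing", func(e *colly.HTMLElement) {
        e.ForEach("tr", func(_ int, row *colly.HTMLElement) {
            name := row.ChildText("td.name")
            link := row.ChildAttr("a", "href")
            fmt.Printf("%s -> %s\n", name, row.Request.AbsoluteURL(link))
        })
    })

    c.Visit("https://example.com/listing")
}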

Ultimately, the best tool for the job depends on your specific requirements, preferences, and existing tech stack. However, for Gophers looking for a powerful, idiomatic scraping framework, Colly is hard to beat.

Conclusion

In this guide, we've explored the power and potential of web scraping with Go and Colly. We've covered the basics of installing Colly, writing scrapers to extract data from web pages, and storing that data for later analysis. We've also discussed best practices for responsible and ethical scraping.

As you've seen, Colly provides an expressive, idiomatic API for web scraping in Go. Its features around concurrency, caching, and configurability make it a strong choice for a wide range of scraping projects.

However, we've only scratched the surface of what's possible with Colly and web scraping in general. As you build your own scrapers, you may encounter challenges around authentication, captchas, dynamic content, and more. Don't be discouraged – the Colly community is active and helpful, and there are many resources available to help you level up your scraping skills.

So what are you waiting for? Go forth and scrape! With Colly in your toolkit, you've got the power to extract valuable data from the web and turn it into actionable insights. Happy scraping!
