Web scraping is a powerful technique for extracting data from websites. By writing scripts to automatically retrieve web pages and parse the content, we can collect large amounts of data for analysis, archiving, tracking changes over time, or other purposes.
Google Sheets is a convenient place to store scraped data, allowing us to easily view, share, and manipulate the data using spreadsheet formulas and tools. In this tutorial, we'll walk through how to use JavaScript and Google Apps Script to write a web scraper that extracts data directly into a Google Sheets spreadsheet.
Setting up the Spreadsheet
Before writing the scraping script, let's set up a new Google Sheets spreadsheet to store the extracted data. Create a new blank spreadsheet and give it a name related to the website you'll be scraping.
Create column headers for the data fields you plan to extract. For example, if scraping blog posts, you might have columns for Title, URL, Author, Date, etc.
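If you'd rather create the header row from code once the script project (set up in the next section) exists, a small sketch like this one works; the column names are just the example fields above, so adjust them to your own data:

// Optional: write the header row from code instead of typing it in.
// The column names below are only the example fields mentioned above.
function setUpHeaders() {
  const sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(1, 1, 1, 4).setValues([['Title', 'URL', 'Author', 'Date']]);
}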
Writing the Scraping Script
Google Sheets allows you to write custom JavaScript (well, technically Google Apps Script) that can interact with the active spreadsheet. We'll use this capability to write our scraping code.
From the Extensions menu, select Apps Script (in older versions of Sheets this was under Tools > Script editor). This will open a new tab with a blank Google Apps Script project.
Retrieving Page Content
To retrieve the web page's HTML content, we'll use the built-in UrlFetchApp.fetch() method. It returns an HTTP response object, and calling getContentText() on that response gives us the page's HTML as a string.
const url = 'https://example.com/blog';
const response = UrlFetchApp.fetch(url);
const html = response.getContentText();
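If a request might fail, a slightly more defensive variant is worth sketching: passing the muteHttpExceptions option keeps fetch() from throwing on error responses, so we can check the status code ourselves. This is optional and not required for the rest of the tutorial:

// muteHttpExceptions stops fetch() from throwing on 4xx/5xx responses,
// so we can inspect the status code and decide what to do ourselves.
const response = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
if (response.getResponseCode() !== 200) {
  throw new Error('Fetch failed with status ' + response.getResponseCode());
}
const html = response.getContentText();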
Parsing and Extracting Data
Now that we have the raw HTML content, we need to parse it to extract the desired data. There are a few different approaches, depending on the structure of the page:
Regular expressions: For simple, predictable pages, you may be able to use regular expressions to match and extract pieces of content. Be cautious though, as a minor change to the page structure can break brittle regexes; a short regex sketch follows this list.
CSS selectors: In a browser, DOM methods like querySelector() and querySelectorAll() make selector-based extraction straightforward, but Apps Script runs on Google's servers without a DOM, so they aren't available on a fetched HTML string. You can get the same selector-based workflow by adding an HTML-parsing library such as Cheerio for Apps Script (cheeriogs) to the project; the examples below assume that library has been added and is available under the identifier Cheerio.
XML parsing: Apps Script has built-in XML parsing support via XmlService. By parsing the page as XML, we can retrieve elements by tag name and navigate the document tree. Note that XmlService.parse() expects well-formed markup, so this only works reliably on pages that are valid XHTML.
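As a quick illustration of the regular-expression approach, here is a sketch that pulls every link URL out of anchors carrying a hypothetical post-link class. The class name and attribute order are assumptions about the page's markup, so the pattern would need adjusting for a real site:

// Match <a ... class="post-link" ... href="..."> and capture the URL.
// Assumes the class attribute appears before href; real markup may differ.
const linkRegex = /<a[^>]*class="post-link"[^>]*href="([^"]*)"/g;
const urls = [];
let match;
while ((match = linkRegex.exec(html)) !== null) {
  urls.push(match[1]);
}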
Here's an example using Cheerio's jQuery-style selectors to get all of the blog post "preview card" elements (this assumes the Cheerio for Apps Script library has been added to the project):
const $ = Cheerio.load(html);
const postCards = $('.post-card');
We can then loop through the cards, extracting the title text and link URL from each one:
const data = [];
postCards.each((i, card) => {
  const title = $(card).find('.post-title').text().trim();
  const link = $(card).find('.post-link').attr('href');
  data.push([title, link]);
});
Storing Data in the Spreadsheet
Finally, let's output all the collected data to the spreadsheet, appending after the last existing row:
const sheet = SpreadsheetApp.getActiveSheet();
const lastRow = sheet.getLastRow();
sheet.getRange(lastRow + 1, 1, data.length, data[0].length).setValues(data);
Complete Example
Here's the full code for a blog post title and URL scraper (again assuming the Cheerio for Apps Script library has been added to the project):
function scrape() {
  const url = 'https://example.com/blog';
  const response = UrlFetchApp.fetch(url);
  const html = response.getContentText();

  // Parse the HTML with the Cheerio for Apps Script library and pull the
  // title text and link URL out of each post preview card.
  const $ = Cheerio.load(html);
  const data = [];
  $('.post-card').each((i, card) => {
    const title = $(card).find('.post-title').text().trim();
    const link = $(card).find('.post-link').attr('href');
    data.push([title, link]);
  });
  if (data.length === 0) return; // nothing matched; avoid writing an empty range

  // Append the results below the header row and any existing data.
  const sheet = SpreadsheetApp.getActiveSheet();
  const lastRow = sheet.getLastRow();
  sheet.getRange(lastRow + 1, 1, data.length, data[0].length).setValues(data);
}
To run it, select the scrape function from the dropdown in the editor toolbar and click Run. On the first run you'll be asked to authorize the script to access external URLs and modify your spreadsheets.
Tips and Considerations
Respect robots.txt: Many websites publish a robots.txt file specifying which paths crawlers and scrapers may access. To be a considerate web citizen, check robots.txt before scraping a site and follow its restrictions. Parsing libraries exist for this in most ecosystems, but for a small Apps Script project you can simply fetch the file with UrlFetchApp and review its Disallow rules for the paths you plan to request.
Limit request rate: Sending requests too rapidly can overwhelm web servers. Add a delay between requests or between batches of requests to avoid negatively impacting the site.
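In Apps Script, the simplest way to add a delay is Utilities.sleep(). Here is a rough sketch, assuming a hypothetical pageUrls array of pages to visit; the one-second delay is an arbitrary example value:

// Pause between requests to avoid hammering the server.
// The delay is illustrative; tune it to the site and your quota limits.
for (const pageUrl of pageUrls) {
  const pageHtml = UrlFetchApp.fetch(pageUrl).getContentText();
  // ... parse and collect data here ...
  Utilities.sleep(1000); // milliseconds
}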
Cache results: Avoid scraping the same page multiple times. Cache the HTML response locally (Apps Script has a Cache service) and reuse it for subsequent runs instead of repeatedly fetching the same unmodified content.
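Here is a minimal caching sketch using the script cache. Cached values have size and expiration limits, so very large pages may not fit and would need a different storage approach:

// Check the script cache before fetching; store the HTML for later runs.
// CacheService values have size and lifetime limits, so very large pages
// may need to be trimmed or stored elsewhere instead.
function fetchWithCache(url) {
  const cache = CacheService.getScriptCache();
  const cached = cache.get(url);
  if (cached !== null) {
    return cached;
  }
  const html = UrlFetchApp.fetch(url).getContentText();
  cache.put(url, html, 21600); // keep for 6 hours (the maximum)
  return html;
}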
Google Sheets limits: Google Sheets is not designed as a large-scale database. It has a limit of 10 million cells. For collections larger than this, consider instead structuring your scraper to save the data to a dedicated database or data warehouse like MySQL or BigQuery. You can still connect this to Google Sheets for analysis and visualization.
Conclusion
Google Sheets and Google Apps Script provide a quick and easy way to store and view data extracted by web scrapers. The built-in JavaScript environment allows writing short scraping scripts that pull data directly into a spreadsheet.
Hopefully this tutorial has given you the knowledge needed to start collecting your own datasets from across the web. Just remember to always scrape responsibly, respect site owners, and consider appropriate data storage for larger scraping jobs. Happy scraping!