
How to Extract Data from HTML Tables Using Python

Extracting data from HTML tables is a common task in web scraping. Tables are used to display structured data on many websites, from stock prices and sports statistics to product information and more. Being able to programmatically extract this tabular data unlocks many possibilities for data analysis, research, and building new applications.

In this post, we'll explore different ways to parse and extract data from HTML tables using Python. Whether you're new to web scraping or an experienced data wrangler, this guide will walk you through the process step by step with code examples. We'll cover:

  • Using BeautifulSoup to parse HTML tables
  • Extracting table data with pandas read_html
  • Scraping tabular data at scale with Scrapy
  • Tips for handling different table structures and layouts
  • Real-world examples and use cases

Let's dive in!

Parsing HTML Tables with BeautifulSoup

BeautifulSoup is a popular Python library for parsing HTML and XML documents. It provides a convenient way to extract data from web pages, including tables.

First, make sure you have BeautifulSoup installed:

pip install beautifulsoup4

Here's an example of how to use BeautifulSoup to extract data from a simple HTML table:

from bs4 import BeautifulSoup

html = """
<table>
  <thead>
    <tr>
      <th>Company</th>
      <th>Contact</th>
      <th>Country</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Alfreds Futterkiste</td>
      <td>Maria Anders</td>
      <td>Germany</td>
    </tr>
    <tr>
      <td>Centro comercial Moctezuma</td>
      <td>Francisco Chang</td>
      <td>Mexico</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the table
table = soup.find('table')

# Extract the header row
header_row = table.find('thead').find_all('th')
headers = [header.text.strip() for header in header_row]

# Extract the data rows
data_rows = table.find('tbody').find_all('tr')
data = []
for row in data_rows:
    cells = row.find_all('td')
    values = [cell.text.strip() for cell in cells]
    data.append(values)

print(headers)
print(data)

This will output:

['Company', 'Contact', 'Country']
[['Alfreds Futterkiste', 'Maria Anders', 'Germany'],
 ['Centro comercial Moctezuma', 'Francisco Chang', 'Mexico']]

The key steps are:

  1. Find the <table> element
  2. Extract the header row (the <th> elements within <thead>)
  3. Extract the data rows (the <tr> elements within <tbody>)
  4. For each row, extract the cell values (the <td> elements)
BeautifulSoup makes it easy to locate elements using methods like find() and find_all(). You can search by tag name, CSS class, id, and more.
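For instance, to pick out one specific table on a page that contains several, you can match on its attributes. A quick sketch (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<table id="prices" class="data-table">
  <tr><td>Widget</td><td>9.99</td></tr>
</table>
<table class="layout"><tr><td>nav</td></tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Locate the table by its id attribute
by_id = soup.find('table', id='prices')

# Locate it by CSS class (class_ avoids clashing with Python's class keyword)
by_class = soup.find('table', class_='data-table')

# Or use a full CSS selector
by_selector = soup.select_one('table#prices')

print(by_id['id'], by_class['class'])
```

All three calls return the same node from the parse tree, so you can use whichever style reads best in your code.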

However, BeautifulSoup can struggle with extracting tables that have inconsistent structures, rowspan/colspan attributes, or nested tables. In those cases, you may need to write more complex parsing logic.
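If you do need to handle merged cells, one approach is to expand each rowspan/colspan into a rectangular grid as you walk the rows. Here's a minimal sketch of that idea (it assumes spans are declared left to right, as they usually are):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td rowspan="2">A</td><td>B</td></tr>
  <tr><td>C</td></tr>
</table>
"""

def table_to_grid(table):
    """Expand rowspan/colspan so every row has the same number of cells."""
    grid = []
    pending = {}  # column index -> (rows_left, value) for open rowspans
    for tr in table.find_all('tr'):
        row = []
        col = 0
        cells = iter(tr.find_all(['td', 'th']))
        cell = next(cells, None)
        while cell is not None or col in pending:
            if col in pending:
                # This column is still covered by a rowspan from an earlier row
                rows_left, value = pending[col]
                row.append(value)
                if rows_left > 1:
                    pending[col] = (rows_left - 1, value)
                else:
                    del pending[col]
                col += 1
                continue
            value = cell.get_text(strip=True)
            rowspan = int(cell.get('rowspan', 1))
            colspan = int(cell.get('colspan', 1))
            for _ in range(colspan):
                row.append(value)
                if rowspan > 1:
                    pending[col] = (rowspan - 1, value)
                col += 1
            cell = next(cells, None)
        grid.append(row)
    return grid

grid = table_to_grid(BeautifulSoup(html, 'html.parser').find('table'))
print(grid)
```

With the merged "A" cell expanded, both rows come out the same width, which makes the grid easy to load into a DataFrame or CSV afterwards.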

Using Pandas to Extract Tables

If you're working with data in Python, chances are you're already using the pandas library. Pandas has a convenient function called read_html() that can extract tables from HTML into a list of DataFrame objects.

Here's an example:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'

tables = pd.read_html(url)

print(len(tables))
print(tables[0].head())

Output:

3
                      Company  Revenue (by US$ million)[1]  ...   Headquarters Location   Founder(s)
0                     Walmart                       572754  ...  Bentonville, Arkansas,   Sam Walton
1   State Grid Corporation of                       387056  ...       Haidian, Beijing,  State-owned
2            Amazon.com, Inc.                       469822  ...    Seattle, Washington,   Jeff Bezos
3  PetroChina Company Limited                       442384  ...         Beijing, China,  State-owned
4                     Sinopec                       430803  ...         Beijing, China,  State-owned
Pandas read_html() automatically detects tables in the HTML and parses them into separate DataFrames. This is incredibly handy if the data you need is already in a well-structured <table> element.

However, read_html() has some limitations:

• It may fail on poorly structured tables
• It doesn't handle nested tables well
• You have less control over the parsing process compared to BeautifulSoup
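That said, read_html() does give you a few useful knobs. Here's a sketch using a small inline table (the markup is made up for illustration; match= filters tables by a regex, and thousands= parses comma-separated numbers):

```python
from io import StringIO
import pandas as pd

# A small made-up table; in practice this would come from a scraped page
html = """
<table class="data">
  <tr><th>Company</th><th>Revenue</th></tr>
  <tr><td>Alpha Corp</td><td>1,200</td></tr>
  <tr><td>Beta Ltd</td><td>950</td></tr>
</table>
"""

# match= keeps only tables whose text matches the regex;
# thousands=',' turns "1,200" into the number 1200.
# Recent pandas versions expect a file-like object, hence StringIO.
tables = pd.read_html(StringIO(html), match='Revenue', thousands=',')
df = tables[0]
print(df)
```

Because the header row uses <th> cells, pandas picks up the column names automatically, and the Revenue column comes back as a numeric dtype rather than strings.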

Scraping Tables at Scale with Scrapy

For scraping tabular data at scale, nothing beats the Scrapy framework. Scrapy is a powerful tool for building web crawlers and extracting structured data from websites.

Here's how you might use Scrapy to extract table data:

import scrapy

class CompanySpider(scrapy.Spider):
    name = 'companies'
    start_urls = ['https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue']

    def parse(self, response):
        table = response.css('table.wikitable')[0]
        rows = table.css('tr')

        headers = [th.css('::text').get('').strip() for th in rows[0].css('th')]

        for row in rows[1:]:
            # Note: cell.css('td::text') would look for a <td> *inside* the
            # cell and match nothing; '::text' selects the cell's own text
            values = [cell.css('::text').get('').strip() for cell in row.css('td')]
            yield dict(zip(headers, values))
    

This spider locates the table using CSS selectors, extracts the header row, and then yields a dictionary for each data row mapping the headers to cell values.

You could expand this spider to crawl multiple pages, handle pagination, and save the extracted data to a database or file.

Tips for Parsing HTML Tables

Here are a few more tips to keep in mind when parsing HTML tables with Python:

• Check for inconsistencies in the table structure (missing cells, extra rows, etc.)
• Watch out for cells that span multiple rows/columns
• Be aware of nested tables
• Use CSS classes, ids, and other attributes to locate specific tables
• Clean and standardize the extracted data (convert to appropriate data types, handle missing values, etc.)
• Consider a browser automation tool like Selenium or Playwright if you need to interact with dynamic, JavaScript-rendered tables
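The cleaning step is worth a concrete example. Here's a small, dependency-free sketch (the set of missing-value markers is an assumption; extend it for your data):

```python
def clean_cell(value):
    """Normalize one scraped cell: strip whitespace, map common
    missing-value markers to None, and convert numeric-looking
    strings to int or float."""
    value = value.strip().replace('\xa0', ' ')  # &nbsp; shows up a lot in tables
    if value in ('', '-', 'N/A', 'n/a'):        # assumed missing-value markers
        return None
    numeric = value.replace(',', '')            # "572,754" -> "572754"
    try:
        return int(numeric)
    except ValueError:
        pass
    try:
        return float(numeric)
    except ValueError:
        return value                            # leave non-numeric text as-is

rows = [[' Walmart ', '572,754'], ['Sinopec', 'N/A']]
cleaned = [[clean_cell(c) for c in row] for row in rows]
print(cleaned)
```

Running a pass like this right after extraction means every downstream consumer sees real numbers and None instead of a zoo of string variants.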

Data Extraction Use Cases

Extracting data from HTML tables powers many real-world applications, such as:

• Price monitoring and comparison
• Financial data aggregation
• Sports statistics and analysis
• Job listings and company information scraping
• Research and data journalism

For example, you could build a tool that tracks prices across multiple e-commerce sites by scraping data from product listings pages. Or you could aggregate financial data from stock exchanges and company filings to power an investment research platform.

Conclusion

Parsing and extracting data from HTML tables is a key skill for anyone working with web data in Python. BeautifulSoup and pandas provide easy ways to extract tabular data, while Scrapy is ideal for large-scale table scraping.

The techniques covered in this guide provide a starting point for tackling table extraction tasks. There are many other libraries and tools in the Python ecosystem for web scraping and data extraction. As you work on more projects, you'll develop your own tips and best practices.

Remember: always be respectful of website owners and follow web scraping best practices. Don't overwhelm sites with requests, and comply with robots.txt rules and terms of service.

Happy scraping!
