
How to Extract Data from HTML Tables Using Python

Extracting data from HTML tables is a common task in web scraping. Tables are used to display structured data on many websites, from stock prices and sports statistics to product information and more. Being able to programmatically extract this tabular data unlocks many possibilities for data analysis, research, and building new applications.

In this post, we'll explore different ways to parse and extract data from HTML tables using Python. Whether you're new to web scraping or an experienced data wrangler, this guide will walk you through the process step by step with code examples. We'll cover:

  • Using BeautifulSoup to parse HTML tables
  • Extracting table data with pandas read_html
  • Scraping tabular data at scale with Scrapy
  • Tips for handling different table structures and layouts
  • Real-world examples and use cases

Let's dive in!

Parsing HTML Tables with BeautifulSoup

BeautifulSoup is a popular Python library for parsing HTML and XML documents. It provides a convenient way to extract data from web pages, including tables.

First, make sure you have BeautifulSoup installed:

pip install beautifulsoup4

Here's an example of how to use BeautifulSoup to extract data from a simple HTML table:

from bs4 import BeautifulSoup

html = """
<table>
  <thead>
    <tr>
      <th>Company</th>
      <th>Contact</th>
      <th>Country</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Alfreds Futterkiste</td>
      <td>Maria Anders</td>
      <td>Germany</td>
    </tr>
    <tr>
      <td>Centro comercial Moctezuma</td>
      <td>Francisco Chang</td>
      <td>Mexico</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the table
table = soup.find('table')

# Extract the header row
header_row = table.find('thead').find_all('th')
headers = [header.text.strip() for header in header_row]

# Extract the data rows
data_rows = table.find('tbody').find_all('tr')
data = []
for row in data_rows:
    cells = row.find_all('td')
    values = [cell.text.strip() for cell in cells]
    data.append(values)

print(headers)
print(data)

This will output:

['Company', 'Contact', 'Country']
[['Alfreds Futterkiste', 'Maria Anders', 'Germany'],
 ['Centro comercial Moctezuma', 'Francisco Chang', 'Mexico']]

The key steps are:

  1. Find the <table> element
  2. Extract the header row (the <th> elements within <thead>)
  3. Extract the data rows (the <tr> elements within <tbody>)
  4. For each row, extract the cell values (the <td> elements)
BeautifulSoup makes it easy to locate elements using methods like find() and find_all(). You can search by tag name, CSS class, id, and more.
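For instance, to pick out one specific table on a page that contains several, you can match on its attributes. A quick sketch (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<table id="prices" class="data-table">
  <tr><td>Widget</td><td>9.99</td></tr>
</table>
<table class="layout"><tr><td>nav</td></tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Locate the table by its id attribute
by_id = soup.find('table', id='prices')

# Locate it by CSS class (class_ avoids clashing with Python's class keyword)
by_class = soup.find('table', class_='data-table')

# Or use a full CSS selector
by_selector = soup.select_one('table#prices')

print(by_id['id'], by_class['class'])
```

All three calls return the same node from the parse tree, so you can use whichever style reads best in your code.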

However, BeautifulSoup can struggle with extracting tables that have inconsistent structures, rowspan/colspan attributes, or nested tables. In those cases, you may need to write more complex parsing logic.
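If you do need to handle merged cells, one approach is to expand each rowspan/colspan into a rectangular grid as you walk the rows. Here's a minimal sketch of that idea (it assumes spans are declared left to right, as they usually are):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td rowspan="2">A</td><td>B</td></tr>
  <tr><td>C</td></tr>
</table>
"""

def table_to_grid(table):
    """Expand rowspan/colspan so every row has the same number of cells."""
    grid = []
    pending = {}  # column index -> (rows_left, value) for open rowspans
    for tr in table.find_all('tr'):
        row = []
        col = 0
        cells = iter(tr.find_all(['td', 'th']))
        cell = next(cells, None)
        while cell is not None or col in pending:
            if col in pending:
                # This column is still covered by a rowspan from an earlier row
                rows_left, value = pending[col]
                row.append(value)
                if rows_left > 1:
                    pending[col] = (rows_left - 1, value)
                else:
                    del pending[col]
                col += 1
                continue
            value = cell.get_text(strip=True)
            rowspan = int(cell.get('rowspan', 1))
            colspan = int(cell.get('colspan', 1))
            for _ in range(colspan):
                row.append(value)
                if rowspan > 1:
                    pending[col] = (rowspan - 1, value)
                col += 1
            cell = next(cells, None)
        grid.append(row)
    return grid

grid = table_to_grid(BeautifulSoup(html, 'html.parser').find('table'))
print(grid)
```

With the merged "A" cell expanded, both rows come out the same width, which makes the grid easy to load into a DataFrame or CSV afterwards.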

Using Pandas to Extract Tables

If you're working with data in Python, chances are you're already using the pandas library. Pandas has a convenient function called read_html() that can extract tables from HTML into a list of DataFrame objects.

Here's an example:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'

tables = pd.read_html(url)

print(len(tables))
print(tables[0].head())

Output:

3
                      Company  Revenue (by US$ million)[1]  ...   Headquarters Location   Founder(s)
0                     Walmart                       572754  ...  Bentonville, Arkansas,   Sam Walton
1   State Grid Corporation of                       387056  ...       Haidian, Beijing,  State-owned
2            Amazon.com, Inc.                       469822  ...    Seattle, Washington,   Jeff Bezos
3  PetroChina Company Limited                       442384  ...         Beijing, China,  State-owned
4                     Sinopec                       430803  ...         Beijing, China,  State-owned
Pandas read_html() automatically detects tables in the HTML and parses them into separate DataFrames. This is incredibly handy if the data you need is already in a well-structured <table> element.

However, read_html() has some limitations:

• It may fail on poorly structured tables
• It doesn't handle nested tables well
• You have less control over the parsing process compared to BeautifulSoup
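That said, read_html() does give you a few useful knobs. Here's a sketch using a small inline table (the markup is made up for illustration; match= filters tables by a regex, and thousands= parses comma-separated numbers):

```python
from io import StringIO
import pandas as pd

# A small made-up table; in practice this would come from a scraped page
html = """
<table class="data">
  <tr><th>Company</th><th>Revenue</th></tr>
  <tr><td>Alpha Corp</td><td>1,200</td></tr>
  <tr><td>Beta Ltd</td><td>950</td></tr>
</table>
"""

# match= keeps only tables whose text matches the regex;
# thousands=',' turns "1,200" into the number 1200.
# Recent pandas versions expect a file-like object, hence StringIO.
tables = pd.read_html(StringIO(html), match='Revenue', thousands=',')
df = tables[0]
print(df)
```

Because the header row uses <th> cells, pandas picks up the column names automatically, and the Revenue column comes back as a numeric dtype rather than strings.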

Scraping Tables at Scale with Scrapy

For scraping tabular data at scale, nothing beats the Scrapy framework. Scrapy is a powerful tool for building web crawlers and extracting structured data from websites.

Here's how you might use Scrapy to extract table data:

import scrapy

class CompanySpider(scrapy.Spider):
    name = 'companies'
    start_urls = ['https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue']

    def parse(self, response):
        table = response.css('table.wikitable')[0]
        rows = table.css('tr')

        headers = [th.css('::text').get('').strip() for th in rows[0].css('th')]

        for row in rows[1:]:
            # Note: cell.css('td::text') would look for a <td> *inside* the
            # cell and match nothing; '::text' selects the cell's own text
            values = [cell.css('::text').get('').strip() for cell in row.css('td')]
            yield dict(zip(headers, values))
    

This spider locates the table using CSS selectors, extracts the header row, and then yields a dictionary for each data row mapping the headers to cell values.

You could expand this spider to crawl multiple pages, handle pagination, and save the extracted data to a database or file.

Tips for Parsing HTML Tables

Here are a few more tips to keep in mind when parsing HTML tables with Python:

• Check for inconsistencies in the table structure (missing cells, extra rows, etc.)
• Watch out for cells that span multiple rows/columns
• Be aware of nested tables
• Use CSS classes, ids, and other attributes to locate specific tables
• Clean and standardize the extracted data (convert to appropriate data types, handle missing values, etc.)
• Consider a browser automation tool like Selenium or Playwright if you need to interact with dynamic, JavaScript-rendered tables
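The cleaning step is worth a concrete example. Here's a small, dependency-free sketch (the set of missing-value markers is an assumption; extend it for your data):

```python
def clean_cell(value):
    """Normalize one scraped cell: strip whitespace, map common
    missing-value markers to None, and convert numeric-looking
    strings to int or float."""
    value = value.strip().replace('\xa0', ' ')  # &nbsp; shows up a lot in tables
    if value in ('', '-', 'N/A', 'n/a'):        # assumed missing-value markers
        return None
    numeric = value.replace(',', '')            # "572,754" -> "572754"
    try:
        return int(numeric)
    except ValueError:
        pass
    try:
        return float(numeric)
    except ValueError:
        return value                            # leave non-numeric text as-is

rows = [[' Walmart ', '572,754'], ['Sinopec', 'N/A']]
cleaned = [[clean_cell(c) for c in row] for row in rows]
print(cleaned)
```

Running a pass like this right after extraction means every downstream consumer sees real numbers and None instead of a zoo of string variants.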

Data Extraction Use Cases

Extracting data from HTML tables powers many real-world applications, such as:

• Price monitoring and comparison
• Financial data aggregation
• Sports statistics and analysis
• Job listings and company information scraping
• Research and data journalism

For example, you could build a tool that tracks prices across multiple e-commerce sites by scraping data from product listings pages. Or you could aggregate financial data from stock exchanges and company filings to power an investment research platform.

Conclusion

Parsing and extracting data from HTML tables is a key skill for anyone working with web data in Python. BeautifulSoup and pandas provide easy ways to extract tabular data, while Scrapy is ideal for large-scale table scraping.

The techniques covered in this guide provide a starting point for tackling table extraction tasks. There are many other libraries and tools in the Python ecosystem for web scraping and data extraction. As you work on more projects, you'll develop your own tips and best practices.

Remember: always be respectful of website owners and follow web scraping best practices. Don't overwhelm sites with requests, and comply with robots.txt rules and terms of service.

Happy scraping!
