Extract Text from HTML the Easy Way with Parsel in Python

Web scraping is an invaluable technique for extracting data from websites when an API is not available. The Parsel library provides an intuitive way to parse and extract text from HTML using Python. In this tutorial, you'll learn how to use Parsel's powerful selectors to easily scrape the content you need.

The Parsel library, developed as part of the Scrapy project but perfectly usable on its own, allows you to construct CSS and XPath selectors to precisely extract elements and text from HTML markup. Compared with heavier tools like BeautifulSoup and Selenium, Parsel keeps things simple and lightweight, which is perfect for many web scraping tasks.

To get started, install Parsel and the requests library in your Python environment:

pip install parsel requests

Parsel relies on an HTML document loaded as a string. An easy way to fetch the HTML is using requests:

import requests

url = 'https://quotes.toscrape.com/'
html = requests.get(url).text
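
In a real script it's worth failing fast on HTTP errors instead of parsing an error page. Here's a minimal sketch using requests' built-in status check (the 10-second timeout is an arbitrary choice):

import requests

url = 'https://quotes.toscrape.com/'
response = requests.get(url, timeout=10)  # avoid hanging forever on a slow server
response.raise_for_status()               # raise an HTTPError on 4xx/5xx responses
html = response.text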

With the HTML retrieved, create a Selector object by passing it to the parsel.Selector constructor:

import parsel

selector = parsel.Selector(text=html)
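
Although this tutorial focuses on HTML, the same constructor handles XML too if you pass type='xml'. A quick sketch with a made-up feed snippet:

xml = '<rss><channel><title>Example Feed</title></channel></rss>'
xml_selector = parsel.Selector(text=xml, type='xml')
print(xml_selector.xpath('//title/text()').get())  # Example Feed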

The Selector allows you to run CSS and XPath queries to find the desired elements. For example, to extract the text from the title tag:

title = selector.css('title::text').get()
print(title)

This prints: Quotes to Scrape

The ::text pseudo-element selects just the inner text node. Parsel's .get() method returns the first result as a string, or None if nothing matches. To get all matching results as a list, use .getall() instead.
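
For example, to grab the text of every quote on the page in one call, and to supply a fallback when a selector matches nothing:

# all quote texts on the page, as a list of strings
texts = selector.css('.quote .text::text').getall()
print(len(texts))  # the demo site shows 10 quotes per page

# .get() returns None when nothing matches, unless you pass a default
missing = selector.css('.no-such-class::text').get(default='not found')
print(missing)  # not found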

XPath selectors offer additional power and flexibility. For example, to select all quotes from Albert Einstein on the page:

quotes = selector.xpath(
    '//div[@class="quote"]'
    '[.//small[@class="author" and contains(text(), "Einstein")]]'
    '//span[@class="text"]/text()'
).getall()

print(quotes)

This matches div elements with the "quote" class that contain a descendant small tag with class "author" whose text contains "Einstein". It then selects the text from each matching quote's span with class "text".
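
If the stacked predicates feel dense, an alternative (just a sketch, not the only way) is to iterate the quote divs and filter in plain Python using relative XPath queries. Note the leading "." in each inner query, which scopes it to the current div:

einstein_quotes = []
for quote in selector.xpath('//div[@class="quote"]'):
    # './/' searches only within this quote div
    author = quote.xpath('.//small[@class="author"]/text()').get()
    if author and 'Einstein' in author:
        einstein_quotes.append(quote.xpath('.//span[@class="text"]/text()').get())

print(einstein_quotes)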

Parsel also makes it easy to remove elements that are no longer needed using .remove() (recent Parsel releases deprecate it in favor of the equivalent .drop()):

selector.css('footer').remove()
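
The removal mutates the parsed document, so subsequent queries no longer see the element:

print(selector.css('footer').get())  # None: the footer is gone from the tree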

Putting it all together, here's a complete script that extracts all the quotes on the page and saves them to a JSON file:

import requests
import parsel
import json

url = 'https://quotes.toscrape.com/'
html = requests.get(url).text

selector = parsel.Selector(text=html)

# build a list of dicts, one per quote block on the page
quotes = []
for quote in selector.css('.quote'):
    quotes.append({
        'text': quote.css('.text::text').get(),
        'author': quote.css('.author::text').get(),
        'tags': quote.css('.tag::text').getall(),
    })

with open('quotes.json', 'w') as f:
    json.dump(quotes, f)
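
To scrape every page rather than just the first, you can follow the site's "Next" link until it disappears. Here's a sketch that assumes the li.next pager markup that quotes.toscrape.com uses:

import requests
import parsel

base = 'https://quotes.toscrape.com'
url = base + '/'
quotes = []

while url:
    selector = parsel.Selector(text=requests.get(url).text)
    for quote in selector.css('.quote'):
        quotes.append({
            'text': quote.css('.text::text').get(),
            'author': quote.css('.author::text').get(),
            'tags': quote.css('.tag::text').getall(),
        })
    # the pager renders <li class="next"><a href="/page/2/">; it is absent on the last page
    next_href = selector.css('li.next a::attr(href)').get()
    url = base + next_href if next_href else None

print(len(quotes))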

With just a few lines of code, Parsel makes it a breeze to extract structured data from web pages. Its simple API belies its power and expressiveness. While a full framework like Scrapy or a richer parser like BeautifulSoup has its place, Parsel hits the sweet spot for many scraping tasks.

In summary, Parsel is a lightweight but capable library for parsing HTML and XML with Python. It leverages familiar CSS and XPath selectors to surgically extract the data you need with minimal overhead. Give Parsel a try the next time you need to scrape some web content!
