Web Scraping Booking.com with Python: The Ultimate Guide for 2023

Booking.com is one of the world's leading travel websites, offering a vast selection of accommodations from hotels to apartments to vacation rentals. With over 28 million listings across 226 countries, Booking.com is a treasure trove of valuable travel and hospitality data.

Web scraping allows us to programmatically extract this data from Booking.com and use it for market research, competitor analysis, pricing optimization, and more. In this comprehensive guide, we'll break down exactly how to build a robust web scraper for Booking.com using Python and Selenium.

But first, a word of caution – while web scraping itself is not illegal, how you use the data you scrape may be. It's important to respect Booking.com's terms of service, which prohibit scraping for commercial purposes. Scrape ethically and use the data responsibly.

Setting Up Your Scraping Environment

To get started, you'll need:

  • Python 3.x installed
  • A Python IDE or text editor
  • Chrome or Firefox web browser

We'll be using a few key Python libraries:

  • selenium for browser automation and extracting data
  • webdriver-manager to automatically download the appropriate webdriver
  • pandas for storing and analyzing the scraped data

You can install them with pip:


pip install selenium webdriver-manager pandas

Analyzing Booking.com's Structure

To scrape effectively, we first need to understand the layout of the data we want to extract. Let's analyze the structure of a typical Booking.com search results page.

The key data points we want to extract for each property are:

  • Name
  • URL
  • Price
  • Review score
  • Number of reviews
  • Location
  • Thumbnail image

Using your browser's developer tools, you can inspect each element to determine the selectors needed to locate this data. Booking.com makes this relatively easy by using semantic class names and data attributes.

For example, we can see that each property card is wrapped in a div with the attribute data-testid="property-card". Within that, the name and URL are located in an a tag with data-testid="title-link".

Extracting Property Details

Now that we've identified the structure, we can start writing our Python script to extract the data.

First, we'll import our dependencies and set up Selenium with Chrome:


from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Next, we'll define a function to extract the details for each property card:


def extract_property_details(property_card):
    name = property_card.find_element(By.CSS_SELECTOR, 'div[data-testid="title"]').text
    url = property_card.find_element(By.CSS_SELECTOR, 'a[data-testid="title-link"]').get_attribute('href')
    price = property_card.find_element(By.CSS_SELECTOR, '[data-testid="price-and-discounted-price"]').text
    score, _, reviews = property_card.find_element(By.CSS_SELECTOR, '[data-testid="review-score"]').text.split('\n')
    location = property_card.find_element(By.CSS_SELECTOR, '[data-testid="address"]').text
    image = property_card.find_element(By.CSS_SELECTOR, '[data-testid="image"]').get_attribute('src')

    return {
        'name': name,
        'url': url,
        'price': price,
        'score': score,
        'reviews': reviews,
        'location': location,
        'image': image
    }

This locates each element using CSS selectors based on the data-testid attributes, extracts the relevant data as text or attributes, and returns it as a dictionary.

To use it, we‘ll have our script navigate to a search results URL, locate all the property cards on the page, and apply our extract_property_details function to each one:


url = "https://www.booking.com/searchresults.html?ss=new+york+city"
driver.get(url)

property_cards = driver.find_elements(By.CSS_SELECTOR, 'div[data-testid="property-card"]')

results = []
for card in property_cards:
    details = extract_property_details(card)
    results.append(details)

We store the extracted dictionaries in a results list.

Handling Pagination

Search results on Booking.com are paginated, with a "Next" button to load the next page of properties. To scrape multiple pages, we need to locate this pager, extract the total number of pages, and loop through each page while clicking "Next".


num_pages = int(driver.find_element(By.CSS_SELECTOR, '[data-testid="pagination"] li:last-child').text)

for page in range(num_pages):
    property_cards = driver.find_elements(By.CSS_SELECTOR, 'div[data-testid="property-card"]')
    for card in property_cards:
        details = extract_property_details(card)
        results.append(details)

    if page < num_pages - 1:
        next_button = driver.find_element(By.CSS_SELECTOR, 'button[aria-label="Next page"]')
        next_button.click()

This clicks through each page, scrapes the properties, and appends the results to our master list.

An extra challenge here is that Booking.com loads new results dynamically with JavaScript when paging. To ensure the new page has loaded before scraping, we can use Selenium's explicit wait functionality to pause execution until an element from the next page is located:


from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)

next_button = driver.find_element(By.CSS_SELECTOR, 'button[aria-label="Next page"]')
next_button.click()

wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '[data-testid="property-card"]')))

This will wait up to 10 seconds for a property card to be present on the next page before proceeding.

Storing and Analyzing the Data

Once we've extracted data from all the pages, we can store it in a structured format for further analysis. Pandas allows us to convert our list of dictionaries into a DataFrame and write it out to a CSV file:


df = pd.DataFrame(results)
df.to_csv('booking_results.csv', index=False)

With the data in a DataFrame, we can easily analyze it – finding the most expensive properties, the highest rated, the most popular locations, and so on. One catch: price and score come through as text, so convert them to numbers before ranking:


df['price'] = pd.to_numeric(df['price'].str.replace(r'[^\d.]', '', regex=True), errors='coerce')
df['score'] = pd.to_numeric(df['score'], errors='coerce')

print(df.nlargest(5, 'price'))
print(df.nlargest(5, 'score'))
print(df['location'].value_counts())

Scaling and Robustness

This basic scraper works well for extracting data from Booking.com, but there are some challenges to consider when scaling it up for production use.

Booking.com may block your IP if you make too many requests too quickly. To avoid this, you can:

  • Throttle your request rate by adding delays between requests
  • Use a pool of proxy IPs to distribute requests
  • Randomize your user agent string to avoid browser fingerprinting
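The first and third ideas can be wired in with a couple of small helpers. A minimal sketch, where the user agent strings and delay bounds are illustrative values you would tune yourself, not anything Booking.com prescribes:

```python
import random
import time

# Illustrative pool of user agent strings; swap in whatever set you maintain.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_user_agent():
    """Pick a user agent at random for each new browser session."""
    return random.choice(USER_AGENTS)

def polite_delay(min_s=2.0, max_s=6.0):
    """Sleep for a random interval between page loads to throttle requests."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Applying the user agent when starting Chrome would look like:
# options = webdriver.ChromeOptions()
# options.add_argument(f"user-agent={random_user_agent()}")
# driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
#                           options=options)
```

Call polite_delay() between page navigations; randomized intervals look less robotic than a fixed sleep.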

Selenium scrapers can also be brittle when page layouts or selectors change. It's good practice to build in error handling and alerts to notify you if the scraper hits any issues.
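One lightweight pattern for this is wrapping each extraction step so a missing element yields a placeholder instead of crashing the run. A minimal sketch, where safe_call is a hypothetical helper (not part of Selenium):

```python
def safe_call(fn, default=None):
    """Run one extraction step; return a default instead of raising,
    so a single missing element doesn't abort the whole scrape."""
    try:
        return fn()
    except Exception as exc:
        # In production you might log this and trigger an alert instead.
        print(f"extraction step failed: {exc}")
        return default
```

Inside extract_property_details you could then write, for example, price = safe_call(lambda: property_card.find_element(By.CSS_SELECTOR, '[data-testid="price-and-discounted-price"]').text, default='').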

For improved performance when scraping many pages, you can use parallel processing to run multiple browser instances at once, or consider a headless browser like Splash.
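As a rough sketch of the parallel approach, you can fan search URLs out across a thread pool. Here scrape_city is a stub standing in for a real per-worker job that would start its own Selenium driver and return a list of property dicts:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_city(search_url):
    """Stubbed worker: a real version would launch its own Selenium
    session, scrape every results page for search_url, and return
    the extracted property dictionaries."""
    return [{"source": search_url}]

search_urls = [
    "https://www.booking.com/searchresults.html?ss=new+york+city",
    "https://www.booking.com/searchresults.html?ss=london",
    "https://www.booking.com/searchresults.html?ss=tokyo",
]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Each worker handles one search; flatten the per-page lists.
    all_results = [row for page in pool.map(scrape_city, search_urls)
                   for row in page]
```

Each worker needs its own driver instance – WebDriver objects are not safe to share across threads.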

Wrap Up

Web scraping is a powerful tool for extracting data from sites like Booking.com for market research, pricing analysis, and more. With Python, Selenium, and a bit of web savvy, you can build a robust scraper to fit your data needs.

Just remember to always scrape ethically, respect site terms of service, and use the data responsibly. Happy scraping!

Additional Resources

To learn more about web scraping with Python, check out:
Modern Web Automation With Python and Selenium
Python Selenium Tutorial: Getting Started With Web Automation
Python Web Scraping Tutorials
