How to Scrape Real Estate Data from Realtor.com using Python

Realtor.com is one of the most popular real estate listing websites, with data on millions of homes for sale across the United States. For real estate investors, home buyers, and data scientists, this wealth of information can provide valuable insights into the housing market. By using web scraping techniques to extract and analyze realtor.com data, you can:

  • Build a comprehensive dataset of property details like price, square footage, location, etc.
  • Analyze trends and patterns in your local real estate market
  • Identify undervalued or overpriced listings
  • Find and compare investment opportunities

While realtor.com doesn't provide a public API to access their data, it's possible to scrape listing information directly from the website. In this guide, I'll walk through the process of building a web scraper for realtor.com using Python and Selenium.

Before we jump into the code, make sure you have the following prerequisites:

  • Python 3.x installed
  • Selenium package installed (pip install selenium)
  • Chrome web browser
  • ChromeDriver executable

Step 1 – Initial Setup
Our first step is to set up Selenium to navigate to a realtor.com search results page. We'll use the URL for Cincinnati, OH properties as an example:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.realtor.com/realestateandhomes-search/Cincinnati_OH'

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
driver.get(url)

This opens the Chrome browser, navigates to the Cincinnati listings page, and gives us a driver object to interact with the page.
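The URL pattern generalizes to other markets, so you can point the same scraper at any city. Here is a small helper for building the search URL; the function name is hypothetical, and the hyphenation of multi-word city names is an assumption about realtor.com's URL scheme:

```python
def search_url(city: str, state: str) -> str:
    # realtor.com search pages appear to follow the pattern
    # /realestateandhomes-search/<City>_<ST>, with spaces written as hyphens.
    slug = city.strip().title().replace(' ', '-')
    return f'https://www.realtor.com/realestateandhomes-search/{slug}_{state.upper()}'

print(search_url('Cincinnati', 'oh'))
# https://www.realtor.com/realestateandhomes-search/Cincinnati_OH
```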

Step 2 – Scraping Listing Data
Now we need to locate the HTML elements that contain the data we want to extract. Using the browser's inspect tool, we can see that each listing is in an <li> element (class "component_property-card") carrying the attribute data-testid="result-card". Within those elements, we can find the price, beds, baths, square footage, address, and other fields.

listings = driver.find_elements(By.XPATH, '//li[@data-testid="result-card"]')

for listing in listings:
    price = listing.find_element(By.XPATH, './/span[@data-label="pc-price"]').text
    beds = listing.find_element(By.XPATH, './/li[@data-label="pc-meta-beds"]').text
    baths = listing.find_element(By.XPATH, './/li[@data-label="pc-meta-baths"]').text
    sqft = listing.find_element(By.XPATH, './/li[@data-label="pc-meta-sqft"]').text
    address = listing.find_element(By.XPATH, './/div[@data-label="pc-address"]').text
    print(f'{address} – {price} – {beds} – {baths} – {sqft}')

This locates all the listing <li> elements, then loops through them to extract and print the target data points from each one. We use XPath selectors to precisely target the right elements.

Step 3 – Paginating Through Results
The realtor.com search results are paginated, so our script above only scrapes the first page of listings. To get more data, we need to navigate through the additional pages.

We can do this by finding and clicking the "Next" button after scraping each page. The button is an <a> tag with a title of "Go to Next Page":

from selenium.common.exceptions import NoSuchElementException

while True:
    listings = driver.find_elements(By.XPATH, '//li[@data-testid="result-card"]')
    for listing in listings:
        ...

    # Click next page
    try:
        next_button = driver.find_element(By.XPATH, '//a[@title="Go to Next Page"]')
        next_button.click()
        print('Navigating to next page...')
    except NoSuchElementException:
        print('No more pages!')
        break

This loops through all the search result pages, scraping the listings on each page until there are none left. The try/except block breaks the loop when no "Next" button is found on the final page.

Step 4 – Handling CAPTCHAs and Bot Detection
As you paginate through many pages and make lots of requests to realtor.com, you'll likely encounter a CAPTCHA challenge to prove you're not a bot. There are a few ways to deal with these:

1. Manually solve the CAPTCHAs as they appear. You could add a prompt for the user to press a key to continue after solving it.

2. Use a CAPTCHA solving service that uses human workers to solve the CAPTCHAs and return the solution via API. This costs money but allows you to automate the whole process.

3. Introduce delays and randomness into your scraper's actions (scrolling, mouse movements, wait times between pages) to mimic human behavior and avoid triggering the CAPTCHAs.

4. Outsource the whole process to a web scraping service like ScrapingBee. Their platform runs your scraper code in the cloud, handling all the headless browser configuration, IP rotation, and CAPTCHA solving behind the scenes. Here's what the code would look like using ScrapingBee:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')
response = client.get(
    'https://www.realtor.com/realestateandhomes-search/Cincinnati_OH',
    params={
        'extract_rules': {
            'listings': {
                'selector': '//li[@data-testid="result-card"]',
                'type': 'list',
                'output': {
                    'price': './/span[@data-label="pc-price"]',
                    'beds': './/li[@data-label="pc-meta-beds"]',
                    'baths': './/li[@data-label="pc-meta-baths"]',
                    'sqft': './/li[@data-label="pc-meta-sqft"]',
                    'address': './/div[@data-label="pc-address"]'
                }
            },
            'next_page': {
                'selector': '//a[@title="Go to Next Page"]',
                'output': '@href'
            }
        }
    }
)

data = response.json()
listings = data['listings']
next_page = data['next_page']

With this approach, ScrapingBee handles the entire web scraping process, returning structured JSON data that's ready to use for analysis. It automatically rotates IP addresses and solves CAPTCHAs so you don't have to worry about realtor.com blocking your scraper.
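If you stay with plain Selenium, option 3 above (delays and randomness) can be sketched as a small helper that pauses for a random interval between page loads. The bounds here are illustrative, not tuned values:

```python
import random
import time

def human_pause(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Sleep for a random interval so requests don't arrive at a perfectly regular cadence."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# In the pagination loop, call it between page scrapes:
# scrape_page(driver)
# human_pause()
```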

Advanced Tips
Here are a few ideas to take your realtor.com scraper to the next level:

  • Click into each individual listing page to extract even more data points like property details, agent info, and location details. You'll need to scrape and navigate many separate listing pages to build a complete dataset.

  • Set up a cron job or scheduled task to run your scraper automatically every day/week/month to keep your real estate dataset fresh with the latest listings.

  • Store your scraped data in a SQL database or export it to CSV files for easy analysis in Excel, Python, Tableau, etc.

  • Combine your realtor.com data with other demographic, economic, and geographic datasets to uncover even more insights about your target real estate market.
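The CSV export tip needs nothing beyond the standard library. A minimal sketch, assuming each listing has been collected into a dict with the fields scraped above:

```python
import csv

FIELDS = ['address', 'price', 'beds', 'baths', 'sqft']

def save_listings(listings: list[dict], path: str = 'listings.csv') -> None:
    """Write scraped listing dicts to a CSV file, one row per property."""
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(listings)

save_listings([{'address': '123 Main St, Cincinnati, OH',
                'price': '$250,000', 'beds': '3bed',
                'baths': '2bath', 'sqft': '1,806sqft'}])
```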

I hope this guide gives you a solid foundation for scraping real estate listing data from realtor.com. With some web scraping skills, a little Python code, and a service like ScrapingBee, you can build a powerful dataset to help with your home search, real estate investing, or data analysis. Happy scraping!
