Yelp is a treasure trove of valuable data for businesses looking to gain a competitive edge. With over 244 million reviews and 5.2 million unique businesses listed across dozens of countries, Yelp data can provide powerful insights to inform your market research, lead generation, competitor analysis and more.
In this comprehensive guide, we'll dive deep into the why and how of scraping data from Yelp. As a web scraping expert, I'll share my tips and tactics for extracting data efficiently and ethically, overcoming challenges like rate limiting and CAPTCHAs, and analyzing the data for actionable insights.
Why Scrape Yelp Data?
So what can you do with Yelp data? The applications are nearly endless, but here are some of the top use cases I see:
Market Research – Analyze customer sentiment, preferences and trends in your vertical and location to guide product development and positioning. Identify whitespace opportunities with underserved audiences.
Competitive Analysis – Monitor competitor performance metrics like review volume, rating, and customer feedback. Reverse-engineer their strengths and weaknesses. Benchmark your own performance on key metrics.
Lead Generation – Source high-intent leads who have reviewed or checked in at relevant businesses. Yelp's rich business profiles can provide key contact information and context to personalize outreach.
Reputation Management – Track brand mentions and promptly respond to reviews. Perform sentiment analysis to quantify brand perception and changes over time.
Business Intelligence – Combine Yelp data with other sources like foot traffic, transaction, and census data to model business performance drivers and customer lifetime value.
For example, imagine you're launching a new restaurant and want to understand the competitive landscape in your area. By scraping Yelp data on other restaurants within a 5-mile radius, you could uncover insights like:
- Which cuisines and price points are most popular?
- What do customers praise or complain about in their reviews?
- How does foot traffic vary by day of week and time of day?
- What is the average star rating and review count of the top 10 restaurants?
Equipped with these insights, you could make data-driven decisions on everything from menu design to promotions to staffing.
Yelp Scraping Fundamentals
At its core, scraping Yelp data involves just a few key steps:
Determine your target data – Decide what fields you need to extract from which pages on Yelp. Typical data points include:
- Business name, categories, contact info, hours, etc. from business pages
- Review text, rating, author, date from review snippets
- Search rank and meta info from results pages
Inspect page structure – Use your browser's DevTools to examine the HTML structure of your target pages and identify the elements containing your target data. Take note of any patterns like CSS class names that can help you precisely select those elements.
Fetch page HTML – Make an HTTP request to fetch the HTML source of your target pages. You'll want to use a tool like Python's requests library to automate this and handle things like session cookies.
Parse HTML and extract data – Once you have the raw HTML, you'll need to parse it into a navigable structure using a library like BeautifulSoup. You can then use methods like CSS selectors to pinpoint your target elements and extract the text or attributes.
Handle pagination – Unless you only need data from a single page, you'll likely need to navigate through multiple pages of results or reviews. You can find navigation links in the pagination element and recursively fetch and parse each page (a sketch of this follows the example below).
Store extracted data – As you extract data, you'll want to save it somewhere for later analysis. A CSV file or database works well for tabular data, while a document store like MongoDB can be handy for free-form text like reviews.
Here's a simplified example of these steps in Python to extract Yelp business data:
import requests
from bs4 import BeautifulSoup

url = 'https://www.yelp.com/biz/prospect-san-francisco'

# Fetch page HTML
response = requests.get(url)

# Parse HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract business details
# Note: Yelp's obfuscated CSS class names change frequently, so treat these selectors as examples
name = soup.select_one('.css-11q1g5y').text
rating = float(soup.select_one('.i-stars')['aria-label'].split(' ')[0])
review_count = int(soup.select_one('.css-bq71j2').text.split(' ')[0])
address = ' '.join(soup.select_one('.street-address').text.split())
phone = soup.select_one('.css-na3oda').text

# Print extracted data
print(f"""
Business Name: {name}
Star Rating: {rating}
# of Reviews: {review_count}
Address: {address}
Phone: {phone}
""")
This would output:
Business Name: Prospect
Star Rating: 4.5
# of Reviews: 1389
Address: 300 Spear St San Francisco, CA 94105
Phone: (415) 247-7770
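That covers a single business page. To handle the pagination and storage steps from the list above, you can loop over result offsets and write each record out as you go. Here's a minimal sketch; the find_desc, find_loc and start query parameters, and the 10-result page size, reflect how Yelp's search URLs have looked in my experience and may change, and the link selector is only an illustration:
import csv
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.yelp.com/search'

with open('yelp_results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'url'])

    # Yelp search results are paginated in steps of 10 (assumption; verify in DevTools)
    for start in range(0, 50, 10):
        response = requests.get(
            BASE_URL,
            params={'find_desc': 'restaurants', 'find_loc': 'San Francisco, CA', 'start': start},
        )
        soup = BeautifulSoup(response.text, 'html.parser')

        # Business links on results pages start with /biz/ (selector is an assumption and may change)
        for link in soup.select('a[href^="/biz/"]'):
            name = link.get_text(strip=True)
            if name:  # skip image-only links with no text
                writer.writerow([name, 'https://www.yelp.com' + link['href']])

        time.sleep(2)  # polite delay between pages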
Scaling Your Yelp Scraping
While small one-off scraping tasks are relatively straightforward, efficiently scraping Yelp data at scale introduces some challenges:
Rate limiting – Yelp servers will throttle or block you if you make too many requests too quickly. You'll need to add delays between requests and spread them out across different IP addresses (see the sketch after this list).
CAPTCHAs – Yelp may serve a CAPTCHA to check if you're a bot. You'll need to detect these, solve them with an automated solver service, and retry your request.
Inconsistent page structures – As Yelp evolves its design over time, you may encounter new or different page structures and CSS selectors. Your code needs to handle this gracefully; techniques like fuzzy element selection with regular expressions can help.
Maintaining session state – Yelp uses cookies and other stateful mechanisms to track a sequence of requests. You'll need to maintain cookies across requests and handle authentication like a logged-in user.
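You can soften the first and last of these with plain requests: reuse a single Session so cookies persist, randomize delays, and back off with retries when you get throttled or hit a CAPTCHA page. A minimal sketch; the CAPTCHA check here is a crude string test, not a robust detector:
import random
import time

import requests

session = requests.Session()  # persists cookies across requests
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; research-bot)'})

def fetch(url, max_retries=3):
    """Fetch a URL with jittered delays and exponential backoff on throttling."""
    for attempt in range(max_retries):
        response = session.get(url)

        # 429/503 usually mean we're being throttled; a CAPTCHA page is another signal
        if response.status_code in (429, 503) or 'captcha' in response.text.lower():
            wait = (2 ** attempt) * 10
            print(f'Throttled on attempt {attempt + 1}, backing off {wait}s')
            time.sleep(wait)
            continue

        response.raise_for_status()
        time.sleep(random.uniform(2, 5))  # polite, randomized delay between requests
        return response

    raise RuntimeError(f'Giving up on {url} after {max_retries} attempts')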
Fortunately, there are tools and services that can help address these issues. For serious Yelp scraping at scale, I recommend looking at:
Headless browsers like Puppeteer that can automate a full browser session, handling things like JS rendering and session state so your traffic looks more like a real user's. Bonus: they can screenshot or PDF pages.
Proxy networks like Bright Data or ProxyMesh that allow you to route requests through millions of different IP addresses. Many offer easy integrations to automatically rotate IPs (a bare-bones rotation sketch follows this list).
CAPTCHA solvers like 2Captcha or DeathByCaptcha that use a combination of OCR and human labor to solve CAPTCHAs at scale for fractions of a cent.
Scraping platforms like ScrapingBee or ParseHub that provide a managed headless browser environment with built-in proxy rotation, CAPTCHA solving and more.
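To see what the proxy piece looks like without a managed platform, here's a bare-bones rotation sketch with plain requests. The proxy endpoints below are placeholders; a real provider supplies its own addresses and credentials:
import itertools

import requests

# Placeholder proxy endpoints -- substitute the ones your provider gives you
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
    'http://user:pass@proxy3.example.com:8000',
]
proxy_pool = itertools.cycle(PROXIES)

def fetch_via_proxy(url):
    """Send each request through the next proxy in the rotation."""
    proxy = next(proxy_pool)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=30)

response = fetch_via_proxy('https://www.yelp.com/biz/prospect-san-francisco')
print(response.status_code)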
Using ScrapingBee, for example, you could rewrite the earlier script to scrape Yelp data through a headless browser with automatic proxy rotation:
from bs4 import BeautifulSoup
from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='MY_API_KEY')

url = 'https://www.yelp.com/biz/prospect-san-francisco'

response = client.get(
    url,
    params={
        'render_js': 'true',      # render the page in a headless browser
        'premium_proxy': 'true',  # route through premium rotating proxies
        'country_code': 'us',
    },
)

soup = BeautifulSoup(response.content, 'html.parser')
# Extract data from parsed HTML as before
Analyzing Yelp Data
So you've scraped a wealth of Yelp data – now what? The fun part is deriving insights to drive decisions. Here are a few ideas:
Sentiment Analysis – Is the overall sentiment in reviews positive or negative? What are the most common praises and complaints? You can use a tool like VADER to score sentiment and extract key phrases (see the sketch after this list). Plotting review sentiment over time is a great way to gauge brand perception.
Competitor Benchmarking – How do you stack up against your competitors on key metrics like review count, star rating, and review recency? Calculate min/max/median statistics and percentiles to see where you rank.
Psychographic Segmentation – What are the interests, attitudes and personalities of your best customers? Clustering reviewers based on the other businesses they've reviewed and terms they use can surface compelling customer personas to target.
Menu Optimization – Which menu items are most popular and well-received in reviews? Are there items that are polarizing or frequently mentioned together? A frequency analysis of menu item terms can identify winners to highlight and losers to cut.
Operational Insights – Do reviews mention long wait times, rude service, or other operational issues? Tracking the incidence of these terms over time alongside operational changes can measure progress.
There are many other analyses you could perform on Yelp data – it all depends on your specific goals and needs. The key is having a systematic process to continuously collect data and derive actionable insights from it.
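To make the sentiment analysis idea concrete, here's a small sketch using the vaderSentiment package to score review texts and summarize the overall trend. The sample reviews are placeholders for whatever you've scraped:
# pip install vaderSentiment
from statistics import mean

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Placeholder review texts -- in practice, load these from your scraped dataset
reviews = [
    'The tasting menu was incredible and the service was warm and attentive.',
    'Waited 45 minutes past our reservation and the server was rude.',
    'Solid brunch spot, though the coffee was lukewarm.',
]

analyzer = SentimentIntensityAnalyzer()

# The compound score ranges from -1 (most negative) to +1 (most positive)
scores = [analyzer.polarity_scores(text)['compound'] for text in reviews]

for text, score in zip(reviews, scores):
    label = 'positive' if score >= 0.05 else 'negative' if score <= -0.05 else 'neutral'
    print(f'{score:+.2f} ({label}): {text[:60]}')

print(f'Average sentiment across {len(reviews)} reviews: {mean(scores):+.2f}')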
The Ethics of Scraping Yelp
As a final note, I think it's important to touch on the ethics of scraping data from Yelp. While it can be an incredibly valuable source of business intelligence, Yelp has been aggressive in fighting back against unauthorized scraping and considers it a violation of its terms of service.
Yelp has taken legal action against companies like Revleap, which scraped Yelp data to help businesses solicit reviews, alleging breach of contract, unfair competition and more, and has prevailed in court.
I'm not a lawyer and can't give legal advice, but I believe there are ways to scrape Yelp data ethically and legally:
- Use Yelp's public APIs for access to certain approved datasets (a minimal example follows below).
- Respect robots.txt restrictions on which pages can be crawled.
- Don't use scraped data to spam or harass reviewers.
- Don't republish scraped reviews or personally identifying information.
- Consider partnering with companies that have authorized access to Yelp data.
Ultimately, any business looking to scrape Yelp data should carefully consult with their legal counsel to assess their specific situation and risk tolerance.
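To show what that first option looks like in practice, here's a minimal sketch using Yelp's official Fusion API, which returns structured business data under Yelp's own terms. It assumes you've registered an app at yelp.com/developers and have an API key:
import requests

API_KEY = 'MY_FUSION_API_KEY'  # obtained by registering an app with Yelp

response = requests.get(
    'https://api.yelp.com/v3/businesses/search',
    headers={'Authorization': f'Bearer {API_KEY}'},
    params={'term': 'restaurants', 'location': 'San Francisco, CA', 'limit': 10},
)
response.raise_for_status()

# Print name, star rating, and review count for each matching business
for business in response.json()['businesses']:
    print(f"{business['name']} - {business['rating']} stars ({business['review_count']} reviews)")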
Conclusion
Yelp is an incredibly rich source of business data waiting to be tapped for insights. With the right tools and techniques, you can efficiently scrape granular data on millions of businesses and reviews.
By following the steps and best practices outlined in this guide, you'll be well on your way to unlocking the power of Yelp data for your business. Just be sure to approach it ethically and legally to stay above board.
The potential applications are limitless – from market research to lead generation to competitor benchmarking. All that's needed is some technical know-how and a clear strategy for deriving actionable insights from the data.
I'll leave you with some final tips for effective Yelp scraping:
- Start small with a proof of concept before scaling up your scraped dataset
- Continuously monitor and adapt to changes in Yelp's page structure and anti-bot measures
- Use a scraping platform like ScrapingBee to handle the heavy lifting of proxy rotation, CAPTCHAs and more
- Invest time in data cleaning and modeling to structure data for easy analysis
- Focus on deriving a few key actionable insights rather than getting lost in the weeds of analysis
- Consult with legal counsel to ensure you're playing by the rules
Happy scraping!