Web scraping, the automated extraction of data from websites, is big business. According to a recent report from MarketsandMarkets, the global web scraping services market is expected to grow from $1.6 billion in 2022 to $5.7 billion by 2027, at a CAGR of 29.0% during the forecast period.
One of the most popular targets for web scraping is Walmart.com, the e-commerce site of the world's largest company by revenue. With hundreds of millions of products across dozens of categories, Walmart's online storefront is a treasure trove of valuable data for companies looking to track prices, monitor inventory, generate leads, or conduct market research.
However, scraping data from Walmart.com at scale is no easy feat. Like many large websites, Walmart employs a variety of bot detection and mitigation techniques to prevent unauthorized access to its content. These include:
- User agent fingerprinting
- IP rate limiting and blocking
- Browser/JavaScript-based bot detection
- CAPTCHA challenges
In this in-depth guide, we'll walk through how to build a robust web scraper for Walmart.com using Python, and how to avoid getting blocked in the process. We'll cover:
- The basics of making HTTP requests to fetch Walmart product pages
- Parsing HTML and JSON responses to extract key data points
- Dealing with dynamic content loaded via JavaScript
- Using proxies and headless browsers to avoid IP blocking
- Scaling up your scraper using the Scrapy framework
- Walmart-specific tips and best practices from a web scraping expert
Whether you're a beginner looking to scrape your first website, or an experienced developer tasked with extracting Walmart data at scale, this guide will provide you with the knowledge and tools you need to succeed.
Understanding the Walmart.com Product Catalog
Before we start writing any code, it's important to understand the structure and scale of Walmart's online product catalog.
According to Walmart's Q3 2022 earnings report, Walmart U.S. e-commerce sales grew 16% YoY, with comps up 8%. The company also noted that it has over 240 million items available on Walmart.com, up from 220 million the previous quarter.
These products are organized into a hierarchy of departments, categories, and subcategories. For example, the "Electronics" department contains categories like "Computers", "TVs", and "Cell Phones", each of which has its own subcategories like "Laptops", "4K TVs", etc.
Walmart.com provides several ways to access this product data:
- HTML pages designed for human viewers, which can be scraped
- A JSON API used by the Walmart.com frontend to dynamically load content
- Structured product feeds in XML and TXT formats for affiliates and sellers
In this guide, we'll focus on scraping the human-readable HTML pages, as this is the most general approach that can be adapted to other websites. However, it's worth noting that the JSON API can be a more reliable and efficient way to extract large amounts of product data, if you can reverse-engineer its schema and authentication methods.
Fetching Product Pages
The first step in any web scraping project is to programmatically fetch the HTML content of the pages you want to extract data from. In Python, this is typically done using the `requests` library:

```python
import requests

url = "https://www.walmart.com/ip/Instant-Pot-7-in-1-Electric-Pressure-Cooker-6-Qt-Slow-Cooker-Rice-Cooker-Steamer-Saute-Yogurt-Maker-Sterilizer-Stainless-Steel-13/437205053"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers)
print(response.text)
```
This code sends an HTTP GET request to the URL of a Walmart product page, including a `User-Agent` header to identify itself as a Chrome browser. The server responds with the HTML content of the page, which we print out.
However, if you run this code, you may see an error message like this in the response:
```html
<p>To make sure that we maintain a safe environment for our customers and associates, it's necessary to verify that you're a human.</p>
```
This indicates that Walmart has detected our script as a bot and is blocking access. In my testing, Walmart blocks requests based on several factors:
- Missing or suspicious `User-Agent` and other headers
- High request rate from a single IP address
- Accessing too many pages in a short period of time
- Abnormal browsing patterns (e.g. only viewing product pages)
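A first, lighter-weight mitigation is to send a fuller, browser-like set of headers and to space requests out with randomized delays. Here's a minimal sketch with `requests` (the extra header values are illustrative, and this alone may not be enough to satisfy Walmart's checks):

```python
import random
import time

import requests

# Browser-like headers; the exact values are illustrative, not magic
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()  # reuse cookies and connections across requests
urls = [
    "https://www.walmart.com/ip/Instant-Pot-7-in-1-Electric-Pressure-Cooker-6-Qt-Slow-Cooker-Rice-Cooker-Steamer-Saute-Yogurt-Maker-Sterilizer-Stainless-Steel-13/437205053",
]

for url in urls:
    response = session.get(url, headers=headers)
    print(url, response.status_code)
    time.sleep(random.uniform(2.0, 6.0))  # randomized pause to mimic human pacing
```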
Headers and pacing only get you so far, though. To avoid triggering these bot detection rules, we need to make our scraper look and act more like a human user. One way to do this is to use a real web browser like Chrome or Firefox to make requests, rather than the `requests` library.
We can automate this using a library like Selenium:
```python
from selenium import webdriver

url = "https://www.walmart.com/ip/Instant-Pot-7-in-1-Electric-Pressure-Cooker-6-Qt-Slow-Cooker-Rice-Cooker-Steamer-Saute-Yogurt-Maker-Sterilizer-Stainless-Steel-13/437205053"

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get(url)

html = driver.page_source
print(html)
driver.quit()
```
This code launches a headless Chrome browser, loads the Walmart product page, and prints the page HTML. By using a real browser, we avoid some of the bot detection triggers based on headers and behavior.
However, this approach has some downsides. Automated browsers are slower than making requests directly, and they can still be detected based on fingerprinting techniques. Walmart also requires solving a CAPTCHA in some cases, which Selenium can't handle on its own.
A better solution is to use an API like ScrapingBee, which handles browser automation and CAPTCHAs behind the scenes:
```python
from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

url = "https://www.walmart.com/ip/Instant-Pot-7-in-1-Electric-Pressure-Cooker-6-Qt-Slow-Cooker-Rice-Cooker-Steamer-Saute-Yogurt-Maker-Sterilizer-Stainless-Steel-13/437205053"
response = client.get(url)
print(response.content)
```
ScrapingBee rotates IP addresses, solves CAPTCHAs, and renders JavaScript automatically, so we can focus on parsing the data we need.
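The client also accepts per-request options through a `params` dictionary. As a sketch based on ScrapingBee's documented options (verify names and defaults against the current API reference), you can explicitly enable JavaScript rendering and premium proxies for a stubborn target like Walmart:

```python
from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')
url = "https://www.walmart.com/ip/Instant-Pot-7-in-1-Electric-Pressure-Cooker-6-Qt-Slow-Cooker-Rice-Cooker-Steamer-Saute-Yogurt-Maker-Sterilizer-Stainless-Steel-13/437205053"

# Option names follow ScrapingBee's documented API parameters
response = client.get(
    url,
    params={
        'render_js': 'True',      # execute JavaScript in a headless browser
        'premium_proxy': 'True',  # route the request through premium/residential proxies
        'country_code': 'us',     # geo-target US exit nodes
    },
)
print(response.status_code)
```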
Parsing Product Data
Once we have the HTML content of a Walmart product page, we need to extract the relevant data points into a structured format. For this, we'll use the Beautiful Soup library (`bs4`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')

name = soup.find('h1', {'itemprop': 'name'}).text.strip()
price = soup.select_one('.price-characteristic').text
images = [img['src'] for img in soup.select('#main-image-carousel img')]
description = soup.select_one('#about-product-section').text.strip()

print({'name': name, 'price': price, 'images': images, 'description': description})
```
This code finds and extracts the product name, price, image URLs, and description from the page HTML using CSS selectors. The results are printed as a Python dictionary.
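One caveat: these selectors reflect a single snapshot of Walmart's markup, which changes regularly. Continuing from the snippet above, a slightly more defensive version (same selectors, just guarded) returns `None` instead of crashing when an element is missing:

```python
# Guard against missing elements so a markup change yields None instead of an exception
name_el = soup.find('h1', {'itemprop': 'name'})
name = name_el.text.strip() if name_el else None

price_el = soup.select_one('.price-characteristic')
price = price_el.text if price_el else None
```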
However, you may notice that some data points, like the description or reviews, are missing from the initial HTML response. That‘s because Walmart loads this content dynamically using JavaScript, to improve page load times.
To get this dynamic content, we need to either execute the JavaScript using a headless browser (which ScrapingBee handles for us), or find the API endpoints that return the data and request them directly.
By inspecting the network traffic in Chrome's developer tools, I was able to identify the relevant API endpoints:
- Product details: `https://www.walmart.com/terra-firma/item/<item-id>`
- Reviews: `https://www.walmart.com/terra-firma/review-comments/<item-id>?sort=submission-desc&page=1`
Here's how we can modify our script to fetch and parse this additional data:

```python
import json

# The numeric item ID is the last path segment of the product URL
item_id = response.url.split('/')[-1]

# Product details endpoint
details_url = f'https://www.walmart.com/terra-firma/item/{item_id}'
details_response = client.get(details_url)
details_data = json.loads(details_response.content)
description = details_data['short_description']['values'][0]
specifications = details_data['specifications']['groups']

# Reviews endpoint (first page, newest first)
reviews_url = f'https://www.walmart.com/terra-firma/review-comments/{item_id}?sort=submission-desc&page=1'
reviews_response = client.get(reviews_url)
reviews_data = json.loads(reviews_response.content)
reviews = reviews_data['reviews']
```
This code extracts the item ID from the URL, constructs the API endpoints for product details and reviews, fetches the JSON responses, and parses out the relevant fields.
We now have all the key data points we need for a given product:
- Name
- Price
- Images
- Description
- Specifications
- Reviews
The final step is to save this data in a structured format like JSON or CSV, so it can be analyzed and consumed by other applications.
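For example, continuing with the fields extracted above, a minimal sketch that writes one product to JSON (and the flat fields to CSV) could look like this; the file names are arbitrary:

```python
import csv
import json

product = {
    'name': name,
    'price': price,
    'images': images,
    'description': description,
    'specifications': specifications,
    'reviews': reviews,
}

# JSON preserves nested structures like specifications and reviews as-is
with open('product.json', 'w', encoding='utf-8') as f:
    json.dump(product, f, ensure_ascii=False, indent=2)

# CSV suits flat fields; nested values would need flattening first
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price', 'description'])
    writer.writeheader()
    writer.writerow({k: product[k] for k in ['name', 'price', 'description']})
```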
Scaling Up with Scrapy
So far, we've focused on scraping a single product page at a time. But what if you need to extract data for thousands or millions of Walmart products?
To scrape at this scale, you'll want to use a framework like Scrapy that provides built-in support for concurrent requests, retries, and error handling. Scrapy also integrates well with proxies and services like ScrapingBee to avoid IP blocking.
Here's a basic Scrapy spider that extracts product data from a list of Walmart URLs:
```python
import json

import scrapy
from scrapingbee import ScrapingBeeClient


class WalmartSpider(scrapy.Spider):
    name = 'walmart'

    def start_requests(self):
        urls = [
            'https://www.walmart.com/ip/...',
            'https://www.walmart.com/ip/...',
            'https://www.walmart.com/ip/...',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        data = {
            'name': response.css('h1[itemprop="name"]::text').get(),
            'price': response.css('.price-characteristic::text').get(),
            'images': response.css('#main-image-carousel img::attr(src)').getall(),
            'description': response.css('#about-product-section::text').get(),
        }
        item_id = response.url.split('/')[-1]
        client = ScrapingBeeClient(api_key='YOUR_API_KEY')

        details_url = f'https://www.walmart.com/terra-firma/item/{item_id}'
        details_response = client.get(details_url)
        details_data = json.loads(details_response.content)
        data['specifications'] = details_data['specifications']['groups']

        reviews_url = f'https://www.walmart.com/terra-firma/review-comments/{item_id}?sort=submission-desc&page=1'
        reviews_response = client.get(reviews_url)
        reviews_data = json.loads(reviews_response.content)
        data['reviews'] = reviews_data['reviews']

        yield data
```
This spider defines a list of product URLs to scrape in the `start_requests` method, and a `parse` callback that extracts the relevant data from each page using CSS selectors.
It also makes additional requests to the product details and reviews APIs using ScrapingBee, and combines all the data into a single dictionary that is yielded as the result.
To run this spider, you would save it as a Python file (e.g. `walmart_spider.py`) in your Scrapy project, and then run:

```bash
scrapy crawl walmart -o output.json
```

This command starts the `walmart` spider, scrapes the specified URLs, and saves the results to a file called `output.json`.
You can further customize the spider to handle pagination, retries, and logging by consulting the Scrapy documentation.
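As a starting point, a few of Scrapy's built-in settings (in your project's `settings.py`, or via `custom_settings` on the spider) control concurrency, retries, throttling, and logging. The values below are a conservative sketch rather than recommendations tuned for Walmart:

```python
# settings.py - conservative defaults; tune against your own success-rate metrics
CONCURRENT_REQUESTS = 8        # how many requests Scrapy keeps in flight at once
DOWNLOAD_DELAY = 1.0           # base delay (seconds) between requests to a domain
RETRY_ENABLED = True
RETRY_TIMES = 3                # retry failed or blocked requests a few times
AUTOTHROTTLE_ENABLED = True    # back off automatically when responses slow down
LOG_LEVEL = 'INFO'
FEED_EXPORT_ENCODING = 'utf-8'
```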
Walmart Web Scraping Tips & Tricks
To wrap up, here are some tips and best practices I've learned from scraping Walmart and other e-commerce sites over the years:
- Use ScrapingBee or a similar service to manage proxies, CAPTCHAs, and browser rendering. It's not worth the time and effort to build this infrastructure yourself.
- Focus on parsing the JSON APIs rather than HTML pages whenever possible. The data is more structured and less likely to change over time.
- Use Scrapy's built-in features like request retries, concurrency settings, and plugins to improve the reliability and performance of your scraper.
- Store the scraped data in a database like MySQL or MongoDB rather than JSON/CSV files, so you can easily query and analyze it later (see the sketch after this list).
- Monitor your scraper's logs and metrics (e.g. request success rate, items scraped per minute) to detect issues and optimize performance.
- Be mindful of Walmart's robots.txt file and terms of service. Don't scrape more aggressively than necessary, and consider caching data to reduce request volume.
- Keep your scraper's code and dependencies up to date, as Walmart frequently changes its site structure and anti-bot measures.
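On the database tip, here's a minimal sketch of persisting scraped items to MongoDB with `pymongo`; the connection string, database/collection names, and the `item_id` key are placeholders for illustration:

```python
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')  # placeholder connection string
collection = client['walmart']['products']         # hypothetical database/collection names

def save_product(data: dict) -> None:
    # Upsert keyed on item_id so re-scraping a product updates the existing document
    collection.update_one({'item_id': data['item_id']}, {'$set': data}, upsert=True)
```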
By following these tips and the techniques outlined in this guide, you should be able to scrape data from Walmart.com reliably and efficiently.
Conclusion
Web scraping Walmart.com is a complex but rewarding endeavor, with valuable data on hundreds of millions of products up for grabs. By understanding Walmart's site structure, bot detection methods, and API endpoints, we can build scrapers that extract this data at scale.
In this guide, we've covered the key steps involved in scraping Walmart.com using Python:
- Fetching product pages using automated browsers and services like ScrapingBee
- Parsing HTML and JSON responses to extract product data
- Dealing with dynamic content loaded via JavaScript and APIs
- Scaling up the scraping process using the Scrapy framework
- Tips and best practices for reliable and efficient Walmart scraping
I hope this guide has been helpful in your web scraping journey. Feel free to reach out if you have any questions or feedback. Happy scraping!