cURL is a powerful and popular command-line tool for sending HTTP requests and scraping data from websites. As a web scraping expert, I often find it helpful to run cURL commands from within Python scripts in order to automate repetitive tasks. In this article, I'll walk through the different ways you can run cURL in Python and share examples of how to apply this technique to real web scraping scenarios.
Two Ways to Run cURL in Python
There are two main approaches to running cURL commands in Python:
Use Python's built-in subprocess module to execute cURL as a separate command-line process
Use a Python library such as pycurl, which wraps libcurl directly, or requests, which offers a comparable high-level HTTP interface in pure Python
The subprocess method launches cURL as its own process, separate from your Python script. This can be a quick and easy way to fire off a cURL command without much overhead. Here's an example:
import subprocess
# check_output runs cURL as a child process and returns its stdout as bytes
output = subprocess.check_output(["curl", "-X", "GET", "https://example.com"])
print(output.decode("utf-8"))
The downside of this approach is that it doesn't integrate cURL very tightly with your Python code: you have to pass arguments as a list of strings, and the output comes back as raw bytes for you to decode and parse yourself.
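If you do go the subprocess route, a little extra plumbing makes it more robust. Here's a minimal sketch, using only the standard library, that splits a shell-style cURL command into an argument list and raises an error on a non-zero exit status (the command and URL are just placeholders):

import shlex
import subprocess

# A cURL command written the way you'd type it in a shell
command = 'curl -s -X GET "https://example.com"'

# shlex.split respects the quoting, so the string maps cleanly onto an argument list
result = subprocess.run(shlex.split(command), capture_output=True, check=True)

# stdout comes back as bytes; decode it before handing it to other Python code
html = result.stdout.decode("utf-8")
print(html[:200])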
Using a library keeps everything inside Python. pycurl exposes libcurl's options directly, while requests offers a cleaner, more Pythonic API for making the same kinds of HTTP requests, configured entirely in native Python code. For example, here's how to make the same request using the requests library:
import requests
r = requests.get("https://example.com")
print(r.text)
Personally, I recommend using Python's requests library in most cases, as it abstracts away some of cURL's complexities while still giving you plenty of control. The subprocess method can be handy for quickly testing cURL commands before converting them to Python code.
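That extra control matters as soon as you need more than a one-off GET. As a quick sketch (the User-Agent string and timeout below are arbitrary placeholders, not requirements), a requests.Session lets you reuse headers and connections across many requests:

import requests

# A Session reuses connections and default settings across requests
session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/1.0"})  # placeholder UA string

# Timeouts are easy to forget with raw cURL; here they're one keyword argument away
r = session.get("https://example.com", timeout=10)
print(r.status_code, len(r.text))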
Converting cURL Commands to Python Requests
Speaking of converting cURL to Python, let's walk through an example of how to translate a cURL command into equivalent Python requests code.
Here's a sample cURL command that sends a POST request with some JSON data:
curl -X POST -H "Content-Type: application/json" -d '{"username":"xyz","password":"123"}' https://example.com/api/login
To convert this to Python requests code, we'll use the requests.post() method and configure it to match the cURL command:
import requests

url = "https://example.com/api/login"
payload = {"username": "xyz", "password": "123"}
# json=payload would set this header for us, but it's kept here to mirror the cURL command
headers = {"Content-Type": "application/json"}

r = requests.post(url, json=payload, headers=headers)
print(r.text)
As you can see, the requests library allows us to naturally express the same HTTP request that we made with cURL. We can define the URL, payload, and headers as Python variables and then pass them into the requests.post method.
I find it's usually easier to write out my requests in cURL first to test them and then convert them to Python as the second step. There are also online tools like curlconverter.com that will automatically translate cURL commands to Python requests code.
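The same mapping works for more involved commands. For instance, a cURL command that uses basic auth and a custom header, say curl -u user:pass -H "Accept: application/json" https://example.com/api/data (a made-up endpoint, just for illustration), translates like this:

import requests

# Hypothetical endpoint, used only to show how cURL flags map to requests arguments
url = "https://example.com/api/data"
r = requests.get(
    url,
    auth=("user", "pass"),                   # cURL's -u user:pass
    headers={"Accept": "application/json"},  # cURL's -H "Accept: application/json"
)
print(r.status_code)
print(r.text[:200])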
Using cURL in Python with ScrapingBee
One of the challenges of using Python and cURL for web scraping is that some websites will block your requests if they suspect you are a bot. An effective way around this is to route your cURL requests through a web scraping API like ScrapingBee.
By sending cURL requests to the ScrapingBee API, which then forwards them to the target website, you can avoid having your IP address blocked. Here's an example of how to use ScrapingBee with Python requests:
import requests

r = requests.get(
    url='https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': 'YOUR_API_KEY',
        'url': 'https://example.com',
        'render_js': 'false',
        'wait_for': '3000',
    },
)
print(r.text)
This sends your request through ScrapingBee's servers to fetch the content from the specified URL. You can configure options like whether to render JavaScript and how long to wait for the page to load.
ScrapingBee provides 1,000 free API requests to get started. For large-scale scraping jobs, it can be a valuable way to outsource the work of maintaining proxies and dodging rate limits.
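In practice, you'll probably want to wrap that call in a small helper with a timeout and some retries. Here's one possible sketch; the function name, retry count, and 30-second timeout are my own choices, not anything prescribed by the ScrapingBee API:

import requests

def fetch_via_scrapingbee(target_url, api_key, retries=3):
    # Try the request a few times before giving up on transient failures
    for attempt in range(retries):
        try:
            r = requests.get(
                'https://app.scrapingbee.com/api/v1/',
                params={'api_key': api_key, 'url': target_url},
                timeout=30,
            )
            r.raise_for_status()  # surface HTTP errors (4xx/5xx) as exceptions
            return r.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
    raise RuntimeError(f"Could not fetch {target_url} after {retries} attempts")

# Usage (replace YOUR_API_KEY with a real key):
# html = fetch_via_scrapingbee('https://example.com', 'YOUR_API_KEY')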
Real-World Web Scraping Example with cURL and Python
To bring everything together, let's walk through a more complex, real-world example of using Python and cURL to scrape data from a website. We'll fetch the latest stories from Hacker News and extract their titles and links.
Here's a cURL command to get the current Hacker News front page:
curl https://news.ycombinator.com/
When I run this on the command line, I get back the raw HTML content of the front page. Now let's translate this to Python requests code and parse out the data we want:
import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Each story title on the front page is an <a> inside an element with class "titleline"
links = soup.select('.titleline > a')
for link in links:
    print(link.text)
    print(link.get('href'))
    print()
This script does the following:
- Fetches the HTML content of the Hacker News homepage using requests
- Parses the HTML using BeautifulSoup
- Selects the anchor elements inside the elements with the "titleline" CSS class
- Prints out each link's text and URL
When I run this, I get output like:
New AI Tells Jokes Like Humans Do
https://spectrum.ieee.org/this-new-ai-can-tell-jokes-like-humans-do-part-of-the-time-if-you-help-it
Single Page Applications Are Not Accessible
https://www.amberley.dev/blog/2023-04-14-single-page-applications-are-not-accessible/
Unix tools are ill suited for data science
https://alexey-popkov.medium.com/unix-tools-are-ill-suited-for-data-science-c20404fa0fc4
...
This illustrates a realistic example of how cURL and Python work together for web scraping: prototyping the request with cURL, fetching the page with requests, and then parsing the result with Python libraries like BeautifulSoup and pandas is a common workflow.
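To extend the example in that direction, here's a rough sketch, assuming pandas is installed, that collects the scraped titles and URLs into a DataFrame and writes them to a CSV file (the filename is my own choice):

import pandas as pd
import requests
from bs4 import BeautifulSoup

r = requests.get("https://news.ycombinator.com")
soup = BeautifulSoup(r.text, 'html.parser')

# Collect one {title, url} row per story link, then load the rows into a DataFrame
rows = [
    {"title": link.text, "url": link.get('href')}
    for link in soup.select('.titleline > a')
]
df = pd.DataFrame(rows)
print(df.head())
df.to_csv("hn_frontpage.csv", index=False)  # "hn_frontpage.csv" is an arbitrary name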
Conclusion
As a web scraping expert, I find cURL to be an indispensable tool, both on its own and in conjunction with Python. The ability to quickly test requests in cURL and then translate them to Python code helps me rapidly prototype and build web scrapers.
In this article, we covered the different ways to execute cURL from Python, how to convert between cURL and Python requests, and how tools like ScrapingBee can facilitate cURL scraping at scale. We walked through a real example of using cURL and Python together to scrape data from the Hacker News homepage.
I encourage you to try out these techniques the next time you need to automate accessing data from websites. With practice, you'll be able to efficiently extract data from even the most complicated sources using the power of cURL and Python!