How to Scrape Twitter Data Using Python and Selenium in 2023

Twitter has become a goldmine of valuable data for researchers, marketers, and data scientists. From analyzing sentiment around certain topics to identifying influential users in a network, the insights gained from Twitter data can be tremendously powerful. However, accessing all that data isn't always easy.

While Twitter does provide official APIs, they are rate-limited and getting approval to access them can be a slow process. Many developers needing to extract large amounts of data from Twitter in a flexible way have turned to web scraping instead.

In this tutorial, we'll walk through how to use Python and Selenium to programmatically scrape data from Twitter profiles. Selenium lets you automate interactions with web pages through a real browser, which is useful for scraping dynamic sites like Twitter that load content via JavaScript. We'll go over all the key steps involved in locating, extracting, and structuring the desired data.

Setting Up Your Scraping Environment

Before we dive into the code, you'll need to make sure you have a few prerequisites installed and set up:

  • Python 3.x
  • Selenium package
  • WebDriver for the browser you want to use (e.g. ChromeDriver for Chrome)
  • A Twitter account

If you don't already have Python installed, download the latest version from the official Python website. Then install Selenium by running:

pip install selenium

Next, download the WebDriver executable for your browser of choice. For this example we'll use ChromeDriver. Check your version of Chrome, then download the corresponding version of ChromeDriver from the downloads page.
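
If you'd rather not manage driver binaries by hand, the third-party webdriver-manager package (a separate pip install webdriver-manager, not part of Selenium itself) can download a matching ChromeDriver for you. A minimal sketch:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# install() downloads a ChromeDriver matching your local Chrome (if one
# isn't cached already) and returns the path to the executable
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))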

Finally, you'll need a Twitter account that you can log into. It's best to create a separate account from your personal one for scraping purposes. Keep in mind that Twitter may temporarily or permanently suspend accounts they identify as scraping bots, so don't use an account tied to anything important.

Architecting the Scraping Process

Once your environment is ready, let's map out the key components we need to build:

  1. Initiate a Selenium WebDriver instance and navigate to a Twitter profile page
  2. Wait for the page and tweets to fully load
  3. Locate and extract the desired data points:
    • Profile name, handle, bio, location, website, join date
    • Follower and following counts
    • Tweet and retweet text
    • Reply, like, and retweet counts per tweet
  4. Handle pagination to extract data past the first page of tweets
  5. Structure and save the extracted data

We'll go through each of these steps in detail, but first let's set up a new Python file and import the necessary Selenium modules:

from selenium import webdriver  
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Navigating to a Twitter Profile

First, we need to initialize the Selenium WebDriver and instruct it to pull up a specific Twitter profile. With ChromeDriver under Selenium 4, where the driver path is passed via a Service object, that looks like:

from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
driver.get('https://twitter.com/username')

Make sure to replace /path/to/chromedriver with the actual path where you installed the ChromeDriver executable, and replace username with the actual Twitter handle of the profile you want to scrape.

Waiting for Page & Tweets to Load

One of the tricky things about scraping Twitter is that much of the page content loads dynamically via JavaScript after the initial HTML loads. To ensure the elements we want to extract are present, we need to wait until the page is fully loaded before proceeding.

We can do this by using a WebDriverWait in combination with an ExpectedCondition:

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[data-testid="tweet"]')))

Here we're telling Selenium to wait up to 10 seconds for an element matching the CSS selector [data-testid="tweet"] to be present on the page. That selector will match the <div> that contains an individual tweet. So once it's found, we know the initial set of tweets has loaded.

Locating and Extracting Data

Now we're ready for the core part: finding and pulling out the data points we want. Using Chrome DevTools to inspect the page source, we can identify the selectors needed to locate each element.

For example, the profile full name can be found in an <a> tag with the attribute data-testid="UserName". So we can extract it in Selenium like:

name = driver.find_element(By.CSS_SELECTOR, '[data-testid="UserName"]').text

Similarly, getting the follower count looks like:

followers = driver.find_element(By.XPATH, '//a[contains(@href,"followers")]/span[1]/span[1]').text

We're using an XPath here to find the <span> containing the follower number, by looking for an <a> tag with "followers" in its href attribute. The full XPath is a bit complicated because the follower count number is actually buried a couple <span> levels deep.
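
One caveat: Twitter displays large counts in abbreviated form, such as "1.2K" or "3M". If you want actual numbers for analysis, a small helper can convert them; here's a sketch that assumes English-locale abbreviations:

def parse_count(raw):
    """Convert an abbreviated count like '1.2K' or '3M' to an integer."""
    raw = raw.strip().replace(',', '')
    multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}
    if raw and raw[-1].upper() in multipliers:
        return int(float(raw[:-1]) * multipliers[raw[-1].upper()])
    return int(raw or 0)

followers_count = parse_count(followers)  # e.g. '1.2K' -> 1200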

To extract the tweets, we'll use find_elements to get all the matching tweet <div>s, then iterate through them to pull out the data for each one:

tweets = driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweet"]')
for tweet in tweets:
    text = tweet.find_element(By.CSS_SELECTOR, 'div[data-testid="tweetText"]').text
    reply_count = tweet.find_element(By.CSS_SELECTOR, 'div[data-testid="reply"]').text
    retweet_count = tweet.find_element(By.CSS_SELECTOR, 'div[data-testid="retweet"]').text
    like_count = tweet.find_element(By.CSS_SELECTOR, 'div[data-testid="like"]').text

You can expand on this to extract even more data points like the tweet timestamp, any media/links in the tweet, the nested reply tweets, etc.
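
For example, tweet timestamps typically live in a <time> element with an ISO-8601 datetime attribute; assuming that markup holds (worth confirming in DevTools, since Twitter changes its frontend often), grabbing it inside the loop above is one line:

# Assumes each tweet renders a <time datetime="..."> tag
timestamp = tweet.find_element(By.TAG_NAME, 'time').get_attribute('datetime')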

Handling Pagination

Twitter dynamically loads more tweets as you scroll down the page. To get beyond the first set of tweets, we need to simulate scrolling and wait for the next batch to load.

We can do this by using Selenium to execute a bit of JavaScript that scrolls the page, then pausing and checking whether the page height has grown, which tells us a new batch of tweets loaded:

import time  # for the pause between scrolls

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(5)  # give the next batch of tweets time to load
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break  # page height stopped growing, so no new tweets loaded
    last_height = new_height

This loop keeps scrolling to the bottom of the page until the page height stops increasing, meaning no new tweets loaded during the 5-second pause. You may need to tweak the pause depending on your internet speed.

Saving the Extracted Data

As you extract the data elements, you'll likely want to save them somewhere for further analysis, whether that's a simple CSV file or a database.

Using Python's csv module, and assuming you've collected the per-tweet fields into a list of dicts (as we do in the full script below), writing the tweets to a file would look something like:

import csv

with open('tweets.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'username', 'text', 'replies', 'retweets', 'likes'])
    # tweets is the list of per-tweet dicts collected while scrolling
    for tweet in tweets:
        writer.writerow([name, username, tweet['text'], tweet['replies'],
                         tweet['retweets'], tweet['likes']])

Be sure to use encoding='utf-8' when opening the file; otherwise any non-ASCII characters in the tweets may cause errors.
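
If you'd rather preserve the nested structure instead of flattening it into rows, dumping the same list of tweet dicts to JSON is just as easy. A quick sketch:

import json

# ensure_ascii=False keeps emoji and other non-ASCII characters readable
with open('tweets.json', 'w', encoding='utf-8') as f:
    json.dump(tweets, f, ensure_ascii=False, indent=2)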

Putting It All Together

Here's a condensed version of the full script with all the pieces:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

def scrape_profile(username):

    driver.get(f'https://twitter.com/{username}')

    # Wait for the profile page and the first batch of tweets to load
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[data-testid="tweet"]')))

    name = driver.find_element(By.CSS_SELECTOR, '[data-testid="UserName"]').text
    bio = driver.find_element(By.CSS_SELECTOR, '[data-testid="UserDescription"]').text
    location = driver.find_element(By.CSS_SELECTOR, '[data-testid="UserLocation"]').text
    website = driver.find_element(By.CSS_SELECTOR, '[data-testid="UserUrl"]').text
    join_date = driver.find_element(By.CSS_SELECTOR, '[data-testid="UserJoinDate"]').text
    following = driver.find_element(By.XPATH, '//a[contains(@href,"following")]/span[1]/span[1]').text
    followers = driver.find_element(By.XPATH, '//a[contains(@href,"followers")]/span[1]/span[1]').text

    tweets = []
    seen = set()  # track tweet texts so repeated passes don't add duplicates
    last_height = driver.execute_script('return document.body.scrollHeight')
    while True:
        page_tweets = driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweet"]')
        for tweet in page_tweets:
            text = tweet.find_element(By.CSS_SELECTOR, 'div[data-testid="tweetText"]').text
            if text in seen:
                continue
            seen.add(text)
            reply_count = tweet.find_element(By.CSS_SELECTOR, 'div[data-testid="reply"]').text
            retweet_count = tweet.find_element(By.CSS_SELECTOR, 'div[data-testid="retweet"]').text
            like_count = tweet.find_element(By.CSS_SELECTOR, 'div[data-testid="like"]').text
            tweets.append({'text': text, 'replies': reply_count, 'retweets': retweet_count, 'likes': like_count})

        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(5)  # give the next batch of tweets time to load
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # page height stopped growing, so no more tweets
        last_height = new_height

    return {
        'name': name,
        'bio': bio,
        'location': location,
        'website': website,
        'join_date': join_date,
        'following': following,
        'followers': followers,
        'tweets': tweets
    }

profile_data = scrape_profile('username')
print(profile_data)

driver.quit()

There's always room for improvement and additional error handling, but this covers the core flow of scraping a Twitter profile with Selenium. Some ideas to expand on this (the error-handling point is sketched just after the list):

  • Scrape multiple profiles in a loop or concurrently with multi-threading
  • Handle profiles with very large numbers of tweets more efficiently
  • Expand data points to include media, links, reply threads, etc
  • Integrate with a database or pipeline for further analysis and visualization
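
On the error-handling point: some tweets (media-only posts, for example) may be missing a text element, and a single failed lookup shouldn't crash the whole run. A hedged sketch of a guard helper:

from selenium.common.exceptions import NoSuchElementException

def safe_text(element, by, selector):
    """Return the text of a child element, or '' if it isn't present."""
    try:
        return element.find_element(by, selector).text
    except NoSuchElementException:
        return ''

text = safe_text(tweet, By.CSS_SELECTOR, 'div[data-testid="tweetText"]')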

Ethical Considerations

While this tutorial focused on the technical aspects, it's important to consider the ethics and legality of web scraping. Before scraping any website, you should check the terms of service to ensure you're not violating them. Many websites prohibit scraping in their TOS.

Even in cases where it's legally permitted, think about whether mass data collection is ethical, especially when it includes personal information. And be sure to follow basic bot etiquette: don't hammer sites with requests too quickly, properly identify your scraper, and avoid scraping any sensitive or protected info.
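
As a rough sketch of what that etiquette can look like in practice, you can identify your scraper through a custom user agent and add randomized pauses between page loads. The bot name, contact URL, and handles below are placeholders:

import random
import time

from selenium import webdriver

options = webdriver.ChromeOptions()
# Identify the bot and give site operators a way to learn more about it
options.add_argument('--user-agent=MyResearchBot/1.0 (+https://example.com/bot-info)')
driver = webdriver.Chrome(options=options)

for username in ['user1', 'user2']:  # hypothetical handles
    driver.get(f'https://twitter.com/{username}')
    # ... scrape the page ...
    time.sleep(random.uniform(5, 15))  # randomized delay between requests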

Twitter in particular has been known to ban accounts they suspect of using automated scraping. So if you're scraping large amounts of Twitter data for an extended period of time, you may want to look into getting approval for academic research access to their API instead.

Open Source Scraping Tools

If you need more robust Twitter scraping capabilities beyond a basic script, there are some great open source tools built with Selenium to automate the process:

  • twitter-scraper: Scrapes tweets based on search terms or profiles with optional date ranges, follower lookahead, and more.

  • kscraper: Selenium-based Twitter scraper that handles login, search, profile scraping, and organizing data.

  • Twitter Scraper Selenium: Twitter scraper using Selenium to automate browser actions and BeautifulSoup for parsing data.

Depending on your particular use case and the complexity of data you need to extract, these tools can be a great starting point versus building everything from scratch yourself. Just be sure to review the code and understand what it's doing before running it.

Conclusion

Web scraping Twitter data using Python and Selenium is a powerful technique for collecting data for analysis, research, or archival purposes. With some knowledge of HTML and CSS selectors, you can precisely target and extract the desired data points from Twitter's frontend.

Some key considerations to keep in mind when scraping Twitter:

  • Twitter's frontend is highly dynamic, so waiting for page loads and using scrolling to paginate is crucial
  • Selectors may change over time, so using specific attributes like data-testid is more stable than classes
  • Rate limiting your requests and rotating user agents/IP addresses can help avoid detection and bans
  • Respect website terms of service and don't abuse scraping to harvest personal data unethically

Selenium is a powerful tool for scraping all sorts of dynamic web pages, not just Twitter. Adapting the techniques from this tutorial, you can extract valuable data from many sites that are tricky to scrape with plain HTTP requests.

Hopefully this guide gives you a solid foundation for getting started with web scraping Twitter. As always, happy coding!
