Learnings from Initial Commit of Social Bots on Twitter Analysis

April 2, 2022 Kory Hayward

Since I published “Are Social Bots Ruining (and Running) Discourse Online?” I’ve been working on a couple of Python scripts that I’ll use to automate data collection for the analysis later this year. If you haven’t read that post, I encourage you to do so.

If you want the TL;DR version of that post, here it is: There is a claim that true social bots (those that are indistinguishable from humans) have a disproportionate effect on our online discourse, particularly around sensitive topics (politics, Covid-19 information, etc.). A set of tools has been created that claims to accurately discriminate between humans and social bots. My hypothesis is that those claims are a bunch of baloney. So, I’m doing my own analysis using the 2022 midterm elections as a natural experiment.

Now, if you’re curious about my progress and methods to date, check out the repo on GitHub. Below is a summary of what I’ve done and the issues I’ve run into so far, in prose as pithy as I can muster.

Securing Lists of Candidates in the 2022 Midterm Election

After some trial and error (checking whether I could plug into an API to get information from AP, DealDesk, etc., or get by with some targeted web scraping), I settled on scraping the Wikipedia pages for each state’s 2022 federal elections using BeautifulSoup. Thankfully, the tables reporting the winners of each primary election (for congressional districts and Senate seats) share the same structure across pages.

My script loops over a dictionary of pages, finds the appropriate table in each, selects the winner of each race (identified by a specific HTML tag), and appends that information to a list. After some light string manipulation of the list of winners, the output is written to a state-specific file.
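For readers who want a picture of that loop, here is a minimal sketch. It is not the script from the repo. To keep it runnable without extra installs it uses Python’s built-in html.parser as a stand-in for BeautifulSoup, and it assumes two things the post does not spell out: that the results tables carry the "wikitable" class, and that the winner of each race is the entry wrapped in a <b> tag. The function names, the fetch callable, and the file naming are all hypothetical.

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class WinnerParser(HTMLParser):
    """Collect bolded text from cells of tables with class "wikitable".

    Assumption: the winner of each race is the <b>-wrapped entry.
    """

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.in_cell = False
        self.in_bold = False
        self.buffer = []
        self.winners = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = dict(attrs).get("class") or ""
            self.in_table = "wikitable" in classes.split()
        elif tag == "td" and self.in_table:
            self.in_cell = True
        elif tag == "b" and self.in_cell:
            self.in_bold = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "b" and self.in_bold:
            self.in_bold = False
            self.winners.append("".join(self.buffer))
        elif tag == "td":
            self.in_cell = False
            self.in_bold = False
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        if self.in_bold:
            self.buffer.append(data)


def clean_name(raw):
    """Light string cleanup: drop footnote markers and "(incumbent)"."""
    text = re.sub(r"\[[^\]]*\]|\(incumbent\)", "", raw, flags=re.IGNORECASE)
    return " ".join(text.split())


def extract_winners(html):
    """Return the cleaned, non-empty winner names found in one page."""
    parser = WinnerParser()
    parser.feed(html)
    names = [clean_name(w) for w in parser.winners]
    return [n for n in names if n]


def scrape_states(pages, fetch, out_dir):
    """Loop over {state: url}, extract winners, write one file per state."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for state, url in pages.items():
        winners = extract_winners(fetch(url))
        path = out_dir / f"{state}_winners.txt"
        path.write_text("\n".join(winners), encoding="utf-8")
        results[state] = winners
    return results
```

Here fetch is any function that takes a URL and returns the page’s HTML as a string, which keeps the parsing testable offline. Real pages often bold more than the winner’s name (vote totals, for instance), so the cleanup step would likely need more rules than the two shown.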

Tracking Twitter Followers of General Election Candidates

This is where the real fun begins.

A quick aside: Years ago, right as the pandemic started, I was part of College Board’s Policy team. We were tapped to track shifts in school instruction models (in-person, hybrid, remote) for schools in the nation’s 250 largest districts, and report that information back to senior leadership and program teams. At first, six of us did this by hand. Each week we’d spend more than half a day going district website by district website to capture and report on any changes.

I was painfully bored, and I knew there was a more efficient way to source and report this information. So, as someone new to coding with Python, I set myself the task of writing a script that would somehow gather, structure, and report this information to our team. I settled on leveraging Twitter’s API to gather districts’ tweets each day. I applied some NLP to discern what the tweets were saying and whether they reported any shifts in instruction models. Now we were in business (and I got back nearly a full day of work).

All that to say: I’m thankful I already had a Twitter Developer Account that I could use. Taking the list of general election candidates, I pass that information to Twitter through its API. I quickly hit a snag: Twitter’s API does not currently let users see or filter accounts by the Election Label that Twitter applies to all congressional and statewide candidates. So, I am forced to query Twitter with the names from my list of general election candidates as search terms, which returns a list of accounts that may belong to a candidate for office later this fall.

Unsurprisingly, this is a pretty involved process. I pass the list of names to Twitter using a for loop, iterating over each name in the list, and receive back a set of account-specific information that I then process by hand to separate the general election candidates from the chaff. Until Twitter’s API offers an endpoint for selecting candidates for office, this manual processing is required. So, I’m in for a lot of late nights later this spring / summer. (I will happily accept any contribution to this work in the form of a bottle of pinot noir or sparkling wine.)
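A rough sketch of that loop, with the manual step left manual: it queries once per candidate name and flattens whatever comes back into rows that can be opened in a spreadsheet for review. It is an illustration under assumptions, not the code in the repo. search_users is a hypothetical callable (for example, a thin wrapper around the v1.1 users/search endpoint) that returns a list of dicts, and the keys used below follow the v1.1 user object (id_str, screen_name, name, verified, description). The max_per_name cutoff is also made up for the example.

```python
import csv
import io

# Column order for the review sheet (an illustrative choice).
FIELDS = ["candidate", "id_str", "screen_name", "name", "verified", "description"]


def collect_candidate_accounts(names, search_users, max_per_name=5):
    """Query once per candidate name and flatten the hits into review rows.

    search_users is a hypothetical callable: name -> list of user dicts.
    """
    rows = []
    for name in names:
        for user in search_users(name)[:max_per_name]:
            rows.append({
                "candidate": name,
                "id_str": user["id_str"],
                "screen_name": user["screen_name"],
                "name": user["name"],
                "verified": user.get("verified", False),
                "description": user.get("description", ""),
            })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text for manual review."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Keeping the searched-for name in every row makes it easier to see, at review time, which query produced which account, since a common name can return several plausible matches.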

Once I’ve figured out which accounts are actually those I want to analyze, I pass those account IDs back to Twitter and request their followers’ account IDs. These are captured in a list and written to a file specific to each general election candidate.
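The follower download can be sketched the same way. Twitter’s v1.1 followers/ids endpoint returns IDs in pages of up to 5,000 and uses cursors: you start at -1 and keep requesting until next_cursor comes back as 0. The sketch below assumes a hypothetical fetch_page(user_id, cursor) callable that returns one page as a dict with "ids" and "next_cursor" keys; the function names and the file naming are illustrative, and rate-limit handling is left out.

```python
from pathlib import Path


def collect_follower_ids(user_id, fetch_page):
    """Walk the cursored pages for one account and return all follower IDs.

    fetch_page is a hypothetical callable: (user_id, cursor) -> page dict.
    """
    ids = []
    cursor = -1  # v1.1 cursoring starts at -1
    while cursor != 0:  # a next_cursor of 0 means there are no more pages
        page = fetch_page(user_id, cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
    return ids


def write_follower_file(screen_name, ids, out_dir):
    """Write one ID per line to a file named for the candidate's account."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{screen_name}_followers.txt"
    path.write_text("\n".join(str(i) for i in ids), encoding="utf-8")
    return path
```

For accounts with large followings, the endpoint’s rate limit (15 requests per 15-minute window) means a real fetch_page would need to wait between calls, so a full download can take hours.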

Next Steps

I’ll continue to repeat this process as primaries are held: source general election candidates via Wikipedia, secure their Twitter account information, and download a list of their followers for analysis after the general election this November.

Beyond the analysis I’m already planning, I’m curious if anyone has opinions of or suggestions for additional analyses I should do in tandem. I’ve been kicking around the idea of a network analysis to determine the connection between followers of a given account and across accounts. Are some individuals following, engaging with others, and driving conversations online about specific candidates, particularly those outside their own congressional district? If so, what do these “super users” mean for our online discourse?
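One simple starting point for that idea, offered as a sketch rather than a plan: treat each candidate’s follower list as a set, measure how much any two sets overlap (Jaccard similarity), and count how many candidate accounts each follower shows up under. The function names and the min_accounts threshold are hypothetical, and this only captures who follows whom, not who is engaging or driving conversation, which would need tweet-level data.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Share of followers two accounts have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def follower_overlap(followers_by_candidate):
    """Jaccard similarity for every pair of candidate accounts."""
    return {
        (x, y): jaccard(followers_by_candidate[x], followers_by_candidate[y])
        for x, y in combinations(sorted(followers_by_candidate), 2)
    }


def super_users(followers_by_candidate, min_accounts=3):
    """Followers who appear under at least min_accounts candidate accounts."""
    counts = Counter()
    for ids in followers_by_candidate.values():
        counts.update(set(ids))  # set() so duplicate IDs count once per account
    return {uid: n for uid, n in counts.items() if n >= min_accounts}
```

The input is a dict mapping each candidate’s account to its list of follower IDs, which is the shape the per-candidate follower files would load into.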

More to come! Stay tuned.