Democracy Visualized

When it comes to getting academic work in front of a wider audience, I think effective visualizations are the most valuable thing I can spend my time on. To practice this skill, I participated in the UC Davis Data Science Initiative's proposition visualization challenge along with a fellow sociology grad student. The goal of the challenge was to take a proposition on California's November 6, 2018 ballot and find a way to visualize it that would help people decide how to vote. You can see the result at the bottom of the post, and read about its creation below.

My partner and I decided to tackle Proposition 11, which concerned emergency medical workers. We chose this one because my partner studies labor and work, while I used to be an Emergency Medical Technician (EMT). As an overview, the proposition would allow ambulance providers to require workers to remain on call during their breaks, and would require providers to offer certain mental health services and training. This all sounds good, so much so that there wasn't even an official No argument submitted to the California voter guide. Things are rarely that simple.

The main supporter and nearly sole funder of the proposition was American Medical Response (AMR), one of the largest private ambulance services in the US, which spent $30 million on the campaign. The company had some pending lawsuits that would be dismissed if the proposition passed. Many of the proposition's features, like the training and mental health services, were also already provided to medical personnel. Something was off.

It proved particularly difficult to create a meaningful and informative visualization. We thought of looking at response times, pay for workers, and the effect of breaks (or the lack thereof) on worker health and alertness, but some element was always missing. We knew we wanted to find some way to show what people were saying on the No side, since no official statement had been sent to voters in the voting guide. We settled on using Twitter to find out how people were talking about the proposition.

I created a bot to scrape Twitter every 5 hours for hashtags related to the proposition (#NoOn11, #YesOn11, etc.). After a few days we had a surprisingly low number of tweets, which may have been caused by the limitations of Twitter's free API. With what we did have, we wanted to showcase how people who supported and opposed the proposition were talking about it. We tried sorting purely by hashtag, but that got muddled when people were insulting the other side's hashtag, or were using the neutral #Prop11 tag while taking a side in the text of the tweet itself. I tried a simple string search for the words "yes" or "no" in the tweet, but that again produced misclassifications. I ended up with an overly complicated solution, but one that I learned from.
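If you're curious what a scraper like this looks like, here is a minimal sketch in R assuming the rtweet package; the actual bot may have been structured quite differently:

    # Minimal sketch of a hashtag scraper, assuming the rtweet package;
    # not the exact code the bot used.
    library(rtweet)

    query <- "#NoOn11 OR #YesOn11 OR #Prop11"
    pulls <- list()

    repeat {
      batch <- search_tweets(query, n = 1000, include_rts = FALSE)
      pulls[[length(pulls) + 1]] <- batch
      saveRDS(pulls, "prop11_tweets.rds")  # persist results between pulls
      Sys.sleep(5 * 60 * 60)               # wait five hours, then pull again
    }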

I implemented my first "complex" machine learning algorithm: a neural net classifier based on the text of the tweets themselves, which sorted them into Yes, No, and Neutral tweets. It's not perfect, and I know I had too little data and probably violated some assumptions, but this was a small, hacky challenge and a good space to experiment. It ended up working well enough. I then took these classified tweets and compared each to the text of the official Yes statement, to see how much that statement was steering the conversation around the proposition, given that there was no opposing statement.
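To give a flavor of what that classifier looked like, here is a hedged sketch using the keras package in R; the real model lives in the repository and likely differs in its details. It assumes a data frame of hand-labelled tweets, with hypothetical column names:

    # Small text classifier sketch with the keras R package. Assumes a
    # data frame `tweets` with columns `text` (character) and `label`
    # (0 = No, 1 = Neutral, 2 = Yes) -- hypothetical names.
    library(keras)

    tokenizer <- text_tokenizer(num_words = 2000) %>%
      fit_text_tokenizer(tweets$text)

    x <- texts_to_sequences(tokenizer, tweets$text) %>%
      pad_sequences(maxlen = 40)                    # tweets are short
    y <- to_categorical(tweets$label, num_classes = 3)

    model <- keras_model_sequential() %>%
      layer_embedding(input_dim = 2000, output_dim = 16, input_length = 40) %>%
      layer_global_average_pooling_1d() %>%
      layer_dense(units = 16, activation = "relu") %>%
      layer_dense(units = 3, activation = "softmax")  # Yes / No / Neutral

    model %>% compile(
      optimizer = "adam",
      loss      = "categorical_crossentropy",
      metrics   = "accuracy"
    )

    model %>% fit(x, y, epochs = 10, validation_split = 0.2)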

The answer? None of the tweets really matched the Yes statement, whether for or against. But at least it provided a way to show the volume of tweets on each side, and to surface the text of each so that those who were opposed could have their position known. It was a quick challenge, and each group's entry had its own flavor of jankiness, but everyone involved learned something new. I think it was a great exercise, and our group did end up winning the challenge, which was nice.
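For the comparison itself, one simple way to measure how closely a tweet tracks the Yes statement is cosine similarity over word counts. Here is a toy base-R sketch of that idea, not necessarily the method in the repo:

    # Toy cosine-similarity check in base R; `tweets` is a character
    # vector and `yes_statement` a single string (hypothetical names).
    docs  <- c(yes_statement, tweets)
    words <- strsplit(trimws(tolower(gsub("[^A-Za-z ]", " ", docs))), "\\s+")
    vocab <- unique(unlist(words))

    # Term-frequency matrix: one row per document, one column per word
    tf <- t(vapply(words,
                   function(w) tabulate(match(w, vocab), length(vocab)),
                   numeric(length(vocab))))

    cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
    sims   <- apply(tf[-1, , drop = FALSE], 1, cosine, b = tf[1, ])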

Here you can see the visualization. The size of each dot represents the number of times that tweet was favorited. The classification wasn't perfect, but you can mouse over the dots to read the text of the tweet! If you want to see how I made this, you can find the GitHub repository here.
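If you'd like to build something similar, an interactive scatter like this takes only a few lines with the plotly package. This sketch assumes a data frame with hypothetical column names, and is not the actual code behind the visualization:

    # Hedged sketch of a similar interactive plot with plotly; the column
    # names (created_at, class, favorite_count, text) are assumptions.
    library(plotly)

    plot_ly(
      data = tweets_df,
      x    = ~created_at, y = ~class,
      size = ~favorite_count,   # dot size scales with favorites
      text = ~text,             # full tweet text appears on hover
      hoverinfo = "text",
      type = "scatter", mode = "markers"
    )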

Parallelization in R

The complaint I've heard most often about R is that it is slow compared to other languages. While it has its own set of tricks, like the apply family of functions, it can still slow to a crawl on larger tasks. I encountered one such scenario recently while lemmatizing text, i.e. reducing words to their dictionary forms (so that studying and studied both become study). Because the package I was using was really just a wrapper around an external executable, none of R's usual tricks were of much use, as it would still have to feed data one chunk at a time to that executable.
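As a quick illustration of the idea, here is lemmatization in one line with the textstem package (which, to be clear, is not the external-executable wrapper described above):

    # Lemmatization example with textstem; just to show the concept,
    # not the wrapper package used in the actual project.
    library(textstem)
    lemmatize_words(c("studying", "studied", "studies"))
    #> [1] "study" "study" "study"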

I was at a loss as to how I could speed this process up, as nothing I knew of could boost the I/O of the external program aside from a faster server. I ultimately decided that if I couldn't make it faster, I could always just do it more. I started by manually chunking my data into ten parts and running my code over each one in a separate instance of R (a bad idea rife with possibilities for error; I don't recommend it). This of course was not successful, and would still have taken 220 hours at the absolute minimum. So what to try next?

Well, how about I tell R to make its own clone instances? After talking with some colleagues, I learned of the 'foreach' family of parallelization packages, a neat take on the classic loop. A foreach loop requires some setup: you first make a cluster of child R sessions (limited by the number of cores in your processor). Once you register this cluster, the foreach loop sends one iteration to each of the child sessions, then stitches the results from each back together. With 30 clone instances running, I was hoping it would finish 30 times faster, but unfortunately I/O still takes time. Still, going from an ideal 220 hours to overnight isn't bad.
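Here is a minimal sketch of that setup with foreach and doParallel; lemmatize_chunk() is a hypothetical stand-in for whatever the external-executable wrapper call actually is:

    # Minimal foreach/doParallel sketch; lemmatize_chunk() is a
    # hypothetical stand-in for the real wrapper call.
    library(foreach)
    library(doParallel)

    cl <- makeCluster(parallel::detectCores() - 1)  # one child session per core
    registerDoParallel(cl)

    # Split the documents into one chunk per worker
    chunks <- split(docs, cut(seq_along(docs), length(cl), labels = FALSE))

    # Each iteration runs in its own child R session; .combine = c
    # stitches the per-chunk results back into one vector
    lemmas <- foreach(chunk = chunks, .combine = c) %dopar% {
      lemmatize_chunk(chunk)
    }

    stopCluster(cl)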

You can find the code I used on my GitHub.