
Parallelization in R

The complaint I've heard most often about R is that it is slow compared to other languages. While it has its own set of tricks, like the apply family of functions, sometimes it still slows to a crawl on larger tasks. I encountered one such scenario recently as I was going through the process of lemmatizing text, or reducing words to their dictionary forms (e.g. studying and studied both become study). Because the package I was using was really just a wrapper around an external executable, none of R's usual tricks were of much use, as it would still have to feed data one chunk at a time to the external program.
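
For a sense of what lemmatization looks like, here's a minimal illustration using the textstem package. This is just a stand-in; the package I actually used wrapped an external executable, which is where the bottleneck came from:

    # Illustration only: textstem lemmatizes in-process via a lookup table,
    # unlike the external-executable wrapper described above.
    library(textstem)

    lemmatize_words(c("studying", "studied"))
    # both should reduce to "study"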

I was at a loss as to how I could speed this process up, as nothing I knew of could boost the I/O of the external program aside from a faster server. I ultimately decided that if I couldn't make it faster, I could always just do more of it at once. I started by manually chunking my data into ten parts (see the sketch below) and running my code over each part in a separate instance of R (a bad idea rife with possibilities for error; I don't recommend it). This of course was not successful, and would still have taken 220 hours at the absolute minimum. So what to try next?
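
A minimal sketch of that manual chunking, assuming the data lives in a character vector called words (a hypothetical name):

    # Split `words` into ten roughly equal, contiguous chunks.
    chunks <- split(words, cut(seq_along(words), breaks = 10, labels = FALSE))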

Well, how about I tell R to make its own clone instances? After talking with some colleagues, I learned of the 'foreach' family of parallelization functions, a neat take on the classic loop. A foreach loop requires some setup: you first make a cluster of child R sessions (limited by the number of cores in your processor). Once you register this cluster, the foreach loop sends one iteration to each of these child sessions, and then stitches the results from each back together. Now with 30 clone instances running, I was hoping it would be done 30 times faster, but unfortunately I/O still takes time. Still, going from an ideal 220 hours to overnight isn't bad.
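
Here's a sketch of that pattern using the foreach and doParallel packages, reusing the chunks list from above and assuming a hypothetical lemmatize_chunk() that wraps the external lemmatizer:

    library(foreach)
    library(doParallel)

    # Make and register a cluster of child R sessions, leaving one core free.
    cl <- makeCluster(detectCores() - 1)
    registerDoParallel(cl)

    # Each iteration of the loop is handed to a child session; .combine = c
    # stitches the per-chunk results back together into a single vector.
    results <- foreach(chunk = chunks, .combine = c) %dopar% {
      lemmatize_chunk(chunk)  # hypothetical wrapper around the external executable
    }

    stopCluster(cl)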

You can find the code I used on my GitHub.