Let’s say that you want to take a set of documents and apply a computational linguistics technique. If your method is based on the bag-of-words model, you probably need to pre-process these documents first by segmenting, tokenizing, stripping, stopwording, and stemming each one (phew, that’s a lot of -ing’s).
In the past, I’ve relied on NLTK to perform these tasks. Python is my strongest language and NLTK is mature, fast, and well-documented. However, I’ve been focusing on performing tasks entirely within R lately, and so I’ve been giving the tm package a chance. So far, I’ve been disappointed with its speed (at least in a relative sense).
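For readers who haven’t used tm, the kind of pipeline I have in mind looks roughly like this. This is just a minimal sketch with a recent version of tm, assuming the raw documents are already in a character vector called `docs`:

```r
library(tm)
library(SnowballC)  # supplies the Porter stemmer used by stemDocument()

# docs is a character vector with one document per element
corpus <- Corpus(VectorSource(docs))

# strip, stopword, and stem each document
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

# bag-of-words representation
dtm <- DocumentTermMatrix(corpus)
```

Nothing exotic is going on here; each tm_map call walks the entire corpus.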
Here’s a simple example that hopefully some better R programmer out there can help me with. I have been tracking tweets on the #25bahman hashtag (97k of them so far). The total size of the dataset is 18 MB, so even fairly inefficient code should have no problem with this. I want to build a corpus of documents where each tweet is a document and each document is a bag of stemmed, stripped tokens. You can see the code in my repository here, and I’ve embedded two of the three examples below.
Big N.B.: The R code wasn’t done after 15 minutes. My laptop has an i7 640M and 8 GB of RAM, so the issue isn’t my machine. The timing below actually excludes the last line of the embedded R example.
Here are the timing results, with 3 runs per example. As you can see, R/tm is more than twice as slow to build the unprocessed corpus as Python/NLTK is to build the fully processed corpus. The comparison only gets worse when you parallelize the Python code.
- Python, unparallelized: ~1:05
- Python, parallelized (8 runners): ~0:45
- R, unparallelized: ~2:15 (just to load the documents)
- Update: R, unparallelized: ~29 minutes (to pre-process all documents)
So, dear Internet friends, what can I do? Am I using tm properly? Is this just a testament to the quality of NLTK? Am I cursed to forever write pre-processing code in one language and perform analysis in another?
What about using Python, then feeding the results to R via rpy2?
Hi,
I think that’s the fastest solution that combines the two languages, but my hope was that R/tm would scale well enough on its own.
Another approach would be to call Python directly with the system/C interface in R, similar to some functions in this package: http://gking.harvard.edu/readme (look into the ‘undergrad’ function in the prototype.R file).
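A bare-bones version of that idea (the script name here is just a placeholder) would be something like:

```r
# write the raw tweets out, shell out to a (placeholder) Python/NLTK
# pre-processing script, then read the processed documents back into R
writeLines(tweets, "tweets_raw.txt")
system("python preprocess.py tweets_raw.txt tweets_processed.txt")
processed <- readLines("tweets_processed.txt")
```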
This is totally consistent with my own experiences with text processing in R. R just goes oddly slowly when dealing with strings. If you find any solutions, I’d love to hear about them.
Have you tried to contact the authors of tm about this? Maybe they can help.
Did you ever make any progress with this? I am facing the same issue, but R is my stronger language.
Hi Trey,
My subjective experience has been that tm is faster on smaller documents, but still exceedingly slow for larger documents.
Thanks for the follow-up. Looks like a multiple-language workflow lies ahead.
Hello,
In tweets, there are many terms that have duplicated characters, for example “loveeeeee”. Do you know how to process these terms and replace them with, for example, “love”?
Thanks a lot
Hi Nadia,
You could try to use a stemming algorithm, but, in general, Twitter text is very difficult to process for reasons like this.
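If you want something more aggressive than a plain stemmer, one rough heuristic (just a sketch, I haven’t benchmarked it on a real corpus) is to collapse any letter repeated three or more times before stemming:

```r
library(SnowballC)

# collapse runs of three or more identical letters to a single letter
squash_repeats <- function(x) gsub("([[:alpha:]])\\1{2,}", "\\1", x, perl = TRUE)

squash_repeats("loveeeeee")                        # "love"
wordStem(squash_repeats("loveeeeee"), "english")   # stemmed version
```

The {2,} threshold leaves legitimate doubled letters (“too”, “letter”) alone, which is usually the behavior you want.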
It would be interesting to compare the speed to TextAnalysis.jl, which is written in Julia.
Great idea. If I get some time this weekend, I’ll update my ancient Julia install and give it a whirl.
Did you ever end up trying to run things with TextAnalysis.jl? If so, how did you fare?
You might want to try simply downloading R 3.2; I believe the string processing facilities have been dramatically improved.
Check out quanteda. It’s very similar in functionality to tm, but is in my opinion much better designed from a software engineering perspective. Quanteda can also deal more easily with larger data sets.
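For anyone who finds this later, the equivalent pipeline in a recent version of quanteda looks roughly like this (again just a sketch, assuming the tweets are in a character vector `tweets`):

```r
library(quanteda)

# tokenize, lowercase, drop stopwords, and stem
toks <- tokens(tweets, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)

# document-feature matrix: quanteda's bag-of-words representation
dfmat <- dfm(toks)
```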