Let’s say that you want to take a set of documents and apply a computational linguistics technique. If your method is based on the bag-of-words model, you probably need to pre-process these documents first by segmenting, tokenizing, stripping, stopwording, and stemming each one (phew, that’s a lot of -ing’s).
In the past, I’ve relied on NLTK to perform these tasks. Python is my strongest language and NLTK is mature, fast, and well-documented. However, I’ve been focusing on performing tasks entirely within R lately, and so I’ve been giving the tm package a chance. So far, I’ve been disappointed with its speed (at least in a relative sense).
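For readers who haven’t used tm, the kind of pipeline I have in mind looks roughly like this. This is just a minimal sketch with a recent version of tm, assuming the raw documents are already in a character vector called `docs`:

```r
library(tm)
library(SnowballC)  # supplies the Porter stemmer used by stemDocument()

# docs is a character vector with one document per element
corpus <- Corpus(VectorSource(docs))

# strip, stopword, and stem each document
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

# bag-of-words representation
dtm <- DocumentTermMatrix(corpus)
```

Nothing exotic is going on here; each tm_map call walks the entire corpus.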
Here’s a simple example that hopefully some better R programmer out there can help me with. I have been tracking tweets on the #25bahman hashtag (97k of them so far). The total size of the dataset is 18 MB, so even fairly inefficient code should have no problem with this. I want to build a corpus of documents where each tweet is a document and each document is a bag of stemmed, stripped tokens. You can see the code in my repository here, and I’ve embedded two of the three examples below.
Big N.B.: The R code wasn’t done after 15 minutes. My laptop has an i7 640M and 8 GB of RAM, so the issue isn’t my machine. The timing below actually excludes the last line of the embedded R example.
Here are the timing results, with 3 runs per example. As you can see, R/tm is more than twice as slow to build the unprocessed corpus as Python/NLTK is to build the fully processed corpus. The comparison only gets worse when you parallelize the Python code.
- Python, unparallelized: ~1:05
- Python, parallelized (8 runners): ~0:45
- R, unparallelized: ~2:15 (just to load the documents)
- Update: R, unparallelized: ~29 minutes (to pre-process all documents)
So, dear Internet friends, what can I do? Am I using tm properly? Is this just a testament to the quality of NLTK? Am I cursed to forever write pre-processing code in one language and perform analysis in another?
What about using Python, then feeding the results to R via rpy2?
Hi,
I think that’s the fastest solution that combines the two languages, but my hope was that R/tm would scale well enough on its own.
Another approach would be to call Python directly with the system/C interface in R, similar to some functions in this package: http://gking.harvard.edu/readme (look into the ‘undergrad’ function in the prototype.R file).
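A bare-bones version of that idea (the script name here is just a placeholder) would be something like:

```r
# write the raw tweets out, shell out to a (placeholder) Python/NLTK
# pre-processing script, then read the processed documents back into R
writeLines(tweets, "tweets_raw.txt")
system("python preprocess.py tweets_raw.txt tweets_processed.txt")
processed <- readLines("tweets_processed.txt")
```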
This is totally consistent with my own experiences with text processing in R. R just goes oddly slowly when dealing with strings. If you find any solutions, I’d love to hear about them.
Have you tried to contact the authors of tm about this? Maybe they can help.
Did you ever make any progress with this? I am facing the same issue, but R is my stronger language.
Hi Trey,
My subjective experience has been that tm is faster on smaller documents, but still exceedingly slow for larger documents.
Thanks for the follow-up. Looks like a multiple-language workflow lies ahead.
Hello,
In tweets, there are many terms that have duplicated characters, for example “loveeeeee”. Do you know how to process these terms and replace them with, for example, “love”?
Thanks a lot
Hi Nadia,
You could try to use a stemming algorithm, but, in general, Twitter text is very difficult to process for reasons like this.
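If you want something more aggressive than a plain stemmer, one rough heuristic (just a sketch, I haven’t benchmarked it on a real corpus) is to collapse any letter repeated three or more times before stemming:

```r
library(SnowballC)

# collapse runs of three or more identical letters to a single letter
squash_repeats <- function(x) gsub("([[:alpha:]])\\1{2,}", "\\1", x, perl = TRUE)

squash_repeats("loveeeeee")                        # "love"
wordStem(squash_repeats("loveeeeee"), "english")   # stemmed version
```

The {2,} threshold leaves legitimate doubled letters (“too”, “letter”) alone, which is usually the behavior you want.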
It would be interesting to compare the speed to TextAnalysis.jl, which is written in Julia.
Great idea. If I get some time this weekend, I’ll update my ancient Julia install and give it a whirl.
Did you ever end up trying to run things with TextAnalysis.jl? If so, how did you fare?
You might want to try simply downloading R 3.2; I believe the string processing facilities have been dramatically improved.
Check out quanteda. It’s very similar in functionality to tm, but is in my opinion much better designed from a software engineering perspective. Quanteda can also deal more easily with larger data sets.
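For anyone who finds this later, the equivalent pipeline in a recent version of quanteda looks roughly like this (again just a sketch, assuming the tweets are in a character vector `tweets`):

```r
library(quanteda)

# tokenize, lowercase, drop stopwords, and stem
toks <- tokens(tweets, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)

# document-feature matrix: quanteda's bag-of-words representation
dfmat <- dfm(toks)
```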