Pre-processing text: R/tm vs. python/NLTK

  Let’s say that you want to take a set of documents and apply a computational linguistic technique.  If your method is based on the bag-of-words model, you probably need to pre-process these documents first by segmenting, tokenizing, stripping, stopwording, and stemming each one (phew, that’s a lot of -ing’s).  
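
  For concreteness, here is a minimal sketch of that pipeline in NLTK.  The 'preprocess' function and its particular choices (Porter stemmer, English stopword list, alphabetic-token filter) are illustrative, not the exact code from my repository:

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# One-time setup (uncomment on first run):
# nltk.download("punkt"); nltk.download("stopwords")

stemmer = PorterStemmer()
stop_set = set(stopwords.words("english"))

def preprocess(document):
    """Return a bag of stemmed, stripped, stopword-free tokens."""
    tokens = nltk.word_tokenize(document.lower())
    # Keep alphabetic tokens only, drop stopwords, then stem what remains.
    return [stemmer.stem(token) for token in tokens
            if token.isalpha() and token not in stop_set]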

  In the past, I’ve relied on NLTK to perform these tasks.  Python is my strongest language and NLTK is mature, fast, and well-documented.  However, I’ve been focusing lately on performing tasks entirely within R, and so I’ve been giving the tm package a chance.  So far, I’ve been disappointed with its speed (at least in a relative sense).

  Here’s a simple example that hopefully some better R programmer out there can help me with.  I have been tracking tweets on the #25bahman hashtag (97k of them so far).  The total size of the dataset is 18 MB, so even fairly inefficient code should have no problem with this.  I want to build a corpus of documents where each tweet is a document and each document is a bag of stemmed, stripped tokens.  You can see the code in my repository here, and I’ve embedded two of the three examples below.

Big N.B.: The R code wasn’t done after 15 minutes.  My laptop has an i7 640M and 8 GB of RAM, so the issue isn’t my machine.  The timing below actually excludes the last line of the embedded R example.

Here are the timing results, with three runs per example.  As you can see, R/tm is more than twice as slow to build the unprocessed corpus as Python/NLTK is to build the fully processed corpus.  The comparison only gets worse when you parallelize the Python code (a sketch of the parallel version follows the list).

  1. Python, unparallelized: ~1:05
  2. Python, parallelized (8 runners): ~0:45
  3. R, unparallelized: ~2:15 (just to load the documents)
  4. Update: R, unparallelized: ~29 minutes (to pre-process all documents)
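
For item 2, here is a rough sketch of how the parallel run can be structured, assuming the 'preprocess' function sketched above and a list of tweet strings named 'tweets':

from multiprocessing import Pool

def build_corpus(tweets, workers=8):
    """Pre-process every tweet in parallel across worker processes."""
    # Note: call this under "if __name__ == '__main__':" on platforms
    # that spawn worker processes (e.g. Windows).
    with Pool(processes=workers) as pool:
        return pool.map(preprocess, tweets)

# corpus = build_corpus(tweets)  # one bag of tokens per tweet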

  So, dear Internet friends, what can I do? Am I using tm properly? Is this just a testament to the quality of NLTK? Am I cursed to forever write pre-processing code in one language and perform analysis in another?

10 comments on “Pre-processing text: R/tm vs. python/NLTK”
  1. wrobell says:

    what about using python, then feeding the results to R via rpy2?

    • mjbommar says:

      Hi,
      I think that’s the fastest solution that combines the two languages, but my hope was that R/tm would scale well enough on its own.
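
    A rough sketch of that rpy2 hand-off, for reference; the names 'push_to_r' and 'tweet_tokens' are illustrative, not part of rpy2 itself:

    import rpy2.robjects as robjects

    def push_to_r(bags_of_tokens):
        """Expose Python token bags to R as a named list of character vectors."""
        r_list = robjects.ListVector({
            str(i): robjects.StrVector(tokens)
            for i, tokens in enumerate(bags_of_tokens)
        })
        robjects.globalenv["tweet_tokens"] = r_list  # now visible from R code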

  2. mm says:

    Another approach would be to call Python directly with the system/C interface in R, similar to some functions in this package: http://gking.harvard.edu/readme (look at the ‘undergrad’ function in the prototype.R file).

  3. This is totally consistent with my own experiences with text processing in R. R just goes oddly slowly when dealing with strings. If you find any solutions, I’d love to hear about them.

  4. X says:

    Have you tried to contact the authors of tm about this? Maybe they can help.

  5. Trey says:

    Did you ever make any progress with this? I am facing the same issue, but R is my stronger language.

  6. Nadia Félix says:

    Hello,

    In tweets, there are many terms that have duplicated characters, for example “loveeeeee”.  Do you know how to process these terms and replace them with, for example, “love”?

    Thanks a lot
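
    A common approach (not from the post itself) is a regular expression that collapses runs of three or more repeated characters; the helper name 'collapse_repeats' below is illustrative:

    import re

    # A character followed by two or more repeats of itself.
    REPEAT_RE = re.compile(r"(.)\1{2,}")

    def collapse_repeats(token):
        """E.g. 'loveeeeee' -> 'love'; doubles like 'good' stay intact."""
        return REPEAT_RE.sub(r"\1", token)

    Substituting r"\1\1" instead (collapsing to two characters) and then checking the result against a dictionary is often safer, since English words rarely triple a letter.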
