
Pre-processing text: R/tm vs. python/NLTK

  Let’s say that you want to take a set of documents and apply a computational linguistic technique.  If your method is based on the bag-of-words model, you probably need to pre-process these documents first by segmenting, tokenizing, stripping, stopwording, and stemming each one (phew, that’s a lot of -ing’s).  
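As a rough illustration of those steps, here is a minimal standard-library sketch of what "pre-processing" means for a single document. It is not the code from the repository: the stopword list is a tiny hypothetical one, and `toy_stem` is a crude suffix stripper standing in for a real stemmer such as the Porter stemmer that NLTK provides.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch); a real run
# would use a fuller list, such as the English list shipped with NLTK.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

# Keep letters, digits, apostrophes, and the Twitter markers # and @.
TOKEN_RE = re.compile(r"[a-z0-9#@']+")


def toy_stem(token):
    """Crude suffix stripper; a stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, strip, drop stopwords, and stem one document."""
    tokens = TOKEN_RE.findall(text.lower())
    tokens = [t.strip("'") for t in tokens]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]


print(preprocess("Protesters are marching in the streets of Tehran #25bahman"))
# ['protester', 'march', 'street', 'tehran', '#25bahman']
```

The result for each tweet is a list of tokens, which can then be counted to form the bag of words.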

  In the past, I’ve relied on NLTK to perform these tasks.  Python is my strongest language and NLTK is mature, fast, and well-documented.  However, I’ve been focusing on performing tasks entirely within R lately, and so I’ve been giving the tm package a chance.  So far, I’ve been disappointed with its speed (at least in a relative sense).

  Here’s a simple example that hopefully some better R programmer out there can help me with.  I have been tracking tweets on the #25bahman hashtag (97k of them so far).  The total size of the dataset is 18 MB, so even fairly inefficient code should have no problem with this.  I want to build a corpus of documents where each tweet is a document and each document is a bag of stemmed, stripped tokens.  You can see the code in my repository here, and I’ve embedded two of the three examples below.

Big N.B.: The R code wasn’t done after 15 minutes.  My laptop has an i7 640M and 8 GB of RAM, so the issue isn’t my machine. The R timing below actually excludes the last line of the embedded R example.

Here are the timing results with 3 runs per example.  As you can see, R/tm is more than twice as slow to build the un-processed corpus as Python/NLTK is to build the processed corpus. The comparison only gets worse when you parallelize the Python code.

  1. Python, unparallelized: ~1:05
  2. Python, parallelized (8 runners): ~0:45
  3. R, unparallelized: ~2:15 (just to load the documents)
  4. Update: R, unparallelized: ~29 minutes (to pre-process all documents)
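For readers wondering what "parallelized" means here, the following is a minimal sketch of spreading per-tweet pre-processing over worker processes with the standard library. It is an assumption about the general approach, not the repository's code: `preprocess` is a placeholder for the real per-document function, and `preprocess_corpus`, `workers`, and `chunksize` are names chosen for this illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def preprocess(text):
    """Placeholder per-document step: lowercase and split on whitespace."""
    return text.lower().split()


def preprocess_corpus(docs, workers=1, chunksize=1000):
    """Pre-process documents, optionally spreading the work over processes."""
    if workers <= 1:
        return [preprocess(d) for d in docs]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches documents so the per-task communication overhead
        # stays small relative to the work done on each short tweet.
        return list(pool.map(preprocess, docs, chunksize=chunksize))


if __name__ == "__main__":
    tweets = ["Rally in Tehran #25bahman", "Streets are FULL today"] * 1000
    serial = preprocess_corpus(tweets)
    parallel = preprocess_corpus(tweets, workers=8)
    assert serial == parallel  # map() returns results in input order
    print(len(parallel), parallel[0])
```

Because each tweet is independent, the work splits cleanly across processes. With documents this small, process start-up and data transfer take up a noticeable share of the run time, which may be part of why eight runners give well under an eightfold speedup.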

  So, dear Internet friends, what can I do? Am I using tm properly? Is this just a testament to the quality of NLTK? Am I cursed to forever write pre-processing code in one language and perform analysis in another?
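If the two-language route does turn out to be necessary, the handoff can be as simple as a flat file. The sketch below assumes the Python side has already produced a list of tokens per document, and writes long-format term counts that R could load with `read.csv`. The function name `write_term_counts` and the column names are hypothetical choices for this example.

```python
import csv
import io
from collections import Counter


def write_term_counts(token_docs, out):
    """Write one (doc_id, term, count) row per distinct term per document."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["doc_id", "term", "count"])
    for doc_id, tokens in enumerate(token_docs, start=1):
        # Sort terms so the output is stable from run to run.
        for term, count in sorted(Counter(tokens).items()):
            writer.writerow([doc_id, term, count])


docs = [["rally", "tehran", "rally"], ["street", "tehran"]]
buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
write_term_counts(docs, buffer)
print(buffer.getvalue())
```

The long format keeps the file small for sparse data like tweets, and it can be reshaped into a document-term matrix on the R side.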

12 Comments

  1. wrobell

    what about using python, then feeding the results to R via rpy2?

    1. mjbommar

      Hi,
      I think that’s the fastest solution that combines the two languages, but my hope was that R/tm would scale well enough on its own.

  2. mm

    another approach would be to call Python directly with the system/c interface in R, similar to some functions in this package: http://gking.harvard.edu/readme (look into the ‘undergrad’ function in the prototype.R file).

  3. John Myles White

    This is totally consistent with my own experiences with text processing in R. R just goes oddly slowly when dealing with strings. If you find any solutions, I’d love to hear about them.

  4. X

    Have you tried to contact the authors of tm about this? Maybe they can help.

  5. Trey

    Did you ever make any progress with this? I am facing the same issue, but R is my stronger language.

    1. Michael J Bommarito II

      Hi Trey,
      My subjective experience has been that tm is faster on smaller documents, but still exceedingly slow for larger documents.

      1. Trey

        Thanks for the followup — looks like a multiple language workflow lies ahead.

  6. Nadia Félix

    Hello,

    In tweets, there are many terms that have duplicated characters, for example: loveeeeee. Do you know how to process these terms and replace them with, for example, love?

    Thanks a lot

    1. Michael J Bommarito II

      Hi Nadia,
      You could try to use a stemming algorithm, but, in general, twitter text is very difficult to process for reasons like this.

  7. Michael Smith

    Would be interesting to compare the speed to TextAnalysis.jl, which is written in Julia.

    1. Michael Bommarito

      Great idea. If I get some time this weekend, I’ll update my ancient Julia install and give it a whirl.
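On the question above about elongated words such as "loveeeeee": one common normalization step, applied before stemming, is to collapse long runs of a repeated character with a regular expression. The sketch below is one possible approach under that assumption, not something from the original post, and `squeeze_repeats` is a hypothetical helper name.

```python
import re

# Three or more of the same character in a row, e.g. the "eeeeee" in "loveeeeee".
REPEAT_RE = re.compile(r"(.)\1{2,}")


def squeeze_repeats(text, keep=1):
    """Collapse runs of 3+ identical characters down to `keep` copies."""
    return REPEAT_RE.sub(lambda m: m.group(1) * keep, text)


print(squeeze_repeats("loveeeeee"))          # love
print(squeeze_repeats("sooooo goooood", 2))  # soo good
```

Neither setting is right for every word: keeping one copy would turn "goooood" into "god", while keeping two leaves "soo". In practice the collapsed candidates would still need to be checked against a dictionary, which is part of why Twitter text is hard to process.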
