Revisiting text processing with R and Python

  Back in 2011, I compared the performance of the most popular text processing libraries in R and Python. In case you can’t guess the answer, Python and NLTK won by a significant margin over R and tm. Text processing with R seemed simple on paper, but performance and flexibility limitations have kept me away ever since, except for very small corpora.

  Since then, R has garnered a huge amount of attention from a growing community of enterprise and academic users.  In 2011, the only mature text processing package was tm; now, with more and more big-name vendors like Oracle and HP piling marketing dollars into the language as a platform for big data analytics, you’d hope that the state of affairs would have improved.

  Sadly, things have not come far enough to make R practical for many tasks. tm is still the most commonly used package, and much of the new work in text processing and natural language processing has built on tm (see the reverse dependency lists on the tm CRAN page). If all you need are simple tokenization functions, the tau package does provide basic, efficient utilities (although they simply wrap the built-in R regular expression methods).
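  For a sense of what regex-backed tokenization amounts to, here is a minimal Python sketch. It is not tau's implementation; the token pattern and the function names `tokenize` and `term_counts` are my own assumptions, chosen only for illustration.

```python
import re
from collections import Counter

# Assumed token pattern: runs of ASCII letters, digits, or apostrophes.
# Real corpora usually need something more careful (Unicode, hyphens, etc.).
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and return the regex-matched tokens in order."""
    return TOKEN_RE.findall(text.lower())


def term_counts(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))
```

  Calling `tokenize("R and Python, revisited!")` yields the four lowercased words with the punctuation dropped. Anything beyond this (stemming, stop words, n-grams) is where a dedicated library starts to earn its keep.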

  If you can live with rJava, you have a few more options: RWeka and openNLP provide access to Weka and Apache OpenNLP, respectively, via JNI. However, in my experience, the system constraints around rJava ( ambiguity, `env` sandboxing, architecture mismatches) and the cost of copying data between R and the JVM over JNI haven’t made the effort worthwhile.

  At the end of the day, those of us working with large text corpora still need dual-language workflows to process text prior to classifying or learning in R.  R and its packages make the latter half of this work much easier, but my gut instinct is that scikit-learn, pylab, and pandas make Python a better single-language solution for most problems today.
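  One way the Python half of such a dual-language workflow might look is sketched below: tokenize the documents, build a small document-term matrix, and write it to CSV for R to pick up. The function names, the token pattern, and the `doc_id` column are assumptions made for this example, not a description of any particular pipeline.

```python
import csv
import re
from collections import Counter

# Assumed token pattern: lowercase ASCII letters and digits only, which also
# keeps the resulting column names simple.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def doc_term_rows(docs):
    """Return (vocab, rows): a sorted vocabulary and one count row per document."""
    counts = [Counter(TOKEN_RE.findall(doc.lower())) for doc in docs]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for missing terms, so every row has len(vocab) entries.
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


def write_doc_term_csv(docs, path):
    """Write the document-term matrix to a CSV file with a header row."""
    vocab, rows = doc_term_rows(docs)
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["doc_id"] + vocab)
        for doc_id, row in enumerate(rows, start=1):
            writer.writerow([doc_id] + row)
```

  After something like `write_doc_term_csv(docs, "dtm.csv")`, the file can be loaded on the R side with `read.csv("dtm.csv")` and handed to whatever classifier you prefer. A dense CSV only makes sense for small vocabularies; a large corpus would call for a sparse format instead.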