I’ve posted three examples of Twitter hashtag datasets in the last week: one on China, one on Iran, and one on Algeria. To build these datasets, I needed to obtain older tweets, which is slightly more involved than simply filtering the streaming feed for your hashtag of choice. The original code I wrote for this task is in Python and is well parallelized, but it isn’t commented and looks more complicated than it is because of the parallelization choices.
As part of my recent exercise in tackling entire tasks in R rather than Python, I decided to rewrite this code in R tonight. The result is fairly simple and well commented, and consists of two functions, loadTag and downloadTag, both of which are embedded in the script below the break.
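For orientation, here is a rough sketch of what that pair of functions might look like. It assumes the unauthenticated v1 search API at search.twitter.com (long since retired) and the result field names it returned (id_str, created_at, from_user), and it glosses over the file handling and parallel-safety of the real script, so treat it as an outline rather than the actual code below the break.

library(RCurl)   # getURL()
library(rjson)   # fromJSON()

downloadTag <- function(tag, pages = 15, rpp = 100) {
  rows   <- list()
  max_id <- NULL
  for (i in seq_len(pages)) {
    url <- sprintf("http://search.twitter.com/search.json?q=%%23%s&rpp=%d",
                   tag, as.integer(rpp))
    if (!is.null(max_id)) url <- paste(url, "&max_id=", max_id, sep = "")
    page <- fromJSON(getURL(url))
    if (length(page$results) == 0) break
    # keep only the fields that survive the encoding problems described below
    rows <- c(rows, lapply(page$results, function(r)
      data.frame(id = r$id_str, date = r$created_at, user = r$from_user,
                 stringsAsFactors = FALSE)))
    # page backwards: the next request asks for tweets no newer than the oldest seen
    max_id <- page$results[[length(page$results)]]$id_str
  }
  if (length(rows) == 0) return(NULL)
  tweets <- do.call(rbind, rows)
  tweets[!duplicated(tweets$id), ]   # max_id is inclusive, so drop the overlap
}

loadTag <- function(tag) {
  # the real script reads back a per-tag file written by downloadTag();
  # here we simply call it directly for illustration
  downloadTag(tag)
}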
There is one significant issue with the code, however. At the moment, neither rjson nor RJSONIO seems to support Unicode data in JSON responses. Furthermore, when character vectors of "unknown" encoding are written to file with a function like write.table, they produce output that cannot be reliably read back into R. As a result, the code below does not retain the text of a tweet, only the id, date, and username. The script can easily be modified to capture other JSON variables by changing the anonymous function on line 70.
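As a hypothetical illustration of that change (the field names below follow the old search API and are assumptions, not the script’s actual variable names): if the per-tweet extractor looks roughly like the first anonymous function here, keeping an extra JSON variable, such as the client string in source, is a one-element addition.

# assumed current shape: keep only id, date, and username
extract <- function(tweet) c(id   = tweet$id_str,
                             date = tweet$created_at,
                             user = tweet$from_user)

# modified: also keep the client the tweet was posted from
extract <- function(tweet) c(id     = tweet$id_str,
                             date   = tweet$created_at,
                             user   = tweet$from_user,
                             source = tweet$source)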
Once you’ve downloaded some data, producing figures like the ones in the posts above takes only two lines: tweets <- loadTag(tag) and ggplot(data=tweets, aes(x=as.POSIXct(date))) + geom_bar(aes(fill=..count..), binwidth=60*5). Here’s the current figure of 5-minute frequencies since the 20th (the x-axis is EST, unlike the previous post, where it was UTC; pass tz="UTC" to as.POSIXct to change this).
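Spelled out with the optional UTC axis applied (the tag name here is made up):

library(ggplot2)

tweets <- loadTag("feb20")   # hypothetical tag name
ggplot(data = tweets, aes(x = as.POSIXct(date, tz = "UTC"))) +
  geom_bar(aes(fill = ..count..), binwidth = 60 * 5)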
If you’ve got working Python code, why not use RSPython and drive the R code as a server from your Python main program?
I would normally use RPy2 or just run separate interpreters in sequence, but I’ve been challenging myself to work within the limits of R lately. Python is clearly the better language for doing this natively :)
P.S.: I’ve got working Perl code that will grab tweets via Twitter search and flatten the JSON objects into a CSV file, if you’re interested. It’s on GitHub at
https://github.com/znmeb/Project-Kipling/blob/master/Desktop/Demos/History/History.pl
There is an R package that might be useful:
http://cran.r-project.org/web/packages/twitteR/index.html
From what I saw, twitteR isn’t capable of using the max_id/since_id parameters, which are the only reasonable way to obtain historical tweets.
Check out http://www.infochimps.com for historical Twitter data (billions of tweets, millions of users), with a free API for up to 100,000 calls a month.
I am trying to use your code to obtain hashtag tweets older than five days, but I am only able to get tweets from the last five days at most. I am using the code exactly as shown here. Is it possible to get older tweets? What am I missing? I would appreciate your help.