Archiving Tweets with Python

Last week, I posted some R code that downloads the user and timestamp of tweets that contain a given hashtag going back as far as Twitter search will allow. As I noted in the post, the text of these tweets isn’t stored because of encoding issues with R and its JSON packages. A few people emailed asking for a version of the code that can archive the tweet text as well, and so I cleaned up my Python code for the task. The code, as posted below the break and on GitHub, supports resuming downloads and only uses standard Python libraries. You should be able to copy the methods and start downloading with just a call like doSearch("#ff") or doSearch("#feb17").
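As a rough illustration of what a call like doSearch("#ff") has to do first, the sketch below builds a search URL using only the standard library. It assumes Python 3; the endpoint and the q, rpp, and max_id parameters are the ones that appear in the log output quoted in the comments, while the name build_search_url and its defaults are hypothetical and are not the code posted on GitHub.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the log output quoted in the comments.
SEARCH_ENDPOINT = "http://search.twitter.com/search.json"


def build_search_url(query, rpp=100, max_id=None):
    """Build a search URL for `query` (hypothetical helper).

    rpp    -- results per page
    max_id -- if given, only tweets with an ID up to this value are requested
    """
    params = [("q", query), ("rpp", rpp)]
    if max_id is not None:
        params.append(("max_id", max_id))
    # urlencode percent-escapes characters such as "#" in hashtags.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


print(build_search_url("#ff"))
print(build_search_url("EXCEL", max_id=71789121567334401))
```

Passing the smallest ID seen so far as max_id on the next call is one way that paging backwards, and resuming an interrupted download, could work; whether the posted code does exactly this is an assumption.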

By bcllc | February 26th, 2011 | Programming | 13 Comments

About the Author: bcllc

13 Comments

Glenn Ferrell 2011-03-02 at 17:04

Michael,
I looked for an RSS feed on your site and couldn’t seem to find one. Do you have one?
(Not interested in an email subscription – just something I can see in a reader.)

Thanx much, Glenn (@glenn_ferrell)
QuantTrader 2011-03-25 at 12:27

Hi mjbommar!

Thank you for this fantastic piece of code. It’s really helpful.

Cheers,

QuantTrader
Daniel Cedeño 2011-05-22 at 21:17

Thanks a lot for this piece of code!!! It’s really helpful!!!
Daniel Cedeño 2011-05-24 at 23:44

Hi, I have a little problem with the code, and I hope you can help me! The first time I ran it, it archived a lot of interesting data. It’s amazing and quick! I then scheduled the code to run every day on a Windows Python 2.7 installation, but I get no additional information …

When I check the search directly in Twitter, I get this message:
{"results":[],"max_id":71789121567334401,"since_id":69250581910388736,"refresh_url":"?since_id=71789121567334401&q=EXCEL","results_per_page":100,"page":1,"completed_in":0.017846,"warning":"adjusted since_id to 69250581910388736 (), requested since_id was older than allowed","since_id_str":"69250581910388736","max_id_str":"71789121567334401","query":"EXCEL"}

Can you help me?

P.S. Sorry for my English… :)
Daniel Cedeño 2011-05-24 at 23:59

Sorry, I forgot to mention that:

In IDLE, when I run the code to test it, I see this message:

doSearch: !nextPage, maxID=71789121567334401
{'q': 'EXCEL', 'rpp': 100, 'max_id': 71789121567334401L}
doQuery: Fetching http://search.twitter.com/search.json?q=EXCEL&rpp=100&max_id=71789121567334401
len(tweets) = 1 => breaking.
Wilson Kiw 2011-06-28 at 08:52

Great post. Is there similar functionality for Facebook data, to see if people are mentioning a term in their status? I’ve visualised some Twitter data using your technique here: http://www.tips-for-excel.com/twitter-data/
It would be great if I could add Facebook data and then monitor trends across both.
Mark Antonio 2011-12-04 at 08:50

Thanks for the code.
Just wondering, how would I keep the data as JSON and not export it to Excel?
- mjbommar 2011-12-04 at 22:46
  
  Hi Mark,
  The doQuery() method contains the bulk of the code necessary to only store the JSON. Take everything from there up to the json.load method.
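A minimal sketch of that idea, assuming Python 3 and only the standard library: fetch a page and append it to a file as JSON instead of converting it to CSV. The names fetch_raw and append_raw_page and the one-page-per-line file layout are hypothetical and are not part of the original code; the "results" key is the one visible in the search response quoted earlier in the comments.

```python
import json
import urllib.request


def fetch_raw(url):
    """Return the response body as text, without parsing it."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def append_raw_page(raw, path):
    """Append one page of JSON to `path`, one page per line.

    Returns the number of tweets found under the "results" key.
    """
    # Parse only to validate the text and to collapse it onto a single line.
    page = json.loads(raw)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(page, ensure_ascii=False) + "\n")
    return len(page.get("results", []))
```

Each line of the output file can later be read back with json.loads, so the tweet text is kept exactly as it was returned.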
GEORGE 2012-01-30 at 00:29

Hi Michael, great post. It seems that you used the ggplot2 library to make the graph, so I wonder how you put the users in increasing order. I’ve tried to do it, but it doesn’t work.
Sorry for my English.
- Michael J Bommarito II 2012-01-30 at 00:45
  
  Hi George,
  I’m sure there are many ways, but here’s how I did it:
  # Now build the table of most frequent tweeters
  numTop <- 30
  userFrequency <- arrange(as.data.frame(table(tweets$user)), -Freq)
  names(userFrequency) <- c("Name", "Freq")
  userFrequencyTop <- userFrequency[1:numTop, ]
  userFrequencyTop$Name <- factor(userFrequencyTop$Name, levels=userFrequencyTop$Name)
Ed 2012-05-23 at 15:43

Forgive my novice-ness. I’ve run the code a few times trying to get as many tweets as possible. When I originally ran it, I only got about 100, so I changed rpp to 10000. This still only yielded about 1,970 tweets. I am looking at the #nonato hashtag and would like to get all the tweets with it for the past week. How can I modify the code to do this? Or is it being limited by Twitter’s search function?
- Michael J Bommarito II 2012-05-23 at 16:22
  
  Hi Ed,
  There are a few things to note: 1) The code will not fetch tweets that are newer than the largest ID you have already retrieved. It’s possible you got unlucky and Twitter gave you an old ID to start. You can create a new tweets.csv file and merge the two later. 2) Twitter’s search results are inconsistent, but you should be able to go further back and get more than that. I updated this data yesterday and had around 150k tweets from May 12 through then.
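Merging two archives afterwards can be sketched as follows, assuming Python 3, no header row, and a numeric tweet ID in the first column of each tweets.csv. Those layout details and the name merge_tweet_files are assumptions made for illustration; adjust id_column to match the real files.

```python
import csv


def merge_tweet_files(paths, out_path, id_column=0):
    """Merge CSV archives into `out_path`, keeping one row per tweet ID.

    Returns the number of rows written.
    """
    seen = set()
    merged = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle):
                if not row:
                    continue  # skip blank lines
                tweet_id = row[id_column]
                if tweet_id in seen:
                    continue  # already present from an earlier file
                seen.add(tweet_id)
                merged.append(row)
    # Newest first: larger IDs were assigned later.
    merged.sort(key=lambda row: int(row[id_column]), reverse=True)
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(merged)
    return len(merged)
```

A call such as merge_tweet_files(["tweets_old.csv", "tweets_new.csv"], "tweets.csv") would then produce a single de-duplicated archive; the file names here are placeholders.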
Gagandeep Singh 2014-08-18 at 01:37

Hi Michael,

I want to extract historical tweets from, let’s say, Jan 1 to March 31. Is that possible? Can you guide me on how to modify the above code to do this?
