
Archiving Tweets with Python

  Last week, I posted some R code that downloads the user and timestamp of tweets that contain a given hashtag going back as far as Twitter search will allow.  As I noted in the post, the text of these tweets isn’t stored because of encoding issues with R and its JSON packages.  A few people emailed asking for a version of the code that can archive the tweet text as well, and so I cleaned up my Python code for the task.  The code, as posted below the break and on GitHub, supports resuming downloads and only uses standard Python libraries. You should be able to copy the methods and start downloading with just a call like doSearch("#ff") or doSearch("#feb17").

12 Comments

  1. Glenn Ferrell

    Michael,
    I looked for an RSS feed on your site and couldn’t find one. Do you have one?
    (Not interested in email subscription – just something I can see in a reader.)

    Thanks much, Glenn (@glenn_ferrell)

  2. QuantTrader

    Hi mjbommar!

    Thank you for this fantastic piece of code. It’s really helpful.

    Cheers,

    QuantTrader

  3. Daniel Cedeño

    Thanks a lot for this piece of code! It’s really helpful!

  4. Daniel Cedeño

    Hi, I have a little problem with the code and hope you can help me. The first time I ran it, it archived a lot of interesting data, and it was amazingly quick. I then scheduled the code to run every day on a Windows Python 2.7 installation, but the later runs return no additional tweets.

    When I run the search directly against Twitter, I get this response: {"results":[],"max_id":71789121567334401,"since_id":69250581910388736,"refresh_url":"?since_id=71789121567334401&q=EXCEL","results_per_page":100,"page":1,"completed_in":0.017846,"warning":"adjusted since_id to 69250581910388736 (), requested since_id was older than allowed","since_id_str":"69250581910388736","max_id_str":"71789121567334401","query":"EXCEL"}

    Can you help me?

    P.S. Sorry for my English… :)
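
A response like the one quoted above can be checked programmatically before deciding whether a run really found nothing. The following is a minimal Python 3 sketch, not part of the posted code: the function name `summarize_search_response` is hypothetical, and the sample body is a shortened copy of the response in this comment.

```python
import json


def summarize_search_response(body):
    """Return (number_of_tweets, warning_or_None) for a search JSON body.

    Hypothetical helper for illustration; it only relies on the
    "results" and "warning" keys visible in the response quoted above.
    """
    data = json.loads(body)
    results = data.get("results", [])
    return len(results), data.get("warning")


# Shortened copy of the response quoted in the comment above.
sample = (
    '{"results":[],"max_id":71789121567334401,'
    '"since_id":69250581910388736,'
    '"warning":"adjusted since_id to 69250581910388736 (), '
    'requested since_id was older than allowed",'
    '"query":"EXCEL"}'
)

count, warning = summarize_search_response(sample)
if count == 0 and warning:
    print("No tweets returned; the server reported:", warning)
```

Here the empty `results` list arrives together with a `warning` field, which suggests the server adjusted the requested `since_id` rather than the query itself failing.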

  5. Daniel Cedeño

    Sorry, I forgot to mention this:

    When I run the code in IDLE to test it, I see this output:

    doSearch: !nextPage, maxID=71789121567334401
    {'q': 'EXCEL', 'rpp': 100, 'max_id': 71789121567334401L}
    doQuery: Fetching http://search.twitter.com/search.json?q=EXCEL&rpp=100&max_id=71789121567334401
    len(tweets) = 1 => breaking.
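
The log above shows a parameter dictionary being turned into a query URL. A rough Python 3 sketch of that step, assuming the standard library's `urlencode` is acceptable, could look like the following. The names `SEARCH_URL` and `build_search_url` are hypothetical and are not taken from the posted code, and the sketch only builds the string without making any network request.

```python
from urllib.parse import urlencode

# Endpoint exactly as it appears in the log output above.
SEARCH_URL = "http://search.twitter.com/search.json"


def build_search_url(params):
    """Join the endpoint and URL-encoded parameters (hypothetical helper)."""
    return SEARCH_URL + "?" + urlencode(params)


# Same parameters as in the log; Python 3 integers need no trailing "L".
url = build_search_url({"q": "EXCEL", "rpp": 100, "max_id": 71789121567334401})
print(url)

# Characters such as "#" are percent-encoded, so hashtags are safe to pass.
hashtag_url = build_search_url({"q": "#ff", "rpp": 100})
print(hashtag_url)
```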

  6. Wilson Kiw

    Great post. Is there similar functionality for Facebook data, to see whether people are mentioning a term in their statuses? I’ve visualised some Twitter data using your technique here: http://www.tips-for-excel.com/twitter-data/
    Would be great if I could add Facebook data and then monitor trends across both.

  7. Mark Antonio

    Thanks for the code.
    Just wondering: how would I keep the data as JSON instead of exporting it to Excel?

    1. mjbommar

      Hi Mark,
      The doQuery() method contains the bulk of the code necessary to store only the JSON. Take everything from there up to the json.load call.
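
As a rough illustration of keeping the raw JSON, here is a minimal Python 3 sketch that validates a response body and writes it to disk. It is not the posted doQuery() code: the helper name `save_raw_json`, the file name `page1.json`, and the sample body are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_raw_json(body, path):
    """Parse a response body and write it back out as UTF-8 JSON.

    Hypothetical helper; json.loads raises ValueError on malformed input,
    so a broken response is never written to disk.
    """
    data = json.loads(body)
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII tweet text readable in the file.
        json.dump(data, handle, ensure_ascii=False)
    return data


# Made-up sample body with one non-ASCII character in the tweet text.
sample = '{"results": [{"id": 1, "text": "caf\\u00e9"}], "query": "EXCEL"}'
path = os.path.join(tempfile.mkdtemp(), "page1.json")
saved = save_raw_json(sample, path)

with open(path, encoding="utf-8") as handle:
    reloaded = json.load(handle)
print(reloaded["results"][0]["text"])
```

Writing one file per fetched page keeps each response intact, at the cost of having to combine the pages later.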

  8. GEORGE

    Hi Michael, great post. It seems that you used the ggplot2 library to make the graph, so I wonder how you got the users ordered by frequency. I’ve tried to do it, but it doesn’t work.
    Sorry for my English.

    1. Michael J Bommarito II

      Hi George,
      I’m sure there are many ways, but here’s how I did it:
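      # arrange() is assumed here to come from the plyr package
      library(plyr)

      # Now build the table of most frequent tweeters
      numTop <- 30

      # Count tweets per user and sort by descending frequency
      userFrequency <- arrange(as.data.frame(table(tweets$user)), -Freq)
      names(userFrequency) <- c("Name", "Freq")
      userFrequencyTop <- userFrequency[1:numTop, ]
      # Set the factor levels in sorted order so the plot keeps this ordering
      userFrequencyTop$Name <- factor(userFrequencyTop$Name, levels=userFrequencyTop$Name)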

  9. Ed

    Forgive my novice-ness. I’ve run the code a few times trying to get as many tweets as possible. When I originally ran it, I only got about 100, so I changed rpp to 10000. This still only yielded about 1,970 tweets. I am looking at the #nonato hashtag and would like to get all the tweets containing it from the past week. How can I modify the code to do this? Or is it being limited by Twitter’s search function?

    1. Michael J Bommarito II

      Hi Ed,
      There are a few things to note: 1) The code will not fetch tweets that are newer than the largest ID you have already retrieved. It’s possible you got unlucky and Twitter gave you an old ID to start. You can start a new tweets.csv file and merge the two later. 2) Twitter’s search results are inconsistent, but you should be able to go further back and get more than that. I updated this data yesterday and had around 150k tweets from May 12 through then.
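
Merging two archives mostly comes down to dropping duplicate tweet IDs. The sketch below is one possible Python 3 approach rather than the author's method. The column layout of tweets.csv is not shown in this post, so the example assumes the tweet ID is in the first column, and the helper name `merge_tweet_rows` and the sample rows are made up.

```python
def merge_tweet_rows(old_rows, new_rows, id_column=0):
    """Combine two lists of rows, keeping one row per tweet ID.

    Hypothetical helper; assumes each row is a list of strings (for
    example, as produced by csv.reader) with the tweet ID at id_column.
    """
    merged = {}
    for row in list(old_rows) + list(new_rows):
        # A later row with the same ID replaces the earlier one.
        merged[int(row[id_column])] = row
    # Return rows ordered by ascending tweet ID.
    return [merged[tweet_id] for tweet_id in sorted(merged)]


# Made-up rows in the assumed (id, user, text) layout.
old = [["100", "alice", "first"], ["101", "bob", "second"]]
new = [["101", "bob", "second"], ["105", "carol", "third"]]

combined = merge_tweet_rows(old, new)
for row in combined:
    print(row)
```

The rows for each file could be read with `csv.reader` and the merged result written back with `csv.writer`, adjusting `id_column` to match the real file.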
