Generating AWS CloudSearch SDF for Emails

  In my last post on CloudSearch and eDiscovery, I described something like “Google” for eDiscovery emails.  FedEx or Dropbox your data to an eDiscovery service provider like me, and rest assured that you’ll soon have a powerful, web-based user interface for searching and visualizing your digital discovery materials.

  As a technical follow-up to that post, I thought I’d share a proof-of-concept email parser based on the Enron email dataset.  The Python script below takes a directory of RFC 822 email messages and returns an AWS CloudSearch JSON SDF batch with fields drawn from the Date, From, To, and Subject headers and the body of each email.  There is no special handling for attachments or encoding in this example, but it can be used to populate a CloudSearch domain from the Enron emails.  Sample usage is below, and a sample of the output is here.

$ python src/generateSDF.py "data/maildir/allen-p/inbox/*" | curl -X POST -d "@-" --header "Content-Type: application/json" doc-domain_name-domain_id.us-east-1.cloudsearch.amazonaws.com/2011-02-01/documents/batch
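
  For readers who want the gist before the full listing, the core transformation can be sketched roughly as follows.  This is a minimal illustration, not the script itself: the function names (message_to_sdf, generate_sdf), the lowercase field names, the MD5-based document ID, and the timestamp used as the version are all my assumptions for the sketch, and the "type"/"id"/"version"/"lang"/"fields" layout is the 2011-02-01 SDF "add" format as I understand it.  Like the script, it does nothing clever with attachments or encodings.

```python
import email
import glob
import hashlib
import json
import os
import sys
import time


def message_to_sdf(raw_text, version):
    """Turn one RFC 822 message (as a string) into an SDF "add" document."""
    msg = email.message_from_string(raw_text)
    # No attachment handling: keep the payload only when it is a plain string
    # (multipart messages return a list of parts, which we skip here).
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""
    return {
        "type": "add",
        # Hypothetical ID scheme: a lowercase hex digest of the raw message.
        "id": hashlib.md5(raw_text.encode("utf-8")).hexdigest(),
        "version": version,
        "lang": "en",
        "fields": {
            # Assumed field names; they must match the fields configured
            # on the target CloudSearch domain.
            "date": str(msg.get("Date", "")),
            "from": str(msg.get("From", "")),
            "to": str(msg.get("To", "")),
            "subject": str(msg.get("Subject", "")),
            "body": body,
        },
    }


def generate_sdf(pattern):
    """Build a list of SDF documents from every file matching a glob pattern."""
    version = int(time.time())  # assumption: one timestamp version per run
    batch = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue  # skip sub-directories matched by the pattern
        # latin-1 maps every byte to a character, so decoding never fails.
        with open(path, encoding="latin-1") as handle:
            batch.append(message_to_sdf(handle.read(), version))
    return batch


if __name__ == "__main__" and len(sys.argv) > 1:
    json.dump(generate_sdf(sys.argv[1]), sys.stdout)
```

  Saved under a name of your choosing, a sketch like this could be piped to curl in the same way as the command above.  Hashing the raw message is just a convenient way to get a stable ID made only of lowercase letters and digits; any scheme that satisfies the domain's ID rules would do.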

Source code below the break.