You may have noticed that I keep talking about eDiscovery consulting and legal search in the cloud. I’ve covered searching the Supreme Court with new technologies in analytics and the cloud, making certain types of emails searchable on Amazon’s cloud, and even eDiscovery and the cloud at a high level. While these posts are all great teasers, you might be left wondering whether there are mature technologies that address the day-to-day needs of real firms.
I’d like to address that concern today by presenting a real-world case study on making Outlook PST mail and attachments discoverable. The software and process I will describe are fully implemented and tested. Nothing in this post is vaporware, and the entire process is scalable from a single mailbox up to an entire business. I’ll skip the usual discussion of the EDRM model and get straight to business. You have received Outlook mailboxes for 100 users. You need to search these mailboxes, including their attachments, for usages of a certain phrase or conversations between specific people. How do you do this quickly and painlessly without breaking your existing IT staff and budget?
In this example, I’ll be using 7 Outlook PST files from the Enron email dataset. These mailboxes total 1.3GB and contain email text as well as attachments in a variety of formats like PDF and Office. As such, they are a typical cross-section of corporate email.
To convert this folder into a searchable database, we follow this process:
- Examine each file in the folder and determine whether we understand the format. In this case, all 7 files are PST files, which we can process.
- For each of these mailboxes, we then build a list of all emails, storing information about who sent the message, who received it, what they said, etc.
- For each email, we also identify any attachments. If we understand the attachment format, we also extract and store the textual content of the attachment and associate it with the related email conversation. Supported textual formats include:
  - MS Office 97 – 2010 Documents from Word, Excel, PowerPoint, etc.
  - Adobe PDF
  - OpenOffice Documents
  - Web pages
  - Plain or rich-text documents
- Regardless of whether we can extract text from an attachment, we save the file to a folder on our computer where we can later examine it.
- We transmit all of this data to Amazon CloudSearch, which handles indexing the textual content, as well as facets like from and to addresses.
- Once CloudSearch finishes building the index, we are ready to search!
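To make the steps above a little more concrete, here is a minimal Python sketch of two of them: checking whether a file looks like a PST, and turning one parsed email into an "add" operation for a CloudSearch document batch. This is not the software described in this post. The function names, the shape of the `email` dictionary, and the index field names (`sender`, `recipients`, `attachment_text`, and so on) are all assumptions made for illustration, and the PST parsing and the upload itself are left out. The batch layout follows the JSON format CloudSearch currently documents (`type`, `id`, `fields`), so check it against the API version of your own search domain.

```python
import hashlib
import json

# PST (and OST) files begin with the four bytes "!BDN".
PST_MAGIC = b"!BDN"


def is_pst(path):
    """Return True if the file starts with the PST magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(4) == PST_MAGIC


def to_batch_document(email):
    """Convert one parsed email (a plain dict, hypothetical shape) to an 'add' operation."""
    # Derive a stable document id from the mailbox name and message id.
    key = "{}|{}".format(email["mailbox"], email["message_id"])
    doc_id = hashlib.sha1(key.encode("utf-8")).hexdigest()

    # Keep the extracted attachment text with the email so a single
    # query covers both the message and its attachments.
    attachment_text = "\n".join(
        a["text"] for a in email.get("attachments", []) if a.get("text")
    )

    return {
        "type": "add",
        "id": doc_id,
        "fields": {
            "sender": email["sender"],          # candidate facet
            "recipients": email["recipients"],  # candidate facet (list of addresses)
            "subject": email["subject"],
            "body": email["body"],
            "attachment_text": attachment_text,
        },
    }


def build_batch(emails):
    """Serialize a list of parsed emails as one JSON document batch."""
    return json.dumps([to_batch_document(e) for e in emails])
```

Attachments whose text could not be extracted simply contribute nothing to `attachment_text`; the files themselves would still be saved to disk as described in the list above.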
Hi, Michael,
That’s pretty much the same as my earlier idea of using Google Desktop Search (now defunct) for a quick-and-dirty search implementation. I called it DiscoverLite, even had a trademark on that. Even the web site.
However, now I have implemented the real solution – also on EC2, but with complete processing and search, not with Amazon indexing services. Based on Hadoop. See also here, http://freeeed.org/
I would be interested in your feedback.
Thank you.