eDiscovery Consulting in the Cloud: Searching an Outlook mailbox and attachments

You may have noticed that I keep talking about eDiscovery consulting and legal search in the cloud. I’ve covered searching the Supreme Court with new technologies in analytics and the cloud, making certain types of emails searchable on Amazon’s cloud, and even eDiscovery and the cloud at a high level. While these posts are all great teasers, you might be left wondering whether there are mature technologies that address the day-to-day needs of real firms.

I’d like to address that concern today by presenting a real-world case study on making Outlook PST mail and attachments discoverable. The software and process I will describe are fully implemented and tested. Nothing in this post is vaporware, and the entire process is scalable from a single mailbox up to an entire business. I’ll skip the usual discussion of the EDRM model and get straight to business. You have received Outlook mailboxes for 100 users. You need to search these mailboxes, including their attachments, for usages of a certain phrase or conversations between specific people. How do you do this quickly and painlessly without breaking your existing IT staff and budget?

In this example, I’ll be using 7 Outlook PST files shown below from the Enron email dataset. These mailboxes total 1.3 GB and contain email text as well as attachments in a variety of formats like PDF and Office. As such, they are a typical cross-section of corporate email.

To convert this folder into a searchable database, we follow this process:

Examine each file in the folder and determine whether we understand the format. In this case, all 7 files are PST files, which we can process.
For each of these mailboxes, we then build a list of all emails, storing information about who sent the message, who received it, what they said, etc.
For each email, we also identify any attachments. If we understand the attachment format, we also extract and store the textual content of the attachment and associate it with the related email conversation. Supported textual formats include:
1. MS Office 97 – 2010 Documents from Word, Excel, PowerPoint, etc.
2. Adobe PDF
3. OpenOffice Documents
4. Web pages
5. Plain or rich-text documents
Regardless of whether we can extract text from an attachment, we save the file to a folder on our computer where we can later examine it.
We transmit all of this data to Amazon CloudSearch, which handles indexing the textual content, as well as facets like from and to addresses.
Once CloudSearch finishes building the index, we are ready to search!
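
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the software described in this post. It assumes the emails have already been parsed out of each PST by some other tool into plain dictionaries (the dictionary keys and the index field names such as from_address and attachment_text are hypothetical), and it assumes the JSON document batch format from the 2013-01-01 CloudSearch API, which may differ from the API version used at the time of writing. The only format check shown is the four-byte "!BDN" signature at the start of a PST file.

```python
import hashlib
import json
from pathlib import Path

PST_MAGIC = b"!BDN"  # first four bytes of an Outlook PST file


def is_pst(path):
    """Return True if the file starts with the PST signature."""
    with open(path, "rb") as fh:
        return fh.read(4) == PST_MAGIC


def find_mailboxes(folder):
    """Step 1: keep only the files whose format we understand (PST here)."""
    return sorted(p for p in Path(folder).iterdir() if p.is_file() and is_pst(p))


def to_cloudsearch_doc(email):
    """Turn one parsed email (hypothetical dict shape) into an 'add' operation."""
    # A stable ID derived from mailbox and message ID, so re-uploading the
    # same email replaces the earlier copy instead of duplicating it.
    doc_id = hashlib.sha1(
        (email["mailbox"] + "|" + email["message_id"]).encode("utf-8")
    ).hexdigest()
    attachments = email.get("attachments", [])
    fields = {
        "from_address": email["from"],
        "to_addresses": email["to"],
        "subject": email["subject"],
        "body": email["body"],
        # Text is present only for attachment formats we could extract.
        "attachment_text": " ".join(a["text"] for a in attachments if a.get("text")),
        "attachment_names": [a["name"] for a in attachments],
    }
    # Drop empty values so the batch carries only populated fields.
    fields = {name: value for name, value in fields.items() if value}
    return {"type": "add", "id": doc_id, "fields": fields}


def build_batch(emails):
    """Serialize a list of parsed emails as one JSON document batch."""
    return json.dumps([to_cloudsearch_doc(e) for e in emails])
```

The resulting JSON string is what would be uploaded to the search domain's document endpoint. Saving every attachment to a local folder, including those with no extractable text, happens separately and is not shown here.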

We have taken those 7 Outlook mailboxes and converted them into:

A search interface for all textual email and attachment content.
A folder of attachment files embedded in all emails, such as images, audio, videos, or other non-textual content.
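
To make the search side concrete, here is a hedged sketch of how the two questions from the opening scenario (a certain phrase, or conversations between specific people) could be expressed as a CloudSearch query. It assumes the structured query parser from the 2013-01-01 API; the field names from_address and to_addresses and the endpoint passed to search_url are hypothetical and would have to match the actual index configuration.

```python
from urllib.parse import urlencode


def quote_term(value):
    """Wrap a value in single quotes, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_query(phrase=None, sender=None, recipient=None):
    """Combine the optional criteria into one structured query string."""
    clauses = []
    if phrase:
        clauses.append(f"(phrase {quote_term(phrase)})")
    if sender:
        clauses.append(f"(term field=from_address {quote_term(sender)})")
    if recipient:
        clauses.append(f"(term field=to_addresses {quote_term(recipient)})")
    if not clauses:
        return "matchall"  # no criteria: match every document
    if len(clauses) == 1:
        return clauses[0]
    return "(and " + " ".join(clauses) + ")"


def search_url(endpoint, phrase=None, sender=None, recipient=None):
    """Build the GET URL for a search request (endpoint is an assumption)."""
    params = {
        "q": build_query(phrase, sender, recipient),
        "q.parser": "structured",
        "size": 10,
    }
    return f"https://{endpoint}/2013-01-01/search?" + urlencode(params)
```

For example, build_query(phrase="natural gas", sender="a@enron.com") combines a phrase clause and a sender clause under a single and expression, so only emails matching both are returned.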

The entire data set is processed on a laptop in under an hour, and search results and interfaces are available over the web almost immediately thereafter. If you’re interested in pricing or more information for eDiscovery consulting services like these, please contact us!

One Comment

Mark Kerzner 2012-05-20 at 23:16 - Reply

Hi, Michael,

That’s pretty much the same as my earlier idea of using Google Desktop Search (now defunct) for a quick-and-dirty search implementation. I called it DiscoverLite, even had a trademark on that. Even the web site.

However, now I have implemented the real solution – also on EC2, but with complete processing and search, not with Amazon indexing services. Based on Hadoop. See also here, http://freeeed.org/

I would be interested in your feedback.

Thank you.

About the Author: bcllc
