Building an AWS CloudSearch Domain for the Supreme Court


It should be pretty clear by now that two things I'm very interested in are cloud computing and legal informatics. What better way to show it than to put together a simple AWS CloudSearch tutorial using Supreme Court decisions as the context? The steps below should take you through creating a fully functional search domain on AWS CloudSearch for Supreme Court decisions.

Our first step is to acquire a public domain copy of Supreme Court decisions from Carl Malamud's resource.org. You can navigate to this directory and download US.tar.bz2, or just run something like:

$ wget http://bulk.resource.org/courts.gov/c/US.tar.bz2
Download Supreme Court decisions from resource.org.

Once the download is done, extract the archive:

$ tar xjf US.tar.bz2
Extract the archive.

We should now have a directory called US containing 62,839 files and about 1.1GB of data. Let's assume that you put this directory under something like /data/courts/US.
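
Before moving on, it's worth a quick sanity check that the extraction completed; the size and file count should roughly match the figures above:

```shell
# Sanity-check the extracted archive; the totals should roughly match
# 1.1GB and 62,839 files (adjust the path if you extracted elsewhere).
du -sh /data/courts/US
find /data/courts/US -type f | wc -l
```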

The next step is easy: go follow my guide on setting up the CloudSearch command line tools! I'll assume that you placed everything under /opt/aws/cloud-search-tools, just as in that post.

OK, we should now have a dataset and the CloudSearch API at our fingertips. It's time to create a CloudSearch "domain" that we can populate with records. To do so, you can either follow the instructions in your AWS Management Console or run the following:

$ /opt/aws/cloud-search-tools/bin/cs-create-domain -d scotus
Create a new CloudSearch domain named 'scotus'.

The domain may take a while to create; sometimes up to 15 minutes. Go grab a coffee or a beer and read your feed while you wait. You can check the status either through the Management Console in your browser or with the following line:

$ /opt/aws/cloud-search-tools/bin/cs-describe-domain -d scotus
Check the status of the CloudSearch domain.
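
If you'd rather not re-run that by hand, a rough polling loop works too. This is just a sketch; it assumes the word ACTIVE appears in the cs-describe-domain output once the domain is ready, so check your tool's actual output format:

```shell
# Poll every 60 seconds until the domain reports ACTIVE
# (assumes "ACTIVE" appears in cs-describe-domain's output).
until /opt/aws/cloud-search-tools/bin/cs-describe-domain -d scotus | grep -q ACTIVE; do
  echo "still creating, waiting..."
  sleep 60
done
```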

Once this step is complete, you should see an ACTIVE domain with 0 documents. We now need to reconfigure the access policies so that the domain allows us to submit documents and allows anyone to search. Replace IP_ADDRESS below with your workstation's public IP:

$ /opt/aws/cloud-search-tools/bin/cs-configure-access-policies -d scotus --update --allow IP_ADDRESS --service doc
$ /opt/aws/cloud-search-tools/bin/cs-configure-access-policies -d scotus --update --allow all --service search
Configure access policies to allow document submission and public search.

This policy change may take a few minutes to go into effect.

Lastly, we need to tell the domain which fields to index for each document.

$ /opt/aws/cloud-search-tools/bin/cs-configure-fields -d scotus --name title --type text --option result
$ /opt/aws/cloud-search-tools/bin/cs-configure-fields -d scotus --name content --type text --option result
Configure the index fields for the domain.

OK, we're ready to go! At this point, we need to generate Search Data Format (SDF) files to populate the domain. There are two approaches we can take: (1) Write a parser to extract exactly the text content and metadata we want, or (2) Throw the pre-packaged cs-generate-sdf utility at our data and hope for the best.

For brevity's sake, we'll pursue option 2. After some poking around, I've found that cs-generate-sdf is based on a common open-source content extraction library – Apache Tika. You might be familiar with Tika, as it's the guts behind Solr's ability to ingest unstructured data. So if you'd be happy naively ingesting the content in Solr, you'll probably be happy with the results that cs-generate-sdf produces.

While we could build something more complex, let's stick to bash here:

$ for d in $(find /data/courts/US/ -type d); do
    /opt/aws/cloud-search-tools/bin/cs-generate-sdf --source "$d/*.html" -d scotus
  done
Generate SDF files and upload them to CloudSearch for each directory of court decisions.

A few things to note: If you see error messages like "Request forbidden by administrative rules" or "403 Forbidden", your access policies have not taken effect or you provided the wrong IP for the document service. You should see lots of lines go by; two for every file that is being parsed. This step can be parallelized, but will almost certainly be disk-bound unless you are running on some kind of RAID or NAS setup that allows for concurrent reads.
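
If your storage can keep up, a parallel variant of the loop above is straightforward with xargs; this sketch assumes four concurrent cs-generate-sdf processes:

```shell
# Run cs-generate-sdf over directories in parallel, 4 workers at a time
# (-P controls concurrency; -I {} substitutes each directory path).
find /data/courts/US/ -type d | \
  xargs -P 4 -I {} /opt/aws/cloud-search-tools/bin/cs-generate-sdf --source "{}/*.html" -d scotus
```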

This could take a while; about 45 minutes to generate and transmit on my i7 2600k/32GB RAM/SATA III SSD workstation. You should grab another coffee or beer and watch a show.

Another caveat: even after you've transmitted all data up to the cloud, it will still take some time for the CloudSearch domain to churn through the data and complete indexing.

Once the CloudSearch domain is fully indexed, it's time to figure out how to search. The best way to do this is, sadly, to read the developer documentation. However, if you want to skip all the boring parts, just try running something like this:

$ curl 'http://search-scotus-domain_id.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q="clear%20and%20present%20danger"&return-fields=title'
Search the CloudSearch domain for the phrase 'clear and present danger'.

This search looks for an exact phrase match on "clear and present danger" and returns not only the document ID, but also the title property of the document. You should get back something like this:

{
  "rank": "-text_relevance",
  "match-expr": "(label '\"clear and present danger\"')",
  "hits": {
    "found": 100,
    "start": 0,
    "hit": [
      {"id": "d__data_courts_us_395_395_us_444_492_html", "data": {"title": ["395 U.S. 444"]}},
      {"id": "d__data_courts_us_343_343_us_946_326_html", "data": {"title": ["343 U.S. 946"]}},
      {"id": "d__data_courts_us_341_341_us_494_336_html", "data": {"title": ["341 U.S. 494"]}}
    ]
  }
}
Sample JSON response from the CloudSearch query, showing matching Supreme Court cases.
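
If you just want the case titles out of that response, piping the JSON through a short Python one-liner does the trick (jq would work just as well); the URL here uses the same domain placeholder as above:

```shell
# Extract the document IDs and titles from the search response.
curl -s 'http://search-scotus-domain_id.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q="clear%20and%20present%20danger"&return-fields=title' \
  | python3 -c 'import json, sys
for hit in json.load(sys.stdin)["hits"]["hit"]:
    print(hit["id"], hit["data"]["title"][0])'
```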

So, there it is! Your own fully searchable AWS CloudSearch domain for the Supreme Court. Not so bad after all, was it?

cloud-computing aws legal-tech search tutorial
