Building an AWS CloudSearch domain for the Supreme Court

  It should be pretty clear by now that two things I’m very interested in are cloud computing and legal informatics.  What better way to show it than to put together a simple AWS CloudSearch tutorial using Supreme Court decisions as the context?  The steps below should take you through creating a fully functional search domain on AWS CloudSearch for Supreme Court decisions.

Acquiring Supreme Court decision data

  Our first step is to acquire a public domain copy of Supreme Court decisions from Carl Malamud’s resource.org.  You can navigate to the courts.gov/c directory on bulk.resource.org and download US.tar.bz2, or just run something like:

$ wget http://bulk.resource.org/courts.gov/c/US.tar.bz2

Once the download is done, extract the archive:

$ tar xjf US.tar.bz2

  We should now have a directory called US containing 62,839 files and about 1.1GB of data.  Let’s assume that you put this directory under something like /data/courts/US.
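
  If you want to confirm that the extraction went as expected, a quick check like the one below can help.  This is only a sketch: the function names are my own, and DATA_DIR is an assumed location that defaults to the path used in this post.

```shell
# Print the number of regular files under a directory.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Print the total size of a directory in human-readable form.
dir_size() {
  du -sh "$1" | cut -f1
}

# DATA_DIR is an assumption; point it at wherever you extracted the archive.
DATA_DIR="${DATA_DIR:-/data/courts/US}"
if [ -d "$DATA_DIR" ]; then
  echo "files: $(count_files "$DATA_DIR")"
  echo "size:  $(dir_size "$DATA_DIR")"
else
  echo "no such directory: $DATA_DIR" >&2
fi
```

  The numbers should land close to the file count and size quoted above; a much smaller count usually means the archive did not extract fully.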

Setting up Cloud Search command line tools

  The next step is easy – go follow my guide on setting up Cloud Search command line tools!  I’ll assume that you placed everything under /opt/aws/cloud-search-tools, just like in that post.

Creating a Cloud Search Domain

  OK, we should now have a dataset and the Cloud Search API at our fingertips.  It’s time to create a Cloud Search “domain” that we can populate with records.  To do so, you can either follow the instructions on your AWS Management Console or run the following:

$ /opt/aws/cloud-search-tools/bin/cs-create-domain -d scotus

  This may take a while; sometimes up to 15 minutes.  Go grab a coffee or a beer and read your feed while you wait.  You can check the status either through the Management Console in your browser or with the following command:

$ /opt/aws/cloud-search-tools/bin/cs-describe-domain -d scotus
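
  Rather than re-running that command by hand, you could poll it.  The helper below is a generic sketch, not part of the Cloud Search tools: wait_for_output is a name I made up, and it assumes the status output contains the word ACTIVE once the domain is ready, which I have not checked against every version of the tools.

```shell
# Re-run a command until its output contains a pattern, or give up.
# $1 = pattern, $2 = seconds between attempts, $3 = max attempts,
# remaining arguments = the command to run
wait_for_output() {
  pattern="$1"; interval="$2"; attempts="$3"
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Succeed as soon as the command's output mentions the pattern.
    if "$@" 2>/dev/null | grep -q "$pattern"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

  For example, to check once a minute for up to 20 minutes:

$ wait_for_output ACTIVE 60 20 /opt/aws/cloud-search-tools/bin/cs-describe-domain -d scotus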

  Once this step is complete, you should see an ACTIVE domain with 0 documents.  We now need to reconfigure the access policies so that the domain accepts document uploads from our IP address and lets anyone search.  Replace IP_ADDRESS below with your own public IP:

$ /opt/aws/cloud-search-tools/bin/cs-configure-access-policies -d scotus --update --allow IP_ADDRESS --service doc
$ /opt/aws/cloud-search-tools/bin/cs-configure-access-policies -d scotus --update --allow all --service search

This policy change may take a few minutes to go into effect.
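
  Since a wrong IP is an easy mistake to make here, it can help to check the value before using it.  The sketch below uses two hypothetical helpers of my own: is_ipv4 only checks that the value looks like a dotted quad (it does not check that each octet is at most 255), and print_policy_commands prints the two commands above without running them.

```shell
# Return success if the argument looks like a dotted-quad IPv4 address.
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Print the two access-policy commands for a given IP, without running them.
print_policy_commands() {
  tools=/opt/aws/cloud-search-tools/bin
  if ! is_ipv4 "$1"; then
    echo "not an IPv4 address: $1" >&2
    return 1
  fi
  echo "$tools/cs-configure-access-policies -d scotus --update --allow $1 --service doc"
  echo "$tools/cs-configure-access-policies -d scotus --update --allow all --service search"
}
```

  One way to fill in the address, assuming https://checkip.amazonaws.com is reachable and that the IP it reports is the one Cloud Search will see from your machine:

$ print_policy_commands "$(curl -s https://checkip.amazonaws.com)"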

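Lastly, we need to tell the domain which fields to index for each document.  Here we define two text fields, title and content, and mark both as returnable in search results: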

$ /opt/aws/cloud-search-tools/bin/cs-configure-fields -d scotus --name title --type text --option result
$ /opt/aws/cloud-search-tools/bin/cs-configure-fields -d scotus --name content --type text --option result

Populating the Cloud Search Domain

  OK, we’re ready to go!  At this point, we need to generate Search Data Format (SDF) files to populate the domain.  There are two approaches we can take:

  1. Write a parser to extract exactly the text content and metadata we want.
  2. Throw the pre-packaged cs-generate-sdf utility at our data and hope for the best.

  For brevity’s sake, we’ll pursue option 2.  After some poking around, I’ve found that cs-generate-sdf is based on a common open-source content extraction library – Apache Tika.  You might be familiar with Tika, as it’s the guts behind Solr’s ability to ingest unstructured data.  So if you’d be happy naively ingesting the content in Solr, you’ll probably be happy with the results that cs-generate-sdf produces.

  While we could build something more complex, let’s stick to bash here:

$ for d in $(find /data/courts/US/ -type d);
do
  /opt/aws/cloud-search-tools/bin/cs-generate-sdf --source "$d/*.html" -d scotus;
done

  A few things to note:

  • If you see error messages like “Request forbidden by administrative rules” or “403 Forbidden”, your access policies have not taken effect or you provided the wrong IP for the document service.
  • You should see lots of lines go by; two for every file that is being parsed.
  • This step can be parallelized, but will almost certainly be disk-bound unless you are running on some kind of RAID or NAS setup that allows for concurrent reads.
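
  If you do want to try the parallel route, one possible sketch uses find with xargs -P.  The function name generate_sdf_parallel and the SDF_TOOL variable are my own; the quoted glob is passed to cs-generate-sdf unexpanded, just as in the loop above.  I have not benchmarked this, and per the note above it may not help on a single disk.

```shell
# Path to the SDF generator, as installed earlier in this post.
SDF_TOOL="${SDF_TOOL:-/opt/aws/cloud-search-tools/bin/cs-generate-sdf}"

# Run the SDF generator once per directory, several directories at a time.
# $1 = root of the extracted archive, $2 = number of parallel jobs
generate_sdf_parallel() {
  # -print0 / -0 keep directory names intact even if they contain spaces;
  # -I{} substitutes each directory into the --source argument.
  find "$1" -type d -print0 |
    xargs -0 -P "$2" -I{} "$SDF_TOOL" --source "{}/*.html" -d scotus
}
```

  For example, to run four jobs at once:

$ generate_sdf_parallel /data/courts/US 4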

  This could take a while; it took about 45 minutes to generate and transmit on my i7 2600k/32GB RAM/SATA III SSD workstation.  You should grab another coffee or beer and watch a show.

  Another caveat: even after you’ve transmitted all data up to the cloud, it will still take some time for the Cloud Search instance to churn through the data and complete indexing.

Searching the Cloud Search Domain

  Once the Cloud Search instance is fully built, it’s time to figure out how to search.  The best way to do this is, sadly, to read the developer documentation.  However, if you want to skip the boring parts, just try running something like this:

$ curl 'http://search-scotus-domain_id.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q="clear%20and%20present%20danger"&return-fields=title'

  This search looks for an exact phrase match on “clear and present danger” and returns not only the document ID, but also the title property of the document.  You should get back something like this:

{"rank":"-text_relevance","match-expr":"(label '"clear and present danger"')","hits":{"found":100,"start":0,"hit":[{"id":"d__data_courts_us_395_395_us_444_492_html","data":{"title":["395 U.S. 444"]}},{"id":"d__data_courts_us_343_343_us_946_326_html","data":{"title":["343 U.S. 946"]}},{"id":"d__data_courts_us_341_341_us_494_336_html","data":{"title":["341 U.S. 494"]}},{"id":"d__data_courts_us_370_370_us_375_369_html","data":{"title":["370 U.S. 375"]}},{"id":"d__data_courts_us_435_435_us_829_76_1450_html","data":{"title":["435 U.S. 829"]}},{"id":"d__data_courts_us_328_328_us_331_473_html","data":{"title":["328 U.S. 331"]}},{"id":"d__data_courts_us_360_360_us_924_488_html","data":{"title":["360 U.S. 924"]}},{"id":"d__data_courts_us_414_414_us_890_72_6629_html","data":{"title":["414 U.S. 890"]}},{"id":"d__data_courts_us_295_295_us_441_665_html","data":{"title":["295 U.S. 441"]}},{"id":"d__data_courts_us_331_331_us_367_241_html","data":{"title":["331 U.S. 367"]}}]},"info":{"rid":"90c9b0fdba3e834bd8a0834c12371bbbcbe700391fa33547ff19c86ee8af36004f16216852072604","time-ms":5,"cpu-time-ms":0}}
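
  To make the query and the response easier to work with from a script, something like the following might do.  Both helpers are hypothetical names of mine.  build_search_url only encodes spaces and wraps the phrase in encoded double quotes, so it is not a general URL encoder, and extract_titles is a plain text match against the response shape shown above, not a real JSON parser.

```shell
# Build an exact-phrase search URL in the same form as the curl example.
# $1 = search endpoint host, $2 = phrase, $3 = field to return
build_search_url() {
  # Minimal encoding only: spaces become %20; other reserved characters
  # are NOT handled here.
  phrase=$(printf '%s' "$2" | sed 's/ /%20/g')
  printf 'http://%s/2011-02-01/search?q=%%22%s%%22&return-fields=%s\n' "$1" "$phrase" "$3"
}

# Print one title per line from a search response read on stdin.
extract_titles() {
  grep -o '"title":\["[^"]*"]' | sed 's/^"title":\["//; s/"]$//'
}
```

  Used together, with your own search endpoint in place of the placeholder host:

$ curl -s "$(build_search_url search-scotus-domain_id.us-east-1.cloudsearch.amazonaws.com 'clear and present danger' title)" | extract_titles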

  So, there it is! Your own fully searchable AWS Cloud Search domain for the Supreme Court. Not so bad after all, was it?