It should be pretty clear by now that two things I’m very interested in are cloud computing and legal informatics. What better way to show it than to put together a simple AWS CloudSearch tutorial using Supreme Court decisions as the context? The steps below should take you through creating a fully functional search domain on AWS CloudSearch for Supreme Court decisions.
Acquiring Supreme Court decision data
Our first step is to acquire a public domain copy of Supreme Court decisions from Carl Malamud’s resource.org. You can navigate to this directory and download US.tar.bz2, or just run something like:
$ wget http://bulk.resource.org/courts.gov/c/US.tar.bz2
Once the download is done, extract the archive:
$ tar xjf US.tar.bz2
We should now have a directory called US containing 62,839 files totaling about 1.1GB. Let’s assume that you put this directory under something like /data/courts/US.
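If you want to sanity-check the extraction against those numbers, something like the following should do. This is just a minimal sketch: the helper name count_files is my own, and the path is the assumed location from above. Your counts may differ if the archive has changed since this was written.

```shell
# Count regular files under a directory (hypothetical helper name).
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Only run the check if the assumed directory actually exists.
if [ -d /data/courts/US ]; then
  echo "files: $(count_files /data/courts/US)"  # compare with the count quoted above
  du -sh /data/courts/US                        # total size on disk
fi
```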
Setting up Cloud Search command line tools
The next step is easy – go follow my guide on setting up Cloud Search command line tools! I’ll assume that you placed everything under /opt/aws/cloud-search-tools, just like in that post.
Creating a Cloud Search Domain
OK, we should now have a dataset and the Cloud Search API at our fingertips. It’s time to create a Cloud Search “domain” that we can populate with records. To do so, you can either follow the instructions on your AWS Management Console or run the following:
$ /opt/aws/cloud-search-tools/bin/cs-create-domain -d scotus
The domain may take a while to create, sometimes up to 15 minutes. Go grab a coffee or a beer and read your feed while you wait. You can check the status either through the Management Console in your browser or with the following line:
$ /opt/aws/cloud-search-tools/bin/cs-describe-domain -d scotus
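If you would rather not re-run that by hand, a small polling loop can do the waiting for you. This is a sketch under one big assumption: that the output of cs-describe-domain contains the literal word ACTIVE once the domain is ready. I have not pinned down the exact output format, so adjust the pattern to whatever you actually see. The helper name wait_for is hypothetical.

```shell
# Re-run a command until its output contains a pattern, or give up.
# usage: wait_for PATTERN MAX_TRIES SLEEP_SECONDS COMMAND [ARGS...]
wait_for() {
  pattern=$1; tries=$2; delay=$3
  shift 3
  while [ "$tries" -gt 0 ]; do
    # Success as soon as the command's output mentions the pattern.
    if "$@" 2>&1 | grep -q -- "$pattern"; then
      return 0
    fi
    tries=$((tries - 1))
    # Pause between attempts, but not after the final one.
    if [ "$tries" -gt 0 ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (not run here): poll once a minute for up to 20 minutes.
# The ACTIVE pattern is an assumption about the tool's output.
# wait_for ACTIVE 20 60 /opt/aws/cloud-search-tools/bin/cs-describe-domain -d scotus
```

The function returns non-zero if the pattern never shows up, so you can chain the next step with && if you like.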
Once this step is complete, you should see an ACTIVE domain with 0 documents. We now need to reconfigure the access policies so that the domain allows us to submit search material and anyone to search:
$ /opt/aws/cloud-search-tools/bin/cs-configure-access-policies -d scotus --update --allow IP_ADDRESS --service doc
$ /opt/aws/cloud-search-tools/bin/cs-configure-access-policies -d scotus --update --allow all --service search
This policy change may take a few minutes to go into effect.
Lastly, we need to tell the domain what we are indexing per document.
$ /opt/aws/cloud-search-tools/bin/cs-configure-fields -d scotus --name title --type text --option result
$ /opt/aws/cloud-search-tools/bin/cs-configure-fields -d scotus --name content --type text --option result
Populating the Cloud Search Domain
OK, we’re ready to go! At this point, we need to generate Search Data Format (SDF) files to populate the domain. There are two approaches we can take:
- Write a parser to extract exactly the text content and metadata we want.
- Throw the pre-packaged cs-generate-sdf utility at our data and hope for the best.
For brevity’s sake, we’ll pursue option 2. After some poking around, I’ve found that cs-generate-sdf is based on a common open-source content extraction library – Apache Tika. You might be familiar with Tika, as it’s the guts behind Solr’s ability to ingest unstructured data. So if you’d be happy naively ingesting the content in Solr, you’ll probably be happy with the results that cs-generate-sdf produces.
While we could build something more complex, let’s stick to bash here:
$ for d in `find /data/courts/US/ -type d`; do /opt/aws/cloud-search-tools/bin/cs-generate-sdf --source "$d/*.html" -d scotus; done
A few things to note:
- If you see error messages like “Request forbidden by administrative rules” or “403 Forbidden”, your access policies have not taken effect or you provided the wrong IP for the document service.
- You should see lots of lines go by; two for every file that is being parsed.
- This step can be parallelized, but will almost certainly be disk-bound unless you are running on some kind of RAID or NAS setup that allows for concurrent reads.
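If you do want to try the parallel route, here is one way it could look. Treat it as a sketch, not something I benchmarked: the helper name per_dir_parallel is hypothetical, it relies on the -print0, -0 and -P options found in GNU and BSD find/xargs, and whether it helps at all depends on your disks, as noted above.

```shell
# Run a command once per directory under ROOT, up to JOBS at a time.
# usage: per_dir_parallel ROOT JOBS COMMAND [ARGS...]
# Each directory path is appended as the final argument of COMMAND.
per_dir_parallel() {
  root=$1; jobs=$2
  shift 2
  # NUL-separated paths keep directory names with spaces intact.
  find "$root" -type d -print0 | xargs -0 -n 1 -P "$jobs" "$@"
}

# Example (not run here), 4 workers. The sh -c wrapper turns the
# appended directory ($1) into the same --source glob used in the loop above.
# per_dir_parallel /data/courts/US 4 sh -c \
#   '/opt/aws/cloud-search-tools/bin/cs-generate-sdf --source "$1/*.html" -d scotus' _
```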
This could take a while; it took about 45 minutes to generate and transmit on my i7 2600k/32GB RAM/SATA III SSD workstation. You should grab another coffee or beer and watch a show.
Another caveat: even after you’ve transmitted all data up to the cloud, it will still take some time for the Cloud Search instance to churn through the data and complete indexing.
Searching the Cloud Search Domain
Once the Cloud Search instance is fully built, it’s time to figure out how to search. The best way to do this is, sadly, to read the developer documentation. However, if you want to skip the boring part, just try running something like this:
$ curl 'http://search-scotus-domain_id.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q="clear%20and%20present%20danger"&return-fields=title'
This search looks for an exact phrase match on “clear and present danger” and returns not only the document ID, but also the title property of the document. You should get back something like this:
{"rank":"-text_relevance","match-expr":"(label '"clear and present danger"')","hits":{"found":100,"start":0,"hit":[{"id":"d__data_courts_us_395_395_us_444_492_html","data":{"title":["395 U.S. 444"]}},{"id":"d__data_courts_us_343_343_us_946_326_html","data":{"title":["343 U.S. 946"]}},{"id":"d__data_courts_us_341_341_us_494_336_html","data":{"title":["341 U.S. 494"]}},{"id":"d__data_courts_us_370_370_us_375_369_html","data":{"title":["370 U.S. 375"]}},{"id":"d__data_courts_us_435_435_us_829_76_1450_html","data":{"title":["435 U.S. 829"]}},{"id":"d__data_courts_us_328_328_us_331_473_html","data":{"title":["328 U.S. 331"]}},{"id":"d__data_courts_us_360_360_us_924_488_html","data":{"title":["360 U.S. 924"]}},{"id":"d__data_courts_us_414_414_us_890_72_6629_html","data":{"title":["414 U.S. 890"]}},{"id":"d__data_courts_us_295_295_us_441_665_html","data":{"title":["295 U.S. 441"]}},{"id":"d__data_courts_us_331_331_us_367_241_html","data":{"title":["331 U.S. 367"]}}]},"info":{"rid":"90c9b0fdba3e834bd8a0834c12371bbbcbe700391fa33547ff19c86ee8af36004f16216852072604","time-ms":5,"cpu-time-ms":0}}
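To try other phrases without hand-encoding the URL each time, a tiny helper can build the query string for you. This is a minimal sketch: the function name search_url is my own, it only encodes spaces (any other reserved characters in your phrase are left untouched), and the endpoint in the example is the same placeholder used above.

```shell
# Build a phrase-search URL in the same shape as the curl example above.
# usage: search_url ENDPOINT PHRASE RETURN_FIELD
search_url() {
  endpoint=$1
  # Encode spaces as %20; other reserved characters are NOT handled here.
  phrase=$(printf '%s' "$2" | sed 's/ /%20/g')
  printf 'http://%s/2011-02-01/search?q="%s"&return-fields=%s\n' \
    "$endpoint" "$phrase" "$3"
}

# Example (not run here); substitute your own domain's search endpoint.
# curl "$(search_url search-scotus-domain_id.us-east-1.cloudsearch.amazonaws.com 'clear and present danger' title)"
```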
So, there it is! Your own fully searchable AWS Cloud Search domain for the Supreme Court. Not so bad after all, was it?
Michael,
Great post!
A couple of niggles:
http://bulk.resource.org/courts.gov/c/US.tar.bz2
obtains US Supreme Court decisions as of March 2008. Yes?
Do you have a pointer to the tagging of the text? I am looking at some old PHP code but knowing the DTD would be a lot easier.
Thanks again!
Patrick
Hi Patrick,
Thanks. That link will download whatever Carl currently has; right now, the data quality is very mixed by year, but should end at volume 546 of the U.S. Reporter. Not sure where you found the 2008 number.
As far as tagging, what do you mean? I don’t think there is any POS tagging being performed by A9/CloudSearch under the hood.
Wow. That is so elegant and logical and clearly explained. Brilliantly goes through what could be a complex process and makes it obvious.