Dan and I recently released a new legal informatics project with a few colleagues. The project, which we’ve named the Legal Language Explorer, provides an interface similar to the Google Ngram Viewer for the U.S. Supreme Court. Unlike Google’s viewer, however, the Legal Language Explorer also allows users to drill down into case-level information for each n-gram. The technical architecture that lets us serve both views robustly wasn’t simple, but I felt that the experience and the result were worth sharing.
Before I get into the technical aspects, I wanted to mention one especially important feature of the project – the low cost of failure in interaction. When a system is engineered so that the downside of a bad choice is small relative to the payoff, people are much more willing to experiment. Sure, that search for “gorilla” was a dud, but it only took 50ms. Before disappointment can even register, you’ve already gone on to “mickey mouse” or “Kant.” This may seem like an unimportant point, but when a search space is large, exploring it is only practical if the cost of each attempt is close to zero. For Legal Language Explorer, this low cost is a direct consequence of combining two technologies – SQL and noSQL.
(Aside: While SQL and noSQL are often viewed as antithetical, I think this is a product of a confused conversation. Our real goal should be to store and process data efficiently. noSQL is a response to the frustrations of attacking problems like graph traversal, document storage, or tall-and-skinny numeric data with a traditional, row-store SQL database. The world isn’t so black-and-white that we can generalize well, however. For example, SQL databases like Vertica can compete with noSQL databases like kdb for financial data, and SQL databases that support WITH RECURSIVE can compete with neo4j on shallow traversals.)
The front page of Legal Language Explorer displays a time series plot of usage for “interstate commerce,” “railroad,” and “deed” between 1791 and 2005. This plot contains 645 (year, n-gram) values and represents a total of 103,285 occurrences. If your experience is like most people’s, the plot has loaded before the page text has even rendered. Is it just a cached PNG? Is the data hard-coded into the page?
The answer lies in redis, a small, neatly written key-value store that many refer to as a noSQL database (to me, redis is more of an in-memory data structure server than a database). Regardless of what you call it, redis excels at providing blazingly fast read and write access to common structures. It does this by keeping all data live in memory, organized for efficient access. Here are some basic facts and figures on our redis instance:
- redis Architecture: Single 64-bit instance
- Memory: 12GB running (4GB dump)
- Keyspace: 103,281,603 keys, up to 214 fields per key
- Average Keyspace Hit Rate: 77.7%
Despite the size of these numbers, the HGETALL query that drives the Legal Language Explorer plot returns in under 50ms for every query we’ve tested, even under load from 100 concurrent requests. Could you deliver a solution like that in a traditional RDBMS?
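To make the access pattern concrete, here is a minimal sketch of how one hash per n-gram could drive the plot. The key scheme (`ngram:<phrase>`), the year-to-count field layout, and the sample counts are all assumptions made for illustration, not the site's actual schema. `FakeRedis` is a hypothetical in-memory stand-in that implements only `hset` and `hgetall`, so the example runs without a server; a real client with the same two methods should slot in, though that is untested here.

```python
from collections import defaultdict


class FakeRedis:
    """Hypothetical in-memory stand-in for a redis client (hashes only)."""

    def __init__(self):
        self._hashes = defaultdict(dict)

    def hset(self, name, key, value):
        # Redis stores hash fields and values as strings.
        self._hashes[name][str(key)] = str(value)

    def hgetall(self, name):
        # Like HGETALL, a missing key yields an empty mapping.
        return dict(self._hashes.get(name, {}))


def ngram_key(ngram):
    # Assumed key scheme: one hash per lowercased, whitespace-normalized n-gram.
    return "ngram:" + " ".join(ngram.lower().split())


def time_series(client, ngram, first_year, last_year):
    """Return [(year, count)] for every year in range, zero-filling gaps."""
    raw = client.hgetall(ngram_key(ngram))
    counts = {int(year): int(count) for year, count in raw.items()}
    return [(year, counts.get(year, 0)) for year in range(first_year, last_year + 1)]


client = FakeRedis()
# Made-up counts, for illustration only.
client.hset(ngram_key("interstate commerce"), 1824, 12)
client.hset(ngram_key("interstate commerce"), 1826, 3)

series = time_series(client, "Interstate  Commerce", 1824, 1827)
miss = time_series(client, "gorilla", 1824, 1825)
```

Under this assumed layout, a plot with three n-grams needs only three hash reads, no matter how many cases mention them, and a dud search like “gorilla” costs a single lookup that comes back empty.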
If you find an n-gram that you want to learn more about, Legal Language Explorer allows you to drill down and view the list of all cases that n-gram occurs in. This table provides the full citation, year, and title for every case. Since there are often many parties with long names, storing this in redis would be expensive! What do we do?
For drill-down, we rely on a traditional, row-store database; in this case, postgres. The recipe is fairly straightforward. Populate a (somewhat) normalized database with a case table and an occurrence table, where each occurrence is an (ngram, case_id) pair. Throw in an index on the n-gram, and you’re done. The final product looks something like this:
- SELECT COUNT(*) FROM occurrences: 298,471,830
- Indices: 10GB
- Total Tablespace: 25GB
When we serve up data to the user, we SELECT on occurrences, constraining by n-gram, and JOIN the matching cases. (Would normalizing the n-grams help us get to 3NF? Yes, but it didn’t help us conserve CPU while competing with redis.)
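The recipe above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for postgres so that it runs anywhere, and the table names, column names, index name, and sample rows are all hypothetical placeholders rather than our production schema or real cases.

```python
import sqlite3

# In-memory SQLite database standing in for postgres (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases (
        id       INTEGER PRIMARY KEY,
        citation TEXT NOT NULL,
        year     INTEGER NOT NULL,
        title    TEXT NOT NULL
    );
    -- The n-gram text is stored inline, i.e. deliberately not normalized.
    CREATE TABLE occurrences (
        ngram   TEXT NOT NULL,
        case_id INTEGER NOT NULL REFERENCES cases (id)
    );
    CREATE INDEX occurrences_ngram_idx ON occurrences (ngram);
""")

# Placeholder rows, not real cases.
conn.executemany("INSERT INTO cases VALUES (?, ?, ?, ?)", [
    (1, "1 Ex. 1", 1900, "Alpha v. Beta"),
    (2, "2 Ex. 2", 1950, "Gamma v. Delta"),
    (3, "3 Ex. 3", 1925, "Epsilon v. Zeta"),
])
conn.executemany("INSERT INTO occurrences VALUES (?, ?)", [
    ("railroad", 1), ("railroad", 2), ("deed", 3),
])


def cases_for(conn, ngram):
    """Return (citation, year, title) for every case containing the n-gram."""
    return conn.execute(
        """
        SELECT c.citation, c.year, c.title
        FROM occurrences AS o
        JOIN cases AS c ON c.id = o.case_id
        WHERE o.ngram = ?
        ORDER BY c.year
        """,
        (ngram,),
    ).fetchall()


rows = cases_for(conn, "railroad")
```

The index on `occurrences.ngram` lets the database find the matching rows without scanning the whole table, and the join then pulls in the long party names only for the cases that matched.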
The end result is that most queries are returned in under 10 seconds. This is great news for everyone who wanted a list of Supreme Court opinions referencing Elvis Presley. Search away!
Wrapping it up
Legal Language Explorer allows users to explore n-grams in the Supreme Court corpus at the macroscopic and microscopic level. This dual mandate is met through a marriage of SQL and noSQL, and we feel that the whole experience is much greater than the sum of the parts. Over the next few months, we hope to raise enough money to add new data and features to the site, so please stay tuned!
(If you’re really paying attention to the geeky stuff, you may have noticed that I said nothing about the hardware we’re running on. I didn’t want to draw attention to this, since this post is about software, not hardware. However, if you have to know, the site is running on an EC2 m2.2xlarge spot instance for about $10/day. Not bad!)