Building a better legal search engine, part 1: Searching the U.S. Code

  As I mentioned last week, I’m excited to give a keynote in two weeks on Law and Computation at the University of Houston Law Center alongside Stephen Wolfram, Carl Malamud, Seth Chandler, and my buddy Dan from CLS.  The first part in my blog series leading up to this talk will focus on indexing and searching the U.S. Code with structured, public domain data and open source software.

  Before diving into the technical aspects, I thought it would be useful to provide some background on what the U.S. Code is and why it exists.  Let’s start with an example – the Dodd-Frank Wall Street Reform and Consumer Protection Act.  After the final version of HR 4173 was passed by both houses and enrolled in July of 2010, it received a new identifier, Public Law 111-203.  This public law, along with private laws, resolutions, amendments, and proclamations, is published in order of enactment in the Statutes at Large.  The Statutes at Large is therefore a compilation of all these sources of law dating back to the Declaration of Independence itself, and as such, is the authoritative source of statutory law.

  If we think about the organization and contents of the Statutes at Large, it quickly becomes clear why the Code exists.  The basic task of a legal practitioner is to determine what the state of law is with respect to a given set of facts at a certain time, typically now.  Let’s return to the Dodd-Frank example above.  Let’s say we’re in the compliance department at a financial institution and we’d like to know how the new proprietary trading rules affect us. To do this, we might perform the following tasks:

  • Search for laws by concept, e.g., depository institution or derivative.
  • Ensure that these laws are current and comprehensive.
  • Build a set of rules or guidelines from these laws.
  • Interpret these rules in the context of our facts.

  However, the Statutes at Large is not well-suited to these tasks.

  • It is sorted by date of enactment, not by concept.
  • It contains laws that may affect multiple legal concepts.
  • It contains laws that reference other laws for definitions or rules.
  • It contains laws that amend or repeal other laws.

   Based on our goal and these properties of the Statutes, we need to perform an exhaustive search every time we have a new question. This is pretty clearly bad if we want to get anything done (but hey, maybe you’re not in-house and you bill by the hour). So what might we do to re-organize the Statutes to make it easier for us to use the law?

  • Organize the law by concept, possibly hierarchically.
  • Combine laws that refer to or amend one another.
  • Remove laws that have expired or have been repealed.
  • Provide convenient citations or identifiers for legal concepts.

   A systematic organization of the Statutes at Large that followed these rules would make our lives significantly easier. We could search for concepts and use the hierarchical context of these results to navigate related ideas. We could rest assured that the material we read was near-comprehensive and current. Furthermore, we could communicate more succinctly by referencing a small number of organized sections instead of hundreds of Public Laws.

   As you might have guessed, this organizational scheme defines the United States Code as produced by the Office of the Law Revision Counsel. While the LRC traditionally distributed copies of the Code as ASCII files on CD-ROMs, they recently began distributing copies of the Code in XHTML. We’ll be using these copies to build our index, so if you’d like to follow along, you should download them from here – http://uscode.house.gov/xhtml/.
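  To make the hierarchy and citation ideas a bit more concrete, here is a minimal sketch of how a section might be modeled as a node in a tree, with a citation built from a title number and a section identifier. The class names (CodeNode, CodeCitation) and their methods are hypothetical and only for illustration; they are not part of the LRC data or of the code in this series.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type: one level of the Code (title, chapter, section, ...).
class CodeNode {
    final String label;      // e.g. "Title 7", "CHAPTER 1", "Sec. 6s"
    final CodeNode parent;   // null for a top-level title
    final List<CodeNode> children = new ArrayList<CodeNode>();

    CodeNode(String label, CodeNode parent) {
        this.label = label;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }

    // Walk up to the root and return the labels from the title down to this node.
    List<String> path() {
        List<String> labels = new ArrayList<String>();
        for (CodeNode n = this; n != null; n = n.parent) {
            labels.add(0, n.label);
        }
        return labels;
    }
}

class CodeCitation {
    // Build a citation such as "7 U.S.C. 6s" from a title number and a section id.
    static String cite(int title, String section) {
        return title + " U.S.C. " + section;
    }
}
```

  The path from title to section is what gives a search result its context, and the short citation is what lets us refer to a section without naming the Public Laws behind it.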

  If we’d like to build a legal search engine, the Code is arguably the best place to start. While there are other important regulatory and judicial sources like the Code of Federal Regulations or the Federal Reporter, the Code is as close to capital-L Law as it gets.

  In this part of the post series, I’m going to build an index of the text of the Code from the 2009 and 2010 LRC snapshots. To do this, we’ll use the excellent Apache Lucene library for Java. Lucene is, in their own words, "a high-performance, full-featured text search engine library written entirely in Java." As we’ll see in later posts, Lucene (with its sister project, Solr) is a very easy and powerful tool to develop fast, web-based search interfaces. Before we dive into the code below the break, let’s take a look at what we’re working towards. Below is a search for the term "swap" across the entire Code. We’re displaying the top five results, and these were produced in a little over a second on my laptop.

$ mvn -q exec:java -Dexec.mainClass="org.mjb.searchCodeIndex" -Dexec.args="text swap"
documentid:7 U.S.C. 6s
currentthrough:20110107
score:2.2053032
itempath:
Title 7
CHAPTER 1
>§6s. Registration and regulation of swap dealers and major swap participants

documentid:7 U.S.C. 6r
currentthrough:20110107
score:2.0396917
itempath:
Title 7
CHAPTER 1
>§6r. Reporting and recordkeeping for uncleared swaps

documentid:7 U.S.C. 7b-3
currentthrough:20110107
score:1.7781076
itempath:
Title 7
CHAPTER 1
>§7b–3. Swap execution facilities

documentid:7 U.S.C. 24a
currentthrough:20110107
score:1.6279716
itempath:
Title 7
CHAPTER 1
>§24a. Swap data repositories

documentid:15 U.S.C. 77b-1
currentthrough:20100201
score:1.5701554
itempath:
Title 15
CHAPTER 2A
>SUBCHAPTER I
>>§77b–1. Swap agreements

Building the index: buildCodeIndex.java

  The first step is to construct the index of the Code with Lucene. In order to do this, I extract the LRC XHTML files from the ZIPs in-memory, tokenize these XHTML documents into sections, and pass these sections into Lucene’s analyzer. This analyzer stems the document’s tokens and strips stopwords, leaving only frequency information on the possible terms. This information, along with metadata about the section, is added into the Lucene index. The process took a little over two minutes on my laptop and around three minutes on a c1.xlarge EC2 instance.
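  For readers who want a feel for what the analysis step does, here is a toy, standard-library-only sketch of the idea: lowercase the text, drop stopwords, stem crudely, and record per-section term frequencies in an inverted index. This is not Lucene and not buildCodeIndex.java. The class name ToyIndexer, the tiny stopword list, and the strip-a-trailing-"s" stemmer are all assumptions made for illustration, and the sketch skips the ZIP extraction and XHTML parsing entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ToyIndexer {
    // Tiny illustrative stopword list; real analyzers ship much longer ones.
    static final Set<String> STOPWORDS = new HashSet<String>(
            Arrays.asList("a", "an", "and", "for", "of", "or", "the", "to"));

    // term -> (document id -> number of occurrences in that document)
    final Map<String, Map<String, Integer>> postings =
            new HashMap<String, Map<String, Integer>>();

    // Lowercase, split on non-alphanumerics, drop stopwords, then stem.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    // Crude stand-in for a real stemmer: strip one trailing "s" from longer words.
    static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Record term frequencies for one section under its document id.
    void addSection(String documentId, String text) {
        for (String term : analyze(text)) {
            Map<String, Integer> perDoc = postings.get(term);
            if (perDoc == null) {
                perDoc = new HashMap<String, Integer>();
                postings.put(term, perDoc);
            }
            Integer count = perDoc.get(documentId);
            perDoc.put(documentId, count == null ? 1 : count + 1);
        }
    }
}
```

  Calling addSection("7 U.S.C. 6s", ...) once per section builds up the term-to-section map. A real analyzer does the same kind of work with a proper stemmer and stopword list, and stores the result on disk rather than in a HashMap.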

Searching the index: searchCodeIndex.java

  Once we’ve built the index, we need to build a tool to search it. The code below is an example of a simple, single term search interface for the index. The results above for the "swap" term are an example of the output of this program.
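  As a rough illustration of what a single-term search involves (and explicitly not the searchCodeIndex.java listing), here is a standard-library-only sketch that looks a term up in an inverted index of term frequencies, scores each matching section with a simplified TF-IDF weight, and returns the top results. The class ToySearcher, its Hit type, and the exact scoring formula are assumptions for illustration; Lucene’s real scoring includes additional factors such as length normalization. The term is assumed to have already been normalized the same way as the indexed text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class ToySearcher {
    // One search hit: the section's identifier and its relevance score.
    static class Hit {
        final String documentId;
        final double score;

        Hit(String documentId, double score) {
            this.documentId = documentId;
            this.score = score;
        }
    }

    // Score every document containing the term with tf * idf, sort descending,
    // and keep at most topN hits.
    // postings: term -> (document id -> term frequency); totalDocs: index size.
    static List<Hit> search(Map<String, Map<String, Integer>> postings,
                            int totalDocs, String term, int topN) {
        List<Hit> hits = new ArrayList<Hit>();
        Map<String, Integer> perDoc = postings.get(term);
        if (perDoc == null) {
            return hits; // term never indexed: no results
        }
        // Rarer terms get a larger inverse document frequency.
        double idf = 1.0 + Math.log((double) totalDocs / (1 + perDoc.size()));
        for (Map.Entry<String, Integer> e : perDoc.entrySet()) {
            double tf = Math.sqrt(e.getValue()); // dampen repeated occurrences
            hits.add(new Hit(e.getKey(), tf * idf));
        }
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Double.compare(b.score, a.score);
            }
        });
        return hits.size() > topN ? new ArrayList<Hit>(hits.subList(0, topN)) : hits;
    }
}
```

  With topN set to 5, this mirrors the shape of the "swap" output above: a ranked list of document identifiers with scores, though the numbers themselves will not match Lucene’s.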

Compiling the project: pom.xml

  Last but not least, you might also want to use Maven to compile and manage this project. If you do, here’s the pom.xml that goes along with the code.

  Stay tuned next week for the next part in the series. I’ll be using Apache Mahout to build an intelligent recommender system and cluster the sections of the Code.