When I put together my original post on the length and complexity of Congressional bills, I was hoping to build forward momentum on the project.  The goal was a simple, sortable, and searchable interface for exploring and visualizing the data.  As usual, however, paying employers and consulting clients got in the way :)

While the interface doesn’t exist yet, I did want to provide an update on the current state of the project.  I’ve refactored all of the code into a cleaner, more flexible architecture.  The data layer and processing are all Python and friends (lxml, NLTK, dateutil).  All processed data is now persisted to SQLite, which allows for cleaner report generation and the potential for lightweight, read-only dynamic search.
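To make that concrete, here’s a minimal sketch of the processing-and-persistence step, matching the schema shown further down.  The element names, the hard-coded congress id, and the statistic names are illustrative stand-ins rather than the actual pipeline code:

import hashlib
import sqlite3
from dateutil import parser as dateparser
from lxml import etree
from nltk.tokenize import word_tokenize

def process_document(db_path, filename):
    """Parse one bill file and persist its metadata and statistics."""
    root = etree.parse(filename).getroot()
    # Illustrative element names; real bill XML varies by type and stage.
    title = root.findtext('.//official-title') or ''
    published = root.findtext('.//action-date')
    published = dateparser.parse(published).isoformat() if published else None
    tokens = word_tokenize(' '.join(root.itertext()))

    with open(filename, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()

    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute(
        "INSERT INTO document (filename, congress_id, published, title, md5, processed) "
        "VALUES (?, ?, ?, ?, ?, 1)",
        (filename, 112, published, title, digest))  # 112 = example congress id
    doc_id = cur.lastrowid
    # Each statistic lands as one (document_id, statistic_id, statistic_value) row.
    for stat_id, value in (('token_count', len(tokens)),
                           ('unique_token_count', len(set(tokens)))):
        cur.execute("INSERT INTO document_statistics VALUES (?, ?, ?)",
                    (doc_id, stat_id, value))
    con.commit()
    con.close()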

You can consume this in the following ways:

Here’s the SQLite schema:
sqlite> .schema
CREATE TABLE congress (
	id INTEGER PRIMARY KEY,
	description VARCHAR
);
CREATE TABLE document (
	id INTEGER PRIMARY KEY AUTOINCREMENT,
	filename VARCHAR,
	congress_id INTEGER NOT NULL,
	session VARCHAR,
	published DATETIME,
	citation VARCHAR,
	type VARCHAR,
	title VARCHAR,
	stage VARCHAR,
	chamber VARCHAR,
	md5 VARCHAR,
	processed BOOLEAN,
	FOREIGN KEY(congress_id) REFERENCES congress(id)
);
CREATE TABLE document_statistics (
	document_id INTEGER NOT NULL,
	statistic_id VARCHAR,
	statistic_value REAL,
	FOREIGN KEY(document_id) REFERENCES document(id)
);
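
One nice side effect of the denormalized statistics table is that report generation reduces to plain SQL.  As a hypothetical example (the 'token_count' statistic name and bills.db filename are illustrative), a read-only query for average bill length per Congress might look like this:

import sqlite3

# Open read-only so a dynamic search frontend can't mutate the data.
con = sqlite3.connect('file:bills.db?mode=ro', uri=True)
query = """
    SELECT c.description, AVG(ds.statistic_value) AS avg_tokens
    FROM document_statistics AS ds
    JOIN document AS d ON d.id = ds.document_id
    JOIN congress AS c ON c.id = d.congress_id
    WHERE ds.statistic_id = 'token_count'
    GROUP BY c.id
    ORDER BY c.id
"""
for description, avg_tokens in con.execute(query):
    print(f'{description}: {avg_tokens:,.0f} tokens on average')
con.close()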

P.S. If you haven’t seen it yet, check out the GovTrack Bill Prognosis blog post.  While I’m a little bummed that Dan Katz, Dan Magleby, and I got scooped, I’m happy to see information like this provided to the public.