Statistics on the length and linguistic complexity of bills

Where would you go to find out what the longest bill of the 112th Congress was by number of sections (H.R. 1473)? How about by number of unique words (H.R. 3671)? What about by Flesch-Kincaid reading level (S. 475)?

Head on over to this table of bills, updated daily for the 112th Congress, which contains the following fields:

Bill Name
Publish Date
Bill Title
Stage
Section Count
Sentence Count
Word (Token) Count
Unique Words (Tokens)
Unique Stem Count
Avg. Word Length
Avg. Sentence Length
Reading Level (Flesch-Kincaid)
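
For a rough sense of how fields like these can be computed from a bill's plain text, here is a minimal Python sketch. It is not the code behind the table: the function names `count_syllables` and `bill_statistics`, the regex tokenizer, the vowel-group syllable heuristic, and the `SECTION n.` / `SEC. n.` pattern used to count sections are all assumptions made for illustration. The reading level uses the standard Flesch-Kincaid grade formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): count vowel groups, discount a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def bill_statistics(text):
    """Compute simple length and readability statistics for one bill's text."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Naive tokenizer: runs of ASCII letters, lowercased.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    # Hypothetical section marker: lines starting with "SECTION 1." or "SEC. 2."
    sections = re.findall(r"^[ \t]*(?:SECTION|SEC\.)[ \t]+\d+\.", text, re.MULTILINE)

    stats = {
        "section_count": len(sections),
        "sentence_count": len(sentences),
        "word_count": len(tokens),
        "unique_words": len(set(tokens)),
        "avg_word_length": 0.0,
        "avg_sentence_length": 0.0,
        "reading_level": 0.0,
    }
    if tokens and sentences:
        syllables = sum(count_syllables(t) for t in tokens)
        words_per_sentence = len(tokens) / len(sentences)
        stats["avg_word_length"] = sum(len(t) for t in tokens) / len(tokens)
        stats["avg_sentence_length"] = words_per_sentence
        # Flesch-Kincaid grade level.
        stats["reading_level"] = (
            0.39 * words_per_sentence
            + 11.8 * (syllables / len(tokens))
            - 15.59
        )
    return stats


if __name__ == "__main__":
    sample = "The Secretary shall submit a report. The report shall be public."
    for name, value in bill_statistics(sample).items():
        print(name, value)
```

The unique stem count is left out because it needs a real stemmer, for example NLTK's `PorterStemmer`. Real bill text also has abbreviations, enumerations, and citations that will trip up a sentence splitter this simple, so treat the numbers from this sketch as approximate.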

I’ll be adding more automated analysis and figures over the next few weeks, but for now, here’s a morsel to get your gears turning.

By bcllc | February 13th, 2012 | Law, Programming, Research | 7 Comments

About the Author: bcllc

7 Comments

Matt Barney 2012-02-14 at 07:54 - Reply

It would be interesting to superimpose a distribution of those bills that pass that could suggest a certain size for being more persuasive than smaller/bigger sizes.

Matt
- Michael J Bommarito II 2012-02-14 at 21:21 - Reply
  
  Hi Matt,
  Great idea. One of the dimensions that reports will be broken out by will be bill stage, so this analysis should be easy to perform once I release the new version.
Karen Suhaka 2012-02-14 at 19:18 - Reply

Great stuff. Can’t wait to see more figures!

I agree with Matt that it would be interesting to see if there’s a statistical difference between bills that passed, and those that didn’t.

I’ve got all the bills for the states in xml and in text. Want to check them out? Compare to federal? Compare states to each other? Let me know and I’ll get you access.

-k
- Michael J Bommarito II 2012-02-14 at 21:20 - Reply
  
  Hi Karen,
  I’d love to chat about working together on state legislation. Please feel free to email me when you get a chance.
Tom 2012-02-15 at 14:59 - Reply

This will be very interesting; I’m looking forward to future installments.

I hope you’ll forgive a criticism on the formatting of the graph. It took me several seconds to decipher the meaning of the coloring. Upon realizing that you were just double-plotting the data, I frankly felt that you were wasting my time. There may be some good reasons to overlay a heat map on a histogram (or, alternatively, to use marginal plots), but double-plotting the same data surely isn’t one of them.
Douglas Calvert 2012-04-09 at 01:50 - Reply

Can you post the code you used for this? I would love to see how word count / reading level fits in with the new bill passage prognosis from govtrack…
- Michael J Bommarito II 2012-04-14 at 12:41 - Reply
  
  Hi Doug,
  The code isn’t exactly release-quality at the moment, as I’m in the middle of a refactor :) I can point you at the NLTK contrib module `readability` that I used to get the F-K scores though: https://github.com/nltk/nltk_contrib/tree/master/nltk_contrib/readability
