
Statistics on the length and linguistic complexity of bills

  Where would you go to find out what the longest bill of the 112th Congress was by number of sections (H.R. 1473)? How about by number of unique words (H.R. 3671)? What about by Flesch-Kincaid reading level (S. 475)?

  Head on over to this table of bills, updated daily for the 112th Congress, which contains the following fields:

  • Bill Name
  • Publish Date
  • Bill Title
  • Stage
  • Section Count
  • Sentence Count
  • Word (Token) Count
  • Unique Words (Tokens)
  • Unique Stem Count
  • Avg. Word Length
  • Avg. Sentence Length
  • Reading Level (Flesch-Kincaid)
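
For readers curious how fields like these might be computed, here is a minimal sketch in plain Python. It is not the code behind the table: the function names `count_syllables` and `bill_stats` are hypothetical, the sentence splitter and tokenizer are simple regular expressions, and the syllable counter is a rough vowel-group heuristic. Only the Flesch-Kincaid grade formula itself, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, is standard. Section count and stem count are left out, since they need the bill's structure and a stemmer.

```python
import re


def count_syllables(word):
    """Estimate syllables by counting vowel groups (a heuristic, not a dictionary lookup)."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing 'e' as silent ("made"), except in "-le" endings ("table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def bill_stats(text):
    """Return basic length and readability statistics, or None for empty text."""
    # Naive sentence split on ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    # Tokens are runs of ASCII letters; numbers and punctuation are ignored.
    tokens = re.findall(r"[A-Za-z]+", text)
    if not sentences or not tokens:
        return None

    n_sent, n_tok = len(sentences), len(tokens)
    syllables = sum(count_syllables(t) for t in tokens)
    return {
        "sentence_count": n_sent,
        "token_count": n_tok,
        "unique_tokens": len({t.lower() for t in tokens}),
        "avg_word_length": sum(len(t) for t in tokens) / n_tok,
        "avg_sentence_length": n_tok / n_sent,
        # Flesch-Kincaid grade level.
        "flesch_kincaid_grade": (
            0.39 * n_tok / n_sent + 11.8 * syllables / n_tok - 15.59
        ),
    }


if __name__ == "__main__":
    for field, value in bill_stats("The cat sat. The dog ran fast.").items():
        print(field, value)
```

Legislative text is full of abbreviations and citations ("H.R.", "Sec. 2(a)") that a naive splitter like this one will treat as sentence boundaries, so a real pipeline would want a proper sentence tokenizer before trusting the sentence counts or reading levels.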

I’ll be adding more automated analysis and figures over the next few weeks, but for now, here’s a morsel to get your gears turning.

7 Comments

  1. Matt Barney

    It would be interesting to superimpose a distribution of the bills that pass, which could suggest whether a certain size is more persuasive than smaller or larger ones.

    Matt

    1. Michael J Bommarito II

      Hi Matt,
      Great idea. One of the dimensions that reports will be broken out by will be bill stage, so this analysis should be easy to perform once I release the new version.

  2. Karen Suhaka

    Great stuff. Can’t wait to see more figures!

    I agree with Matt that it would be interesting to see if there’s a statistical difference between bills that passed and those that didn’t.

    I’ve got all the bills for the states in XML and in text. Want to check them out? Compare to federal? Compare states to each other? Let me know and I’ll get you access.

    -k

    1. Michael J Bommarito II

      Hi Karen,
      I’d love to chat about working together on state legislation. Please feel free to email me when you get a chance.

  3. Tom

    This will be very interesting; I’m looking forward to future installments.

    I hope you’ll forgive a criticism on the formatting of the graph. It took me several seconds to decipher the meaning of the coloring. Upon realizing that you were just double-plotting the data, I frankly felt that you were wasting my time. There may be some good reasons to overlay a heat map on a histogram (or, alternatively, to use marginal plots), but double-plotting the same data surely isn’t one of them.

  4. Douglas Calvert

    Can you post the code you used for this? I would love to see how word count / reading level fits in with the new bill passage prognosis from GovTrack…

    1. Michael J Bommarito II

      Hi Doug,
      The code isn’t exactly release-quality at the moment, as I’m in the middle of a refactor :) I can point you at the NLTK contrib module `readability` that I used to get the F-K scores though: https://github.com/nltk/nltk_contrib/tree/master/nltk_contrib/readability
