This term, I'm teaching Complex Systems 530 - Computer Modeling for Complex Systems at the University of Michigan Center for the Study of Complex Systems. In the spirit of open science, all course material will be available online on GitHub. You can browse the repository here: https://github.com/mjbommar/cscs-530-w2015. In the course, we're exploring why and

## Advanced approximate sentence matching in Python

In our last post, we went over a range of options to perform approximate sentence matching in Python, an important step in many natural language processing and machine learning tasks. To begin, we defined terms like: tokens: a word, number, or other "discrete" unit of text; stems: words that have had their "inflected" pieces removed based on

## Fuzzy match sentences in Python

Let's imagine you have a sentence of interest.  You'd like to find all occurrences of this sentence within a corpus of text.  How would you go about this? The most obvious answer is to look for exact matches of the sentence.  You'd search through every sentence of your corpus, checking to see if every character of the
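
To make the contrast between exact and approximate matching concrete, here is a minimal sketch. The exact matcher follows the character-for-character idea described above; the fuzzy matcher uses the standard library's `difflib.SequenceMatcher`, which is an assumption chosen for illustration and not necessarily the approach taken in the post. The function names, the sample corpus, and the `0.8` threshold are all hypothetical.

```python
import difflib


def exact_matches(query, sentences):
    """Return the sentences that are character-for-character identical to query."""
    return [s for s in sentences if s == query]


def fuzzy_matches(query, sentences, threshold=0.8):
    """Return (sentence, ratio) pairs whose similarity ratio meets the threshold."""
    results = []
    for s in sentences:
        # Lowercase both sides so that case differences do not count as edits.
        ratio = difflib.SequenceMatcher(None, query.lower(), s.lower()).ratio()
        if ratio >= threshold:
            results.append((s, ratio))
    return results


# Hypothetical corpus: one exact copy, one near copy, one unrelated sentence.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumped over the lazy dog.",
    "An entirely unrelated sentence.",
]
query = "The quick brown fox jumps over the lazy dog."

exact = exact_matches(query, corpus)
fuzzy = fuzzy_matches(query, corpus)
```

With this corpus, the exact matcher finds only the identical sentence, while the fuzzy matcher also picks up the "jumped" variant. The threshold would need tuning on real text.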

## Isotonic Regressions in scikit-learn

Isotonic regression is a great tool to keep in your repertoire; it's like weighted least-squares with a monotonicity constraint. Why is this so useful, you ask? Take a look at the example relationship below. (You can follow along with the Python code here.) Let's imagine that the true relationship between x and y is characterized piecewise by a sharp
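
The "weighted least-squares with a monotonicity constraint" description can be illustrated with a small sketch. scikit-learn exposes isotonic regression as `sklearn.isotonic.IsotonicRegression`; the code below is not that implementation. It is a plain-Python version of the pool-adjacent-violators algorithm, written so the mechanics are visible, and it assumes the inputs are already sorted by x and that all weights are positive. The function name and sample data are hypothetical.

```python
def isotonic_fit(y, weights=None):
    """Weighted least-squares fit to y, constrained to be non-decreasing.

    Assumes y is ordered by increasing x and that every weight is positive.
    """
    if weights is None:
        weights = [1.0] * len(y)

    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for value, weight in zip(y, weights):
        blocks.append([float(value), float(weight), 1])
        # Pool adjacent blocks while the last two violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, w2, n2 = blocks.pop()
            mean1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(mean1 * w1 + mean2 * w2) / total, total, n1 + n2])

    # Expand each block back out to one fitted value per input point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Hypothetical noisy but mostly increasing observations.
y = [1.0, 3.0, 2.0, 4.0, 3.5, 5.0]
fitted = isotonic_fit(y)
```

Each out-of-order pair is replaced by its weighted mean, so the fit stays as close as possible to the data while never decreasing.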

## Is the Tax Code the longest Title?

Last week, I shared that Dan Katz and I had finally published a draft of our paper, Measuring the Complexity of the Law: The U.S. Code.  We'd previewed this research on Computational Legal Studies years ago.  Since then, we've received great feedback and a number of questions.   The most common question, even among legal professionals,

## Measuring the Complexity of the Law: The U.S. Code

Four years ago, Dan Katz and I began working on a project to measure the complexity of the law.  Its genesis was, in every sense, an accident; in order to properly identify citations to the IRC in our VTR empirical review of U.S. Tax Court decisions, we had to deal with the informal, non-Blue

## Revisiting text processing with R and Python

Back in 2011, I covered the relative performance difference of the most popular libraries for text processing in R and Python.   In case you can't guess the answer, Python and NLTK  won by a significant margin over R and tm.  Text processing with R seemed simple on paper, but performance and flexibility limitations have

## Generating SSH config from AWS hosts using boto

As a consultant and advisor to many firms running on or investigating AWS, I find SSH host and key management to be a constant struggle.  From IAM credentials to default OS logins, it's easy to lose time with constant lookups.  What we'd really like is to get a custom SSH config file for AWS.
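
As a rough idea of what such a generated file could look like, the sketch below renders OpenSSH `Host` stanzas from a list of host records. It deliberately leaves out the boto calls: the `hosts` dictionaries, their keys, the `ec2-user` default login, and the `~/.ssh/<key>.pem` key path convention are all assumptions for illustration, standing in for whatever instance metadata the real script collects from AWS.

```python
def render_ssh_config(hosts, default_user="ec2-user", key_dir="~/.ssh"):
    """Render OpenSSH config stanzas from a list of host dictionaries.

    Each host needs 'name' and 'address'; 'user' and 'key_name' are optional.
    """
    stanzas = []
    for host in hosts:
        lines = [
            f"Host {host['name']}",
            f"    HostName {host['address']}",
            f"    User {host.get('user', default_user)}",
        ]
        # Only emit IdentityFile when a key pair name is known for the host.
        if host.get("key_name"):
            lines.append(f"    IdentityFile {key_dir}/{host['key_name']}.pem")
        stanzas.append("\n".join(lines))
    return "\n\n".join(stanzas) + "\n"


# Hypothetical records; in practice these would be built from EC2 instance data.
hosts = [
    {"name": "web-1", "address": "203.0.113.10", "key_name": "prod"},
    {"name": "db-1", "address": "203.0.113.20", "user": "ubuntu"},
]
config = render_ssh_config(hosts)
```

The resulting string can be written to a separate file and passed to `ssh -F`, which keeps generated entries apart from a hand-maintained `~/.ssh/config`.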

## Git Repository for Congressional Bill Statistics

After a nice Twitter conversation this morning, I finally got the impetus to release the source for my Congressional Bill Statistics data. You can find the source at this GitHub repository. I haven't taken the time to review licensing yet, but I won't be asserting anything more than CC3 Attribution on my code.

## Summary of community detection algorithms in igraph 0.6

Based on Launchpad traffic and mailing list responses, Gabor and Tamas will soon be releasing igraph 0.6. In celebration, I’ll be publishing a number of helpful lists and tables I’ve put together to organize information about igraph. In this post, we’ll cover the community detection algorithms (i.e., clustering, partitioning, segmenting) available in 0.6

