Let’s imagine you have a sentence of interest.  You’d like to find all occurrences of this sentence within a corpus of text.  How would you go about this?

The most obvious answer is to look for exact matches of the sentence.  You’d search through every sentence of your corpus, checking to see if every character of the sentence matched exactly.

But what if capitalization, punctuation, or white-spacing varied in the slightest?  Let’s look at an example sentence:

"In the eighteenth century it was often convenient to regard man as a clockwork automaton."


Now, consider the two variations below.

"in the eighteenth century    it was often convenient to regard man as a clockwork automaton"
"In the eighteenth century, it was often convenient to regard man as a clockwork automaton."


We would fail to match the first one for two reasons:

  • Capitalization: “in” does not match “In”
  • White-space: “century it” does not match “century    it”

We would also fail to match the second one for a separate reason:

  • Punctuation: “century, it” does not match “century it”

None of these reasons seems like a good excuse.  The “substance” of the sentence is identical, and so we’d prefer to match these sentences to our target if possible.  What can we do?  We need to learn to fuzzy match sentences, not exact match sentences.

So let’s learn to perform fuzzy sentence matching, also known as “approximate” sentence matching.   To do this, we’ll need to move from matching exact strings to more flexible, natural-language-motivated definitions of equality.  Examples of alternative equality definitions include:

  • Exact case-insensitive token matching after stopword removal
  • Exact case-insensitive stem matching after stopword removal
  • Exact case-insensitive lemma matching after stopword removal
  • Partial case-insensitive lemma sequence matching after stopword removal
  • Partial case-insensitive lemma set similarity after stopword removal
  • Partial case-insensitive noun lemma set similarity after stopword removal

To help walk you through how to perform fuzzy sentence matching like this in Python, I’ve put together an IPython notebook here.  This notebook uses Python and NLTK to perform each of the approximate or fuzzy matching approaches in the list above.

Let’s talk through each of the approaches, using the following set of sentences as our test sample:

  1. “In the eighteenth century it was often convenient to regard man as a clockwork automaton.”
  2. “in the eighteenth century      it was often convenient to regard man as a clockwork automaton”
  3. “In the eighteenth century, it was often convenient to regard man as a clockwork automaton.”
  4. “In the eighteenth century, it was not accepted to regard man as a clockwork automaton.”
  5. “In the eighteenth century, it was often convenient to regard man as clockwork automata.”
  6. “In the eighteenth century, it was often convenient to regard man as clockwork automatons.”
  7. “It was convenient to regard man as a clockwork automaton in the eighteenth century.”
  8. “In the 1700s, it was common to regard man as a clockwork automaton.”
  9. “In the 1700s, it was convenient to regard man as a clockwork automaton.”
  10. “In the eighteenth century.”
  11. “Man as a clockwork automaton.”
  12. “In past centuries, man was often regarded as a clockwork automaton.”
  13. “The eighteenth century was characterized by man as a clockwork automaton.”
  14. “Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.”

Fuzzy match sentences in Python
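
For the walkthrough below, I'll keep the target sentence and the test sample in two plain Python variables.  (The names target and sentences are just illustrative; the accompanying notebook may organize things differently.)

    # The sentence we want to find, plus the fourteen test sentences above.
    target = ("In the eighteenth century it was often convenient "
              "to regard man as a clockwork automaton.")

    sentences = [
        "In the eighteenth century it was often convenient to regard man as a clockwork automaton.",
        "in the eighteenth century      it was often convenient to regard man as a clockwork automaton",
        "In the eighteenth century, it was often convenient to regard man as a clockwork automaton.",
        "In the eighteenth century, it was not accepted to regard man as a clockwork automaton.",
        "In the eighteenth century, it was often convenient to regard man as clockwork automata.",
        "In the eighteenth century, it was often convenient to regard man as clockwork automatons.",
        "It was convenient to regard man as a clockwork automaton in the eighteenth century.",
        "In the 1700s, it was common to regard man as a clockwork automaton.",
        "In the 1700s, it was convenient to regard man as a clockwork automaton.",
        "In the eighteenth century.",
        "Man as a clockwork automaton.",
        "In past centuries, man was often regarded as a clockwork automaton.",
        "The eighteenth century was characterized by man as a clockwork automaton.",
        "Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.",
    ]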

Approach #1 – Case-insensitive token matching after stopword removal

Our first improvement would be to match case-insensitive tokens after removing stopwords.  This means three things:

  • Ignoring whether a character is upper or lower-cased (if relevant).  For example, “Apple” and “apple” match.
  • Treating the sentence as a sequence of “tokens.”  Tokens can be thought of as words, but also include other symbols like “1st” or “3.1415.”
  • Ignoring “stopwords,” which are high-frequency, low-content words like “is”, “or”, “but”, “the”, and “if.”

Once we do this, we will be able to fuzzy match sentences that we previously missed.  Let’s look at what’s happening underneath the hood:

  • Input: “In the eighteenth century it was often convenient to regard man as a clockwork automaton.”
  • Output:  [‘eighteenth’, ‘century’, ‘often’, ‘convenient’, ‘regard’, ‘man’, ‘clockwork’, ‘automaton’]

When we did exact matching, we failed to match sentences #2 and #3 above because of the extra spaces and the additional comma.

Let’s take a look at these sentences when tokenized:

Sentence #2

  • Input: “in the eighteenth century      it was often convenient to regard man as a clockwork automaton”
  • Output:  [‘eighteenth’, ‘century’, ‘often’, ‘convenient’, ‘regard’, ‘man’, ‘clockwork’, ‘automaton’]

Sentence #3

  • Input: “In the eighteenth century, it was often convenient to regard man as a clockwork automaton.”
  • Output:  [‘eighteenth’, ‘century’, ‘often’, ‘convenient’, ‘regard’, ‘man’, ‘clockwork’, ‘automaton’]

As you can see, the tokens of sentences #2 and #3 now match our original sentence!  Our fuzzy sentence matching algorithm, shown below, blurs away whitespace, punctuation, case, and low-content words.  While it can still produce false-positive matches, this approach is a very conservative first option for fuzzy matching.
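
Here is a minimal sketch of this first approach using NLTK.  The helper names are mine, and it reuses the target and sentences variables defined earlier; the notebook's version may differ in detail.

    # Requires the NLTK data packages "punkt" and "stopwords":
    #   import nltk; nltk.download("punkt"); nltk.download("stopwords")
    import string

    from nltk import word_tokenize
    from nltk.corpus import stopwords

    STOPWORDS = set(stopwords.words("english"))

    def content_tokens(sentence):
        """Lowercase, tokenize, and drop stopwords and punctuation."""
        tokens = word_tokenize(sentence.lower())
        return [t for t in tokens
                if t not in STOPWORDS and t not in string.punctuation]

    def tokens_match(a, b):
        """Approach #1: exact match on the remaining token sequences."""
        return content_tokens(a) == content_tokens(b)

    for number, candidate in enumerate(sentences, start=1):
        if tokens_match(target, candidate):
            print(f"Sentence #{number} matches on tokens")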

Approach #2 – Case-insensitive stem matching after stopword removal

The next approach is to stem words instead of just tokenizing them.  Stemming is the process of “uninflecting” or “reducing” words to their stem.  In English, the most common examples involve the pluralization of nouns and the conjugation of verbs, for example:

  • “cats” -> “cat”
  • “dogs” -> “dog”
  • “printing” -> “print”
  • “printed” -> “print”

The idea behind stemming is that the inflected, i.e., declined or conjugated, forms of a word all share a common meaning.  For example, the words “cat” and “cats” both share the root concept of “cat”, even though the characters don’t match exactly.  Stemming is typically implemented, however, using preset rules that may or may not handle irregular or loan words.  For example, nouns from the Old English weak declension typically pluralize in -en instead of -s:

  • “children” -> “child”
  • “women” -> “woman”

However, most stemmers fail to handle irregular nouns like this.  The Porter and Snowball stemmers, for example, incorrectly yield:

  • “children” -> “children”
  • “women” -> “women”
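
A quick check with NLTK’s Porter and Snowball stemmers bears this out:

    from nltk.stem import PorterStemmer, SnowballStemmer

    porter = PorterStemmer()
    snowball = SnowballStemmer("english")

    for word in ["cats", "printing", "children", "women"]:
        print(word, porter.stem(word), snowball.stem(word))

    # cats     -> cat / cat
    # printing -> print / print
    # children -> children / children   (unchanged)
    # women    -> women / women         (unchanged)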

One option is to lemmatize instead of stem; we’ll look at this below.  It’s important to understand that while lemmatization is typically more accurate, it comes at a significant cost in complexity and may still fail for vernacular or irregular words.  First, though, the code below shows how to perform matching on stems:
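
Here is a minimal sketch: the same pipeline as Approach #1, with the surviving tokens stemmed before comparison.  (Helper names are again illustrative, and target and sentences are the variables defined earlier.)

    import string

    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import SnowballStemmer

    STOPWORDS = set(stopwords.words("english"))
    stemmer = SnowballStemmer("english")

    def content_stems(sentence):
        """Lowercase, tokenize, drop stopwords/punctuation, then stem."""
        tokens = word_tokenize(sentence.lower())
        return [stemmer.stem(t) for t in tokens
                if t not in STOPWORDS and t not in string.punctuation]

    def stems_match(a, b):
        """Approach #2: exact match on the stemmed token sequences."""
        return content_stems(a) == content_stems(b)

    for number, candidate in enumerate(sentences, start=1):
        if stems_match(target, candidate):
            print(f"Sentence #{number} matches on stems")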

Back to our sentences!  The methods above have successfully handled sentences #1, #2, and #3.    It turns out that sentences #4 and #5 are unfortunately not handled by our stemmer!  Sentence #4 features entirely different words, and sentence #5 features a Greek pluralization of automaton to “automata.”  The only sentence that we additionally handled via stemming is sentence #6, which features the English pluralization “automatons.”

Approach #3 – Case-insensitive lemma matching after stopword removal

Stemming is valuable, but with irregular words, we need to consider lemmatization.  Lemmatization, as discussed above, takes a more complex but comprehensive approach to breaking inflected words down to their base.    In the case of lemmatization, we call the resulting form a lemma instead of  a stem.

Let’s see how the WordNet lemmatizer handles our examples above:

  • “cats” -> “cat”
  • “dogs” -> “dog”
  • “printing” -> “printing”
  • “printed” -> “printed”
  • “children” -> “child”
  • “women” -> “woman”

One important source of complexity in lemmatization is that it relies on part-of-speech tagging.  The “printing” case above is a great example.  If “printing” occurs as a noun, then it is not an inflected form; however, if “printing” occurs as a verb, then its uninflected form is “print.”  Therefore, it’s critical that we provide the lemmatizer with the part-of-speech.  Doing this means that we need to first determine the parts-of-speech for all tokens, then lemmatize the non-stopword tokens.
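
For instance, NLTK’s WordNet lemmatizer gives different answers for “printing” depending on the part-of-speech we hand it:

    from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("printing", pos="n"))  # printing (already a base-form noun)
    print(lemmatizer.lemmatize("printing", pos="v"))  # print    (inflected form of the verb)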

The code example below handles this process.
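
Here is one way to sketch that process.  The Penn-Treebank-to-WordNet tag mapping and the helper names below are my own, so the notebook’s version may differ in detail.

    # Requires the NLTK data packages "punkt", "stopwords",
    # "averaged_perceptron_tagger", and "wordnet".
    import string

    from nltk import pos_tag, word_tokenize
    from nltk.corpus import stopwords, wordnet
    from nltk.stem import WordNetLemmatizer

    STOPWORDS = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def wordnet_pos(treebank_tag):
        """Map a Penn Treebank POS tag onto WordNet's coarse POS classes."""
        if treebank_tag.startswith("J"):
            return wordnet.ADJ
        if treebank_tag.startswith("V"):
            return wordnet.VERB
        if treebank_tag.startswith("R"):
            return wordnet.ADV
        return wordnet.NOUN

    def content_lemmas(sentence):
        """POS-tag the sentence, then lemmatize the non-stopword tokens."""
        lemmas = []
        for token, tag in pos_tag(word_tokenize(sentence)):
            token = token.lower()
            if token in STOPWORDS or token in string.punctuation:
                continue
            lemmas.append(lemmatizer.lemmatize(token, wordnet_pos(tag)))
        return lemmas

    def lemmas_match(a, b):
        """Approach #3: exact match on the lemma sequences."""
        return content_lemmas(a) == content_lemmas(b)

    for number, candidate in enumerate(sentences, start=1):
        if lemmas_match(target, candidate):
            print(f"Sentence #{number} matches on lemmas")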

Once we’ve executed this process, what sentences are we now able to match?  Now that we have switched from stemming to lemmatization, we have properly handled the Greek plural “automata;” as a result, we now match sentence #5.  Let’s recap where we are at this point, marking the sentences we can now match:

  1. “In the eighteenth century it was often convenient to regard man as a clockwork automaton.” (matched)
  2. “in the eighteenth century      it was often convenient to regard man as a clockwork automaton” (matched)
  3. “In the eighteenth century, it was often convenient to regard man as a clockwork automaton.” (matched)
  4. “In the eighteenth century, it was not accepted to regard man as a clockwork automaton.”
  5. “In the eighteenth century, it was often convenient to regard man as clockwork automata.” (matched)
  6. “In the eighteenth century, it was often convenient to regard man as clockwork automatons.” (matched)
  7. “It was convenient to regard man as a clockwork automaton in the eighteenth century.”
  8. “In the 1700s, it was common to regard man as a clockwork automaton.”
  9. “In the 1700s, it was convenient to regard man as a clockwork automaton.”
  10. “In the eighteenth century.”
  11. “Man as a clockwork automaton.”
  12. “In past centuries, man was often regarded as a clockwork automaton.”
  13. “The eighteenth century was characterized by man as a clockwork automaton.”
  14. “Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.”

At this point, we’ve learned a lot about tokenizing, stopwording, stemming, and lemmatizing.  However, we’ve only matched about half of the sentences that we would characterize as similar.  In the next post, we’ll go over partial matches based on the following techniques:

  • Case-insensitive partial lemma sequence matching after stopword removal
  • Case-insensitive lemma set Jaccard similarity after stopword removal
  • Case-insensitive noun lemma Jaccard similarity after stopword removal

If you’d like to skip ahead, or you’d like to see the IPython notebook accompanying this post, you can cheat and read ahead here to learn more about fuzzy matching sentences in Python.