Let's imagine you have a sentence of interest and you'd like to find all of its occurrences within a corpus of text. How would you go about this? The most obvious answer is to look for exact matches. But what if capitalization, punctuation, or whitespace varied even slightly?
Consider the sentence: "In the eighteenth century it was often convenient to regard man as a clockwork automaton." Small variations in capitalization, whitespace, or punctuation would cause exact matching to fail, even though the substance of the sentence is identical. We need to learn to fuzzy match sentences, not exact match sentences.
Consider a few variants that should all count as the same sentence:

"In the eighteenth century it was often convenient to regard man as a clockwork automaton."

"in the eighteenth century it was often convenient to regard man as a clockwork automaton"

"In the eighteenth century, it was often convenient to regard man as a clockwork automaton."

To perform fuzzy sentence matching, we need to move from matching exact strings to more flexible, natural-language-motivated definitions of equality. Examples include: exact case-insensitive token matching after stopword removal, exact case-insensitive stem matching after stopword removal, exact case-insensitive lemma matching after stopword removal, and partial set similarity approaches.

Our first improvement is case-insensitive token matching after stopword removal. This means ignoring case, treating the sentence as a sequence of tokens, and ignoring stopwords (high-frequency, low-content words like "is", "or", "the"). After processing our example sentence, we get: ['eighteenth', 'century', 'often', 'convenient', 'regard', 'man', 'clockwork', 'automaton']. This blurs whitespace, punctuation, case, and low-content words.
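As a sketch, this normalization step can be written in a few lines of pure Python. The stopword list below is a small illustrative subset chosen for this example (real pipelines, e.g. NLTK's English list, contain well over a hundred words), and the regex tokenizer is a deliberate simplification:

```python
import re

# Illustrative stopword subset (an assumption for this sketch); real
# stopword lists are much longer.
STOPWORDS = {"in", "the", "it", "was", "to", "as", "a", "is", "or"}

def normalize(sentence):
    """Lowercase, tokenize on alphabetic runs, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

a = normalize("In the eighteenth century it was often convenient "
              "to regard man as a clockwork automaton.")
b = normalize("in   the eighteenth century, it was often convenient "
              "to regard man as a clockwork automaton")

print(a)       # ['eighteenth', 'century', 'often', 'convenient',
               #  'regard', 'man', 'clockwork', 'automaton']
print(a == b)  # True: case, whitespace, and punctuation are blurred away
```

Because both variants reduce to the same token list, a simple equality check now treats them as matching.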
The next approach is stemming — the process of "uninflecting" or "reducing" words to their stem. In English, common examples include "cats" → "cat" and "printing" → "print". Stemming is typically implemented using preset rules that may not handle irregular words. The Porter and Snowball stemmers, for example, fail on irregular nouns like "children" and "women".
The third approach uses lemmatization, which takes a more complex but comprehensive approach. The WordNet lemmatizer handles irregular forms correctly: "children" → "child", "women" → "woman". One important source of complexity is that lemmatization relies on part-of-speech tagging — "printing" as a noun is not inflected, but "printing" as a verb should reduce to "print".
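A lemmatizer can be sketched as an exception table plus a fallback rule, dispatched on part of speech. The tables below are tiny illustrative stand-ins (real lemmatizers like WordNet's combine large exception lists with morphological rules):

```python
# Illustrative noun exception table (an assumption for this sketch);
# WordNet's actual exception lists are far larger.
NOUN_EXCEPTIONS = {"children": "child", "women": "woman"}

def lemmatize(word, pos="n"):
    """Dictionary lookup for irregular nouns, simple rules otherwise."""
    if pos == "n":
        if word in NOUN_EXCEPTIONS:
            return NOUN_EXCEPTIONS[word]
        return word[:-1] if word.endswith("s") else word
    if pos == "v" and word.endswith("ing"):
        return word[:-3]
    return word

print(lemmatize("children"))       # 'child'
print(lemmatize("women"))          # 'woman'
# Part of speech changes the answer:
print(lemmatize("printing", "n"))  # 'printing' (the noun is not inflected)
print(lemmatize("printing", "v"))  # 'print'
```

The last two calls make the POS-tagging dependency concrete: the same surface form "printing" lemmatizes differently as a noun and as a verb.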
Once we've applied lemmatization, we can handle cases like the Greek plural "automata" being correctly matched to "automaton". At this point, we've learned a lot about tokenization, stopword removal, stemming, and lemmatization, but we've only matched about half of the sentences that we would characterize as similar. The next post covers partial matches using Jaccard set similarity.
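The full pipeline — lowercasing, tokenizing, stopword removal, and lemma lookup — can be sketched end to end. The stopword set and lemma table are small illustrative assumptions, but they are enough to match "automata" against "automaton":

```python
import re

# Illustrative stopword subset and lemma table (assumptions for this sketch).
STOPWORDS = {"in", "the", "it", "was", "to", "as", "a", "is", "or"}
LEMMAS = {"automata": "automaton", "children": "child", "women": "woman"}

def match(s1, s2):
    """Case-insensitive lemma matching after stopword removal."""
    def pipeline(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]
    return pipeline(s1) == pipeline(s2)

print(match(
    "In the eighteenth century it was often convenient "
    "to regard man as a clockwork automaton.",
    "in the eighteenth century, it was often convenient "
    "to regard man as a clockwork automata",
))  # True
```

Both sentences reduce to the same lemma sequence, so the singular and the Greek plural now match.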