Creating N-grams Using C#

Introduction

According to Wikipedia, an n-gram “is a contiguous sequence of n items from a given sequence of text or speech”.[1] In simplistic terms this means an n-gram is a collection of tokens where “n” equals the number of tokens contained within the collection.  A token within this context can basically be any portion of data divided into smaller logical pieces by the n-gram creator. One of the simplest examples to consider is a text document. A token within a text document might represent each individual word within the document as delimited by spaces with all punctuation characters removed.  However, many alternative tokenization strategies could be devised for the production of tokens from a given text document.  Considering the text document example, a 3-gram could represent an n-gram containing 3 word tokens and a 4-gram would contain 4 word tokens.  Once the n-gram parameter “n” and a tokenization method has been decided upon, each n-gram can be produced by starting with the first “n” tokens and creating the first n-gram.  From that point on, additional n-grams are produced by removing the first token contained within the current n-gram and concatenating one additional token to the end of the current n-gram.  This process is continued until the last “n” tokens within the provided document are reached, and the last n-gram is then created.

N-grams are used for many types of applications including computational linguistics, DNA sequencing, protein sequencing, and data compression just to name a few [1].  In order to better understand the benefits of n-grams, we can once again consider the text document example where tokens are generated using only individual words.  Using a standard “Bag of Words” approach to compare two documents for similarity, the Jaccard Index [2] for similarity would be calculated by comparing the intersection of all unique words contained within two target documents divided by the union of all unique words contained within the two target documents.  While this approach does produce a somewhat reasonable estimate of document similarity, several challenges can arise.  For instance, a document containing the sentence “SMU is the best college in Texas.”  would be scored as very similar to a document  containing the sentence “In Texas, the best college is TCU.”  While the two documents in this example are similar, they are not as semantically similar as the score would suggest.

Similarity Examples Using Jaccard

Document 1: “SMU is the best college in Texas.”

Document 2: “In Texas, the best college is TCU.”

Jaccard Similarity Using Single Words

Intersection = {the, best, college, in, texas, is} = 6

Union = {the, best, college, in, texas, is, smu, tcu} = 8

Jaccard Similarity = 6/8 = .75 or 75% using single words

Jaccard Similarity Using 3-Grams

Document 1 3-Grams

SMU is the best college in Texas

  • smu is the
  • is the best
  • the best college
  • best college in
  • college in texas

Document 2 3-Grams

In Texas, the best college is TCU.

  • in texas the
  • texas the best
  • the best college
  • best college is
  • college is tcu

Intersection = “the best college” = 1

Union = 9 total unique 3-grams

Jaccard = .11 or 11% using 3-grams

Which is Best?

As you can see from the examples above, using single words or n-grams to calculate a Jaccard Index for document similarity can produce very different results.  However, it may still be unclear to you which method would be the best choice.  This answer also depends on multiple factors.   Since n-grams typically contain multiple overlapping tokens, a document similarity calculated using n-grams better reflects and preserves the actual order of words within a document.  This means that a collection of documents sorted by a Jaccard Similarity calculated using n-grams would typically be more semantically accurate than one using only single words.  However, there is also a performance trade-off to consider as producing n-grams is more complex than producing only the unique words contained within a document.

N-gram Production Using C#

The following program written in C# demonstrates a simple method of producing n-grams from a string of text.  The program “yield returns” each n-gram produced as part of an IEnumerable<string> collection.  The yield return statement stops processing and releases each n-gram as they are produced in a pseudo-parallel fashion which allows downstream processing of n-grams to continue prior to all n-grams being produced.  A FIFO(first in first out) queue is used to keep track of each token’s length within the current n-gram.  Tokens are continually identified and added to the current n-gram buffer.  When the current n-gram buffer’s token count exceeds the specified n-gram size, a new n-gram is created and the first token within the current n-gram buffer is removed using the correct length provided by token length queue.

Many heuristics could be developed for generating tokens from text.  In the example below, I have devised a simple set of rules for determining when individual characters belong to a specific “word” token.  Space characters are used to delimit or separate each word.  Duplicate spaces are simply skipped.  Letters or digits always belong to the current word token, and other characters belong to the current word token when they are surrounded by valid letters or digits.  This does not cover all possible valid word tokens that might occur within various types of documents.  However, additional heuristics can be included using the program below as a starting point.  I have found that the n-grams produced by this program were very suitable for my own purposes.

nGrams

Performance

I found that using a simple for() loop when producing n-grams in C# provided two primary benefits.  First, individual characters within strings in the C# language can be accessed as individual members of a character array representing all characters within the entire string.  This allowed me to efficiently be able to look both forwards and backwards within the text for the previous and next characters using a simple and very efficient statement such as “char before = text[i-1];”.  Second, using other methods of separating words such as the split() function or a regular expression proved to be much slower while providing less heuristics flexibility than the for() loop in my own experiences.

The for() loop could also be written to execute in parallel with minor modifications, but that is a topic for another blog!

References

  1. N-gram. (2013, March 31). In Wikipedia, The Free Encyclopedia. Retrieved 18:52, April 22, 2013, from http://en.wikipedia.org/w/index.php?title=N-gram&oldid=548024834
  2. Jaccard index. (2013, April 9). In Wikipedia, The Free Encyclopedia. Retrieved 19:14, April 22, 2013, from http://en.wikipedia.org/w/index.php?title=Jaccard_index&oldid=549575590

Links

Advertisements

8 thoughts on “Creating N-grams Using C#

  1. Hey Jake, very usefull piece of code you wrote. Thanks for sharing. I did encounter a small error when trying to run it. the last part in the code is throwing an error.

    if (wordCount >= nGramSize)
    {
    yield return nGram.ToString();
    nGram.Remove(0, wordLengths.Dequeue() + 1);
    wordCount -= 1;
    }

    if found that :

    if (wordCount >= nGramSize)
    {
    yield return nGram.ToString();
    var dqLength = (int)wordLengths.Dequeue() + 1;
    nGram.Remove(0, dqLength);
    wordCount -= 1;
    }

    works for me.

    • Sorry, the Queue declaration needs to be as follows:

      Queue wordLengths = new Queue();

      for the following line to work without an error:

      nGram.Remove(0, wordLengths.Dequeue() + 1);

      You can also add the following at the end of the program to return the last nGram:

      nGram.Append(text.Last());
      yield return nGram.ToString();

      I will update the code example as well.

      Thanks!

      • The n-grams are typically not displayed as results, rather they are used to measure similarity between two sets or collections (when frequency is also used). For instance, you could use n-grams to determine which document to recommend to someone based on their search phrase.

      • Character n-grams are commonly used to predict language in text. (i.e. Is this document in French or English). I plan on writing an article on this soon.

  2. nGramSize = 1 gives the following error:
    Index was out of range. Must be non-negative and less than the size of the collection.
    Parameter name: index

    • This is not a tokenizer. It is used for creating n-grams of two or more words. There is a lot of extra code you don’t need for extracting words from text.

      If you insist on using this approach for tokenizing text, just comment out the following line:

      //nGram.Remove(0, wordLengths.Dequeue() + 1);

      Then replace it with this:

      //Use this code to support unigrams
      if (wordCount > 1)
      nGram.Remove(0, wordLengths.Dequeue() + 1);
      else
      nGram.Clear();

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s