Let me be clear”, I’m going to show you how to build massive n-gram frequency dictionaries efficently, query them efficiently, and save them to binary files efficiently using the R package cmscu.

C.M.S.C.U. stands for Count-min sketch with conservative updating. My interest is in applying this technique to language data. There are lots of interesting underlying details (I’ll show you some below), but let’s just see it in action. Sound cool?

We developed this package as a replacement to the DocumentTermMatrix (dtm) function in R‘s tm package for \(n\)-grams. \(n\)-grams are a simple and elegant way to model the frequency of language data by counting the instances of words’ occurences within a corpus. Even simple models like this can be computationally expensive when trying to model larger \(n\)-grams (e.g., \(3\)- and \(4\)-grams) and corpora - such as Amazon reviews, Yelp reviews, Twitter posts and H. R. Clinton’s emails. We don’t need all the funcionality that comes with the dtm when we only want \(n\)-grams. In fact, using the dtm with huge datasets over larger \(n\)-grams is impossible. Sorry. It can’t be done. We solves some computational bottlenecks when we use cmscu (detailed below).

If you want to learn more about how this work can be used to implement sophisticated smoothing alogrithms check out our manuscript:

Vinson, D. W., Davis, J. K., Sindi, S. S., & Dale, R. (2016). Efficient n-gram analysis in R with cmscu. Behavior Research Methods. 48(3), 909-922. doi: 10.3758/s13428-016-0766-5

Install dependencies

There are a few dependencies required to install cmscu depending on your operating system. cmscu was built under R 3.1, so first you’ll need to install the latest version of R, here. cmscu’s implementation was written in C++. To link this code to R requires a C++ development environment. This is super easy but different for different operating systems.

After you’re downloaded and installed the above, restart R. cmscu was recently submitted to CRAN, but for now you can download it here. Once downloaded, install it by following the instruction below!

Install & Load cmscu

#first, install Rcpp. This connects C++ code to your R environment 
install.packages('Rcpp',repos = "http://cran.stat.ucla.edu/")
#set your working directory 
#make sure you're in the right folder
#install cmscu
install.packages('cmscu.tar.gz', repos = NULL, type = "src")

Build Frequency Dictionaries

# unigrams
yelp_1g <- new(FrequencyDictionary,4,2^26)
# bigrams
yelp_2g <- new(FrequencyDictionary,4,2^26)
# trigrams
yelp_3g <- new(FrequencyDictionary,4,2^26)
# 4-grams
yelp_4g <- new(FrequencyDictionary,4,2^26)

CMSCU solves three computationally taxing problems:

  1. It saves words as integers, not strings new(FrequencyDictionary, …, …)

  2. It limits dictionary size to fit within the constraints of your machine’s memory new(…, …, 2^26)
    • You won’t accidently find yourself swapping hardisk space for memory (As a result your code will never finish running)
  3. It solves the pigeonhole problem new(…, 4, …)

    • The Pigeonhole problem occurs when you store more entries than the size of your dictionary. When you do this, collisions occur. If “pizza” is saved to the same integer key as “funky”, when you query either “funky” or “pizza” the number of counts associated with either will equal both words combined. However, this problem is solved by a simple princple: If a collision on one dictionary is rare, then the SAME collision (funky/pizza) occuring in an entirely indpendent dictionrary is extremely rare. So we build out 4 dictionaries (or hash tables) for EACH n dictionary.

How to Determine Dictionary Size

This gets messy and it’s hard provide a true estimate without filling our dictionaries first. Here’s what we can know:

We can do some simple math to determine the size each dictionary (\(width\)) should be. \(1gb = 2^{30} bytes\). With \(4\) bytes per entry that means we have \(2^{28}\) bytes to play with. Now, we’re building out \(4\) (\(hash\)) dictionaries per each \(n\)-gram (this increases our confidence in our uncertainty…more on this later). This means we really only have \(2^{26}\) bytes to allocate to each size \(n\). So with \(1\) gigabyte of memory for each \(n\)-gram dictionary, our dictionary size (\(width\)) should not exceed \(2^{26}\) slots (e.g., new(FrequencyDictionary, 4, 2^26)):

\[\frac{{hash} * {width}}{2^{28}} = \frac{{4} * {2^{26}}}{2^{28}}= 1Gigabyte\]

Why You Should Care

This gives you an estimate of the size of the dictionary you can build efficiently on your machine. The uncertainty in estimated counts that occur due to collisions provided a large enough corpus is essentially what you will need to be comfortable with given your machine. The more memory you have, the bigger your dictionary can be and the more certain you can be in the accuracy of your estimated counts.

CMSCU in Action

I use Yelp reviews because I am interested in uncovering how different cognitive and social factors - such as the reviewer’s exprience (positive/negative) or their social network - influence how they write reviews. (You can download them now using the link above)

Let’s start by defining a few functions:


# a text cleaning function
clean <- function(line) {
  # upper-case everything
  str <- toupper(line);
  # strip-out small html tags
  str <- gsub('<[^>]{1,2}>', '', str);
  # replace all terminal punctuation with a period
  str <- gsub('[[:space:]]*[.?!:;]+[[:space:]]*', '.', str);
  # get rid of anything not A-Z, ', ., or whitespace
  str <- gsub('[^A-Z\'.[:space:]]', ' ', str);
  # crowd apostrophes
  # str <- gsub("[[:space:]]+([A-Z]*'[A-Z]*)", "\\1", str);
  # collapse whitespace
  str <- gsub('[[:space:]]+', ' ', str);
  # make sure contraction's are "tight"
  str <- gsub(" ?' ?", "'", str);
  # make sure terminal . are tight
  str <- gsub(' ?\\. ?', '.', str);

Oh we’re also going to need this-

# this function lets us create n-grams from a list
ngrams <- function(lst, n) {
    len <- length(lst);
    sapply(1:(len-n+1), function(i) do.call(paste, as.list(lst[i:(i+n-1)])))

Oh and you’ll need this library too!

# you'll need this to parse yelp reviews
install.packages("rjson", repos = "http://cran.stat.ucla.edu/")

The main loop

Loop through reviews and populate dictionaries (a.k.a. train the model). (I also provide comments for how to test your code without looping through the entire dataset.)

Note: I’m using a smaller dataset that can be found on my github page here. If you download it, be sure to unzip it before running the code below.

# connect to the file, but don't load the contents! 
Yelpfile <- file('yelp_academic_dataset_review.json', 'r', FALSE); 
# I'm using a smaller dataset than ^^ this. I cut the dataset down to roughly 10,000 reviews
# i <- 0 #create an index to break the loop early
repeat {
  # select the number of reviews to read at a time. 500 = ~550kb. 
  reviews <- readLines(con=Yelpfile, n=500);
  # Break loop when you reach the end of the file
  # if (i>1000){ #only loop through 1000 reviews for testing your loop
  if (length(reviews) == 0) { #comment out if you only want to test loop on first 1000 reviews
      # disconnect the file link
      # break the loop
  # read a single review 
  for (review in reviews){
    # parse the current review
    currev <- fromJSON(review)
    # clean the current review
    text <- clean(currev$text)
    # split reviews into sentences
    for (sentence in unlist(strsplit(text, '\\.'))) {
      # split to unigrams    
      unilist <- unlist(strsplit(sentence, ' ')) 
      # store unigrams
      # add beginning and end of sentence tags to unigrams, and subsequent n-grams 
      # (crucial for smoothing ngrams in test phase)
      bilist <- c("<BOS>",unilist,'<EOS>')
      # store bigrams, use the "ngrams" function bind unigrams together
      # store trigrams    
      trilist <- c("<BOS>","<BOS>",unilist,'<EOS>')
      # store quadgrams
      qualist <- c("<BOS>","<BOS>","<BOS>",unilist,'<EOS>')
  # cat('\r', paste('Trained', i, 'lines from yelp.')); #this will track your progress through your dataset!

As the size of the dataset increases storage increases exponentially when using dtm. With this implemenation of cmscu our storage scales linearly. Below is a Log-log plot of the calculation time of tm relative to the calculation time of cmscu, averaged over ten runs, for increasingly large data sets. Due to the nonlinear scaling of tm, our implementation becomes increasingly faster relative to tm as the dataset increases in size. The memory requirements of tm prevents the comparison of larger data sets.

Querying and Confidence and Diagnostics

Now that we have frequency dictionaries we can query them. We can also check how much we might be over estimating our counts by (e.g., uncertainty) and the probabiility that we’re not over-estimating by more than our uncertainty (e.g., confidence). We can also determine how densely populated the \(hash\) dictionaries are, as well as the total number of entries we’ve added, and finally the number of unique entries (important measures when determining something like the information density of a review given its star rating).

# query the unigram dictionary
## [1] 1190 1090    0
# query the bigram dictionary
yelp_2g$query(c("<BOS> I","THE PIZZA","SHE JUXTAPOSED"))
## [1] 15096   232     0
# The number of counts we might be over-estimating by
## [1] 0.05640735
# probability that we are not over-estimating by more than our 'uncertainty'
## [1] 0.9816844
#how full the your hash tables are. zero means there's no risk in collision
## [1] 0.001744136
# number of total n-grams in the corpus 
## [1] 1392583
# number of unique n-grams in the corpus
## [1] 29292

These diagnostics provide a measure of how well you’ve captured the fequency dictionary of your corpus while maintaining efficiency. Depending on the size of your corpus and the power of your machine, uncertainty (determined by your dictionary’s \(width\)) and confidence in that uncertainty (number of \(hashes\)) will change.

Again unless you know the size of your corpus a priori (e.g., the number of unique n-grams) it will be impossible to determine the size your dictionary should be to confident that you have zero uncertainty (over estimation) in the counts you’ve estimated for each n-gram.

Save Frequency Dictionaries to .bin

Finally, we can now save the dictionaries to binary files. The size of your binary file will always equal the size of the table x hash functions you’ve specified above. In our case ~\(1\) Gigabyte. That may seem big for some \(n\)-grams and not big enough for others, but as we now know, your computer can handle the sizes that you specify. In addition, what good is a \(n\)-gram corpus you can’t query due to memory overload?

Note: Keep your cleaning functions the same so you don’t accidently query your table and come up empty handed. For instance if you STORE in UPPERCASE but query in lowercase, you’re gonna have a bad time.

# save your dictionary to in .bin

The next step is to load it back in (which takes almost no time at all)

Load from .bin and Query!

# make sure your dictionary is the same size! 
frombinary_1g <- new(FrequencyDictionary, 4, 2^26)
# load it in
# query to test
## [1] 1190 1090    0
yelp_2g$query(c("<BOS> I","THE PIZZA","SHE JUXTAPOSED"))
## [1] 15096   232     0

Final Thoughts

Just a quick final thought - This package was developed to allow behavioral researchers access to a canonical NLP method (\(n\)-grams) to quantify larger, now freely available language datasets.. It can be used to acquire a probablistic estimate of n-gram counts relative to your own machine’s computing constraints.

Later, I will provide another tutorial where I will show you how cmscu allows you to use computationally heavy smoothing techniques, such as “Modified Kneser-Ney” for quanifying missing \(n\)-grams. I show how this can be used within an information-theoretic context, again, using millions of yelp reviews. (But if you’re antsy and want it now, email me. I’m happy to help out!)

Other Resources