<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Data Science Notes</title>
 <link href="http://dsnotes.com/" rel="self"/>
 <updated>2016-01-09T21:44:43+00:00</updated>
 <id>http://dsnotes.com</id>
 <author>
   <name>Dmitriy Selivanov</name>
   <email>selivanov.dmitriy@gmail.com</email>
 </author>

 
 <entry>
   <title>text2vec implementation details. Writing fast parallel asynchronous SGD/AdaGrad.</title>
   <link href="http://dsnotes.com/blog/text2vec/2016/01/09/fast-parallel-async-adagrad"/>
   <updated>2016-01-09T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/text2vec/2016/01/09/fast-parallel-async-adagrad</id>
   <content type="html">&lt;p&gt;Before reading this post, I very recommend to read:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The original &lt;a href=&quot;http://www-nlp.stanford.edu/projects/glove/glove.pdf&quot;&gt;GloVe paper&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.foldl.me/2014/glove-python/&quot;&gt;Jon Gauthier’s post&lt;/a&gt;, which provides a detailed explanation of a Python implementation. This post helped me a lot with the C++ implementation.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;word-embedding&quot;&gt;Word embedding&lt;/h1&gt;

&lt;p&gt;After Tomas Mikolov et al. released the &lt;a href=&quot;https://code.google.com/p/word2vec/&quot;&gt;word2vec&lt;/a&gt; tool, there was a boom of articles about word vector representations. One of the greatest is &lt;a href=&quot;http://nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt;, which made a big contribution by explaining how such algorithms work and by reformulating the word2vec optimization as a special kind of factorization of the word cooccurrence matrix.&lt;/p&gt;

&lt;p&gt;This post consists of two main parts:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A very brief introduction to the GloVe algorithm.&lt;/li&gt;
  &lt;li&gt;Implementation details. I will show how to write fast parallel asynchronous SGD with an adaptive learning rate in C++ using Intel TBB and &lt;a href=&quot;http://rcppcore.github.io/RcppParallel/&quot;&gt;RcppParallel&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;introduction-to-glove-algorithm&quot;&gt;Introduction to GloVe algorithm&lt;/h1&gt;

&lt;p&gt;The GloVe algorithm consists of the following steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Collect word cooccurrence statistics in the form of a word cooccurrence matrix &lt;script type=&quot;math/tex&quot;&gt;X&lt;/script&gt;. Each element &lt;script type=&quot;math/tex&quot;&gt;X_{ij}&lt;/script&gt; of such a matrix represents how often &lt;em&gt;word i&lt;/em&gt; appears in the context of &lt;em&gt;word j&lt;/em&gt;. Usually we scan our corpus in the following manner: for each term we look for context terms within some area - &lt;em&gt;window_size&lt;/em&gt; before and &lt;em&gt;window_size&lt;/em&gt; after. Also, we give less weight to more distant words (usually &lt;script type=&quot;math/tex&quot;&gt;decay = 1/offset&lt;/script&gt;).&lt;/li&gt;
  &lt;li&gt;Define a soft constraint for each word pair: &lt;script type=&quot;math/tex&quot;&gt;w_i^Tw_j + b_i + b_j = \log X_{ij}&lt;/script&gt;. Here &lt;script type=&quot;math/tex&quot;&gt;w_i&lt;/script&gt; is the vector for the main word, &lt;script type=&quot;math/tex&quot;&gt;w_j&lt;/script&gt; is the vector for the context word, and &lt;script type=&quot;math/tex&quot;&gt;b_i&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;b_j&lt;/script&gt; are scalar biases for the main and context words.&lt;/li&gt;
  &lt;li&gt;Define the cost function &lt;script type=&quot;math/tex&quot;&gt;J = \sum_{i=1}^V \sum_{j=1}^V \; f(X_{ij}) ( w_i^T w_j + b_i + b_j - \log X_{ij})^2&lt;/script&gt;. Here &lt;script type=&quot;math/tex&quot;&gt;f&lt;/script&gt; is a weighting function which helps us prevent learning only from extremely common word pairs. The GloVe authors chose the following function:&lt;/li&gt;
&lt;/ol&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;% &lt;![CDATA[
f(X_{ij}) = 
\begin{cases}
(\frac{X_{ij}}{x_{max}})^\alpha &amp; \text{if } X_{ij} &lt; x_{max} \\
1 &amp; \text{otherwise}
\end{cases} %]]&gt;&lt;/script&gt;
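&lt;p&gt;To make this concrete, here is a tiny Python sketch (illustrative only, not the text2vec code; the names &lt;code&gt;f_weight&lt;/code&gt; and &lt;code&gt;glove_cost&lt;/code&gt; are made up, and the defaults follow the paper’s suggested values) that evaluates the weighting function and the cost for a cooccurrence matrix in triplet form:&lt;/p&gt;

```python
import math

def f_weight(x, x_max=100.0, alpha=0.75):
    # (x / x_max)^alpha while x stays below x_max, capped at 1 afterwards
    return min(x / x_max, 1.0) ** alpha

def glove_cost(triplets, w_main, w_ctx, b_main, b_ctx):
    # triplets: iterable of (i, j, X_ij); vectors and biases are plain lists
    J = 0.0
    for i, j, x in triplets:
        dot = sum(a * b for a, b in zip(w_main[i], w_ctx[j]))
        err = dot + b_main[i] + b_ctx[j] - math.log(x)
        J += f_weight(x) * err * err
    return J
```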

&lt;h1 id=&quot;implementation&quot;&gt;Implementation&lt;/h1&gt;
&lt;p&gt;The main challenges I faced during the implementation were:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Efficient cooccurrence matrix creation.&lt;/li&gt;
  &lt;li&gt;Implementation of an efficient SGD for minimizing the cost function.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;cooccurence-matrix-creation&quot;&gt;Cooccurrence matrix creation&lt;/h2&gt;
&lt;p&gt;There are two main requirements for the term cooccurrence matrix (&lt;em&gt;tcm&lt;/em&gt;):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;em&gt;tcm&lt;/em&gt; should be sparse. We should be able to construct a &lt;em&gt;tcm&lt;/em&gt; for large vocabularies ( &amp;gt; 100k words).&lt;/li&gt;
  &lt;li&gt;Fast lookups/inserts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To meet the sparsity requirement we need to store the data in an associative array. &lt;code&gt;unordered_map&lt;/code&gt; is a good candidate because of its &lt;script type=&quot;math/tex&quot;&gt;O(1)&lt;/script&gt; average lookup/insert complexity. I ended up with &lt;code&gt;std::unordered_map&amp;lt; std::pair&amp;lt;uint32_t, uint32_t&amp;gt;, T &amp;gt;&lt;/code&gt; as the container for the sparse matrix in triplet form. The performance of &lt;code&gt;unordered_map&lt;/code&gt; heavily depends on the underlying hash function. Fortunately, we can pack a &lt;code&gt;pair&amp;lt;uint32_t, uint32_t&amp;gt;&lt;/code&gt; into a single &lt;code&gt;uint64_t&lt;/code&gt; in a deterministic way without any collisions.&lt;br /&gt;
A hash function for &lt;code&gt;std::pair&amp;lt;uint32_t, uint32_t&amp;gt;&lt;/code&gt; then looks like:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;k&quot;&gt;namespace&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;template&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hash&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;operator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;second&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For details see &lt;a href=&quot;http://stackoverflow.com/a/24693169/1069256&quot;&gt;this&lt;/a&gt; and &lt;a href=&quot;http://stackoverflow.com/questions/2768890&quot;&gt;this&lt;/a&gt; Stack Overflow question.&lt;/p&gt;
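&lt;p&gt;The packing trick itself is easy to check in any language. Here is a quick Python sketch (names are illustrative; the arithmetic form &lt;code&gt;i * 2^32 + j&lt;/code&gt; is equivalent to the shift-and-or in the C++ code above):&lt;/p&gt;

```python
TWO_32 = 2 ** 32  # one more than the largest uint32_t value

def pack(i, j):
    # i occupies the high 32 bits, j the low 32 bits
    return i * TWO_32 + j

def unpack(key):
    # inverse mapping: the quotient is i, the remainder is j
    return divmod(key, TWO_32)
```

&lt;p&gt;Since the mapping is a bijection between pairs of 32-bit ids and 64-bit keys, distinct pairs can never collide at this stage; any remaining collisions come only from the hash table itself.&lt;/p&gt;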

&lt;p&gt;Also note that our cooccurrence matrix is symmetric, so internally we will store only the elements above the main diagonal.&lt;/p&gt;
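&lt;p&gt;Putting the pieces of step 1 together, here is a toy Python sketch of cooccurrence counting (illustrative only, not the text2vec code; &lt;code&gt;build_tcm&lt;/code&gt; and &lt;code&gt;window_size&lt;/code&gt; are made-up names) with 1/offset decay that stores each unordered pair exactly once:&lt;/p&gt;

```python
from collections import defaultdict

def build_tcm(token_ids, window_size=5):
    # token_ids: a document as a list of integer word ids
    tcm = defaultdict(float)
    n = len(token_ids)
    for pos in range(n):
        # scan only forward: this visits each unordered pair exactly once
        for offset in range(1, window_size + 1):
            ctx = pos + offset
            if ctx == n:
                break
            a, b = token_ids[pos], token_ids[ctx]
            # canonical order (smaller id first) keeps the upper triangle
            key = (min(a, b), max(a, b))
            tcm[key] += 1.0 / offset
    return dict(tcm)
```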

&lt;h2 id=&quot;stochastic-gradient-descent&quot;&gt;Stochastic gradient descent&lt;/h2&gt;

&lt;p&gt;Now we should implement efficient parallel asynchronous stochastic gradient descent for the word cooccurrence matrix factorization proposed in the &lt;a href=&quot;http://nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt; paper. Interestingly, SGD is an inherently serial algorithm, but when your problem is sparse, you can perform asynchronous updates without any locks and achieve a speedup proportional to the number of cores on your machine! If you haven’t read &lt;a href=&quot;https://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf&quot;&gt;HOGWILD!&lt;/a&gt;, I recommend doing so.&lt;/p&gt;

&lt;p&gt;Let me recall the formulation of SGD. We try to move the parameters &lt;script type=&quot;math/tex&quot;&gt;x_t&lt;/script&gt; in a minimizing direction, given by &lt;script type=&quot;math/tex&quot;&gt;−g_t&lt;/script&gt;, with a learning rate &lt;script type=&quot;math/tex&quot;&gt;\alpha&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;x_{t+1} = x_t − \alpha g_t&lt;/script&gt;

&lt;p&gt;So, we have to calculate gradients for our cost function:&lt;/p&gt;

&lt;p&gt;&lt;script type=&quot;math/tex&quot;&gt;J = \sum_{i=1}^V \sum_{j=1}^V f(X_{ij}) ( w_i^T w_j + b_i + b_j - \log X_{ij} )^2&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\frac{\partial J}{\partial w_i} = f(X_{ij}) w_j ( w_i^T w_j + b_i + b_j - \log X_{ij})&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\frac{\partial J}{\partial w_j} = f(X_{ij}) w_i ( w_i^T w_j + b_i + b_j - \log X_{ij})&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\frac{\partial J}{\partial b_i} = f(X_{ij}) (w_i^T w_j + b_i + b_j - \log X_{ij})&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\frac{\partial J}{\partial b_j} = f(X_{ij}) (w_i^T w_j + b_i + b_j - \log X_{ij})&lt;/script&gt;
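&lt;p&gt;These gradients drop a constant factor of 2 from differentiating the square; it is simply absorbed into the learning rate. A quick way to gain confidence in such derivations is a finite-difference check. Here is an illustrative Python sketch for the bias gradient (all names are made up, and the factor 2 is kept explicit so the numeric and analytic values match exactly):&lt;/p&gt;

```python
import math

def cost_one_pair(w_i, w_j, b_i, b_j, x, f_x):
    # contribution of a single (i, j) pair to J
    dot = sum(a * b for a, b in zip(w_i, w_j))
    err = dot + b_i + b_j - math.log(x)
    return f_x * err * err

def analytic_grad_b_i(w_i, w_j, b_i, b_j, x, f_x):
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return 2.0 * f_x * (dot + b_i + b_j - math.log(x))

def numeric_grad_b_i(w_i, w_j, b_i, b_j, x, f_x, eps=1e-6):
    # central finite difference with respect to b_i
    up = cost_one_pair(w_i, w_j, b_i + eps, b_j, x, f_x)
    down = cost_one_pair(w_i, w_j, b_i - eps, b_j, x, f_x)
    return (up - down) / (2.0 * eps)
```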

&lt;h2 id=&quot;adagrad&quot;&gt;AdaGrad&lt;/h2&gt;

&lt;p&gt;We will use a modification of SGD - the &lt;a href=&quot;http://www.magicbroom.info/Papers/DuchiHaSi10.pdf&quot;&gt;AdaGrad&lt;/a&gt; algorithm. It automatically determines a per-feature learning rate by tracking historical gradients, so that frequently occurring features
in the gradients get small learning rates and infrequent features get higher ones. For AdaGrad implementation details see the excellent &lt;a href=&quot;http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf&quot;&gt;Notes on AdaGrad&lt;/a&gt; by Chris Dyer.&lt;/p&gt;

&lt;p&gt;The formulation of AdaGrad for step &lt;script type=&quot;math/tex&quot;&gt;t&lt;/script&gt; and feature &lt;script type=&quot;math/tex&quot;&gt;i&lt;/script&gt; is the following:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;x_{t+1, i} = x_{t, i} − \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t-1} g_{\tau,i}^2}} g_{t,i}&lt;/script&gt;

&lt;p&gt;As we can see, at each iteration &lt;script type=&quot;math/tex&quot;&gt;t&lt;/script&gt; we need to keep track of the sum of squares of all historical gradients.&lt;/p&gt;
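&lt;p&gt;As a sanity check of the update rule, here is a toy Python sketch of AdaGrad on a one-dimensional quadratic (illustrative only; &lt;code&gt;adagrad_minimize&lt;/code&gt; and its defaults are assumptions, and &lt;code&gt;eps&lt;/code&gt; guards against division by zero on the first step):&lt;/p&gt;

```python
import math

def adagrad_minimize(grad, x0, learning_rate=0.5, steps=500, eps=1e-8):
    x = x0
    g_sq_sum = 0.0  # running sum of squared gradients for this feature
    for _ in range(steps):
        g = grad(x)
        g_sq_sum += g * g
        # per-feature step: the learning rate shrinks as gradients accumulate
        x -= learning_rate * g / (math.sqrt(g_sq_sum) + eps)
    return x
```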

&lt;h2 id=&quot;parallel-asynchronous-adagrad&quot;&gt;Parallel asynchronous AdaGrad&lt;/h2&gt;

&lt;p&gt;Actually, we will use a modification of AdaGrad - &lt;em&gt;HOGWILD-style&lt;/em&gt; asynchronous AdaGrad :-) The main idea of the &lt;em&gt;HOGWILD!&lt;/em&gt; algorithm is very simple - don’t use any synchronization. If your problem is sparse, allow the threads to overwrite each other! This works, and works well. Again, see the &lt;a href=&quot;http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf&quot;&gt;HOGWILD!&lt;/a&gt; paper for details and a theoretical proof.&lt;/p&gt;

&lt;h2 id=&quot;code&quot;&gt;Code&lt;/h2&gt;

&lt;p&gt;Now let’s put it all into code.&lt;/p&gt;

&lt;p&gt;As seen from the analysis above, the &lt;code&gt;GloveFit&lt;/code&gt; class should contain the following parameters:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;word vectors &lt;code&gt;w_i&lt;/code&gt;, &lt;code&gt;w_j&lt;/code&gt; (for main and context words).&lt;/li&gt;
  &lt;li&gt;biases &lt;code&gt;b_i&lt;/code&gt;, &lt;code&gt;b_j&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;word vectors square gradients &lt;code&gt;grad_sq_w_i&lt;/code&gt;, &lt;code&gt;grad_sq_w_j&lt;/code&gt; for adaptive learning rates.&lt;/li&gt;
  &lt;li&gt;word biases square gradients &lt;code&gt;grad_sq_b_i&lt;/code&gt;, &lt;code&gt;grad_sq_b_j&lt;/code&gt; for adaptive learning rates.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;learning_rate&lt;/code&gt;, &lt;code&gt;max_cost&lt;/code&gt; and other scalar model parameters.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;private&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word_vec_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// see https://github.com/maciejkula/glove-python/pull/9#issuecomment-68058795&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// clips the cost for numerical stability&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// initial learning rate&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// word vectors&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w_j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// word biases&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b_j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// word vectors square gradients&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_sq_w_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_sq_w_j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// word biases square gradients&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_sq_b_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_sq_b_j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id=&quot;single-iteration&quot;&gt;Single iteration&lt;/h3&gt;

&lt;p&gt;Now we need to &lt;a href=&quot;https://github.com/dselivanov/text2vec/blob/master/src/GloveFit.h#L8-L41&quot;&gt;initialize&lt;/a&gt; the parameters and perform an iteration of SGD:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;//init cost
&lt;span class=&quot;nv&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 0
// assume tcm is sparse matrix in triplet form - &amp;lt;i, j, x&amp;gt;
for_each &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&amp;lt;i, j, x&amp;gt; &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  //compute cost &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt; and add it to global cost
  global_cost +&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; J&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  //Compute gradients &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; bias terms and perform adaptive updates &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; bias terms
  //Compute gradients &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; word vector terms and perform adaptive updates &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; word vectors
  //Update squared gradient sums &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; word vectors
  //Update squared gradient sums &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; biases
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; global_cost&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For actual text2vec code (with a few tricks) check &lt;a href=&quot;https://github.com/dselivanov/text2vec/blob/master/src/GloveFit.h#L52-L134&quot;&gt;this loop&lt;/a&gt;.&lt;/p&gt;
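&lt;p&gt;The pseudocode above can also be sketched in Python. This illustrative toy (not the text2vec code; all names and defaults are assumptions) runs one AdaGrad pass over a tcm in triplet form, updating only the bias terms to keep it short. The squared-gradient accumulators are assumed to be initialized to 1 to avoid division by zero:&lt;/p&gt;

```python
import math

def one_epoch(triplets, b_main, b_ctx, g_sq_b_main, g_sq_b_ctx,
              learning_rate=0.05, x_max=10.0, alpha=0.75):
    global_cost = 0.0
    for i, j, x in triplets:
        f = min(x / x_max, 1.0) ** alpha
        # word vectors omitted: the toy model here is b_i + b_j ~ log(X_ij)
        err = b_main[i] + b_ctx[j] - math.log(x)
        # compute the cost and add it to the global cost
        global_cost += f * err * err
        grad = 2.0 * f * err
        # adaptive updates for the bias terms
        b_main[i] -= learning_rate * grad / math.sqrt(g_sq_b_main[i])
        b_ctx[j] -= learning_rate * grad / math.sqrt(g_sq_b_ctx[j])
        # update the squared gradient sums
        g_sq_b_main[i] += grad * grad
        g_sq_b_ctx[j] += grad * grad
    return global_cost
```

&lt;p&gt;Repeated calls drive the returned cost down, mirroring what the real per-iteration loop does for the full set of parameters.&lt;/p&gt;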

&lt;h3 id=&quot;openmp&quot;&gt;OpenMP&lt;/h3&gt;

&lt;p&gt;As discussed above, all these steps can be performed in a parallel loop (over all non-zero word cooccurrence scores). This can be easily done via OpenMP &lt;code&gt;parallel for&lt;/code&gt; and a reduction: &lt;code&gt;#pragma omp parallel for reduction(+:global_cost)&lt;/code&gt;. &lt;strong&gt;But there is one significant issue&lt;/strong&gt; with this approach - it is very hard to make a portable R package with OpenMP support. By default it will work only on Linux distributions, because:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;the default &lt;code&gt;clang&lt;/code&gt; on OS X doesn’t support OpenMP (of course you can install &lt;code&gt;clang-omp&lt;/code&gt; or &lt;code&gt;gcc&lt;/code&gt; from brew, but this can also be tricky).&lt;/li&gt;
  &lt;li&gt;Rtools only began to support OpenMP on Windows in 2015, and even the modern implementation has substantial overhead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For more details see the &lt;a href=&quot;https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support&quot;&gt;OpenMP support&lt;/a&gt; section of the Writing R Extensions manual.&lt;/p&gt;

&lt;h3 id=&quot;intel-tbb&quot;&gt;Intel TBB&lt;/h3&gt;

&lt;p&gt;Luckily we have a better alternative - the &lt;a href=&quot;https://www.threadingbuildingblocks.org/&quot;&gt;Intel Thread Building Blocks&lt;/a&gt; library and the &lt;a href=&quot;http://rcppcore.github.io/RcppParallel/&quot;&gt;RcppParallel&lt;/a&gt; package, which provides the &lt;code&gt;RVector&lt;/code&gt; and &lt;code&gt;RMatrix&lt;/code&gt; wrapper classes for safe and convenient access to R data structures in a multi-threaded environment! Moreover, &lt;strong&gt;it “just works” on the main platforms - OS X, Windows, Linux&lt;/strong&gt;. I have had a very positive experience with this library; thanks to the Rcpp Core team and especially to JJ Allaire.&lt;/p&gt;

&lt;p&gt;Using TBB is a little bit trickier than writing simple OpenMP &lt;code&gt;#pragma&lt;/code&gt; directives. You should implement a &lt;em&gt;functor&lt;/em&gt; which operates on a chunk of data and call &lt;code&gt;parallelReduce&lt;/code&gt; or &lt;code&gt;parallelFor&lt;/code&gt; on the entire data collection. You can find useful (and simple) examples in the &lt;a href=&quot;http://rcppcore.github.io/RcppParallel/#examples&quot;&gt;RcppParallel examples&lt;/a&gt; section.&lt;/p&gt;

&lt;h3 id=&quot;putting-all-together&quot;&gt;Putting all together&lt;/h3&gt;

&lt;p&gt;For now, suppose we have a &lt;code&gt;partial_fit&lt;/code&gt; method in the &lt;code&gt;GloveFit&lt;/code&gt; class with the following signature (&lt;a href=&quot;https://github.com/dselivanov/text2vec/blob/master/src/GloveFit.h#L52-L134&quot;&gt;see the actual code here&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;partial_fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                    &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It takes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;em&gt;tcm&lt;/em&gt; in sparse triplet form &lt;code&gt;&amp;lt;x_irow, x_icol, x_val&amp;gt;&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;begin&lt;/code&gt; and &lt;code&gt;end&lt;/code&gt; indices of the range on which we want to perform our SGD.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It then performs SGD steps over this range - &lt;a href=&quot;#single-iteration&quot;&gt;updates word vectors, gradients, etc&lt;/a&gt;. At the end it returns the value of the accumulated cost function. Note that internally this method modifies members of the class.&lt;/p&gt;

&lt;p&gt;Also note that the signature of &lt;code&gt;partial_fit&lt;/code&gt; is very similar to what we have to implement in our TBB functor. Now we are ready to write it:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nl&quot;&gt;AdaGradIter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Worker&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter_order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word_vec_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_iters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// accumulated value&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// function to set global_cost = 0 between iterations&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;set_cost_zero&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;//init function to use between iterations&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irowR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icolR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NumericVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_valR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iter_orderR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irowR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icolR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_valR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;iter_order&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iter_orderR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// fit is a reference member, already bound in the constructor&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// dummy constructor&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// used at first initialization of GloveFitter&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NumericVector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;iter_order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{};&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// constructors&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irowR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
              &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icolR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
              &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NumericVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_valR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
              &lt;span class=&quot;n&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irowR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icolR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_valR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
    
  &lt;span class=&quot;c1&quot;&gt;// constructor called at split&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
    &lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// process just the elements of the range&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;operator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partial_fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// join my value with that of another global_cost&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, it is very similar to the example from the RcppParallel site. One difference: it has side effects. By calling &lt;code&gt;partial_fit&lt;/code&gt;, it modifies the internal state of the input instance of the &lt;code&gt;GloveFit&lt;/code&gt; class (which actually contains our GloVe model).&lt;/p&gt;
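&lt;p&gt;The split/join mechanics of &lt;code&gt;parallelReduce&lt;/code&gt; can be illustrated outside of C++. The sketch below is hypothetical code, not part of text2vec, and its &lt;code&gt;partial_fit&lt;/code&gt; is a toy stand-in: the work is split into ranges, each worker accumulates its own partial cost, and the partial costs are then joined by summation - the role played by the &lt;code&gt;join&lt;/code&gt; method above:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def partial_fit(begin, end, x_val):
    # toy stand-in for GloveFit::partial_fit: returns the cost
    # accumulated over one range of cooccurrence values
    return sum(v * v for v in x_val[begin:end])

def parallel_reduce(x_val, n_workers=4):
    n = len(x_val)
    if n == 0:
        return 0.0
    # split the index range into roughly equal chunks, one per worker
    step = (n + n_workers - 1) // n_workers
    ranges = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        costs = list(pool.map(lambda r: partial_fit(r[0], r[1], x_val), ranges))
    # the "join" step: combine per-worker partial costs
    return sum(costs)
```

&lt;p&gt;In the real implementation each worker additionally mutates the shared model state, which is what makes the SGD asynchronous rather than a pure reduction.&lt;/p&gt;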

&lt;p&gt;Now let’s write the &lt;code&gt;GloveFitter&lt;/code&gt; class, which will be callable from R via &lt;code&gt;Rcpp-modules&lt;/code&gt;. It acts as an interface for fitting our model and takes all of the input model parameters, such as the vocabulary size, the desired word vector size, the initial AdaGrad learning rate, etc. We also want to track the cost between iterations and to be able to apply an early stopping strategy between SGD iterations. For that purpose we keep our model in a C++ class, so we can modify it “in place” at each SGD iteration (which can be problematic in R).&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GloveFitter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word_vec_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;gloveFit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;word_vec_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gloveFit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// function to set cost to zero from R (used between SGD iterations)&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;set_cost_zero&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_cost_zero&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();};&lt;/span&gt;

  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit_chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NumericVector&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter_order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter_order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gloveFit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// &lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;parallelReduce&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// export word vectors to R&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;List&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_word_vectors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;word_vectors&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_word_vectors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;private&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gloveFit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
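&lt;p&gt;To make the constructor parameters concrete, here is a rough sketch of the per-pair update that the GloVe paper prescribes - hypothetical Python, not the actual &lt;code&gt;partial_fit&lt;/code&gt; code: &lt;code&gt;x_max&lt;/code&gt; and &lt;code&gt;alpha&lt;/code&gt; define the weighting function, &lt;code&gt;max_cost&lt;/code&gt; clips the error term, and AdaGrad shrinks &lt;code&gt;learning_rate&lt;/code&gt; per coordinate:&lt;/p&gt;

```python
import math

def glove_pair_update(w_i, w_j, b_i, b_j, x_ij, grad_sq_i, grad_sq_j,
                      learning_rate=0.05, x_max=100.0, alpha=0.75,
                      max_cost=10.0):
    # weighting function f(x) from the GloVe paper
    weight = min(1.0, (x_ij / x_max) ** alpha)
    # error term: w_i . w_j + b_i + b_j - log(x_ij)
    inner = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j - math.log(x_ij)
    # clip, so a single noisy pair cannot blow up the gradients
    inner = max(-max_cost, min(max_cost, inner))
    cost = weight * inner * inner
    for k in range(len(w_i)):
        g_i = weight * inner * w_j[k]
        g_j = weight * inner * w_i[k]
        grad_sq_i[k] += g_i * g_i
        grad_sq_j[k] += g_j * g_j
        # AdaGrad: per-coordinate learning rate shrinks with gradient history
        w_i[k] -= learning_rate * g_i / math.sqrt(grad_sq_i[k])
        w_j[k] -= learning_rate * g_j / math.sqrt(grad_sq_j[k])
    return cost
```

&lt;p&gt;The returned cost is what gets summed into &lt;code&gt;global_cost&lt;/code&gt; across all pairs and workers.&lt;/p&gt;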

&lt;p&gt;And create a wrapper with &lt;code&gt;Rcpp-Modules&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;n&quot;&gt;RCPP_MODULE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;class_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;GloveFitter&amp;quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;//&amp;lt;vocab_size, word_vec_size, x_max, learning_rate, grain_size, max_cost, alpha&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;constructor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;get_word_vectors&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_word_vectors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;returns word vectors&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;set_cost_zero&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_cost_zero&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;sets cost to zero&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;fit_chunk&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;process TCM data chunk&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we can use the &lt;code&gt;GloveFitter&lt;/code&gt; class from R:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; new&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; GloveFitter&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; vocabulary_size&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; word_vectors_size&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; x_max&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
            learning_rate&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; grain_size&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; max_cost&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; alpha&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
NUM_ITER &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;
&lt;span class=&quot;kr&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;kr&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;NUM_ITER&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  cost &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;fit_chunk&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;i&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;j&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; iter_order&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;cost&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;set_cost_zero&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
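&lt;p&gt;Since &lt;code&gt;fit_chunk&lt;/code&gt; returns the accumulated cost, an early stopping rule is easy to bolt onto this loop. A minimal convergence check - a hypothetical helper, shown in Python for brevity - could look like this:&lt;/p&gt;

```python
def should_stop(cost_history, tol=1e-3):
    # stop once the relative improvement over the previous epoch
    # is non-negative but no larger than tol
    if len(cost_history) >= 2:
        prev, cur = cost_history[-2], cost_history[-1]
        improvement = (prev - cur) / max(prev, 1e-12)
        return improvement >= 0 and tol >= improvement
    return False
```

&lt;p&gt;Append each epoch’s cost to a history list and break out of the loop when the helper returns true.&lt;/p&gt;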

</content>
 </entry>
 
 <entry>
   <title>Experiments on english wikipedia. GloVe and word2vec.</title>
   <link href="http://dsnotes.com/blog/text2vec/2015/12/01/glove-enwiki"/>
   <updated>2015-12-01T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/text2vec/2015/12/01/glove-enwiki</id>
   <content type="html">&lt;p&gt;Today I will start to publish series of posts about experiments on english wikipedia. As I said before, &lt;a href=&quot;https://github.com/dselivanov/text2vec&quot;&gt;text2vec&lt;/a&gt; is inspired by &lt;a href=&quot;https://github.com/piskvorky/gensim&quot;&gt;gensim&lt;/a&gt; - well designed and quite efficient python library for topic modeling and related NLP tasks. Also I found very useful Radim’s posts, where he tried to evaluate some algorithms on &lt;a href=&quot;http://dumps.wikimedia.org/enwiki/&quot;&gt;english wikipedia dump&lt;/a&gt;. This dataset is rather big. For example, dump for &lt;em&gt;2015-10&lt;/em&gt; (which will be used below) is &lt;strong&gt;12gb bzip2 compressed file&lt;/strong&gt;. In uncompressed form it takes about 50gb. So I can’t call it a “toy” dataset :-) You can download original files &lt;a href=&quot;http://dumps.wikimedia.org/enwiki/&quot;&gt;here&lt;/a&gt;. We are interested in file which ends with &lt;em&gt;“pages-articles.xml.bz2”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All evaluation and timings were done on my MacBook laptop with an Intel Core i7 CPU and 16 GB of RAM.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can find all the code in the &lt;a href=&quot;https://github.com/dselivanov/word_embeddings&quot;&gt;post repository&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h1 id=&quot;preparation&quot;&gt;Preparation&lt;/h1&gt;

&lt;p&gt;After getting the enwiki dump, we need to clean it - that is, remove the wiki XML markup. I didn’t implement this stage in &lt;em&gt;text2vec&lt;/em&gt;, so we will use &lt;em&gt;gensim&lt;/em&gt;’s &lt;a href=&quot;https://github.com/piskvorky/sim-shootout&quot;&gt;scripts&lt;/a&gt; - specifically the file &lt;a href=&quot;https://github.com/piskvorky/sim-shootout/blob/master/prepare_shootout.py&quot;&gt;prepare_shootout.py&lt;/a&gt;. It would not be very hard to implement this in R, but it is not a top priority for me at the moment, so if anybody is willing to help - please see &lt;a href=&quot;https://github.com/dselivanov/text2vec/issues/32&quot;&gt;this issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After cleaning we will have the &lt;em&gt;“title_tokens.txt.gz”&lt;/em&gt; file, which contains the Wikipedia articles - one article per line. Each line consists of two &lt;em&gt;tab-separated&lt;/em&gt; (&lt;code&gt;&quot;\t&quot;&lt;/code&gt;) parts - the title of the article and its text. Texts consist of &lt;em&gt;space-separated&lt;/em&gt; (&lt;code&gt;&quot; &quot;&lt;/code&gt;) words in &lt;em&gt;lowercase&lt;/em&gt;.&lt;/p&gt;
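&lt;p&gt;Parsing this format is straightforward. A minimal sketch (hypothetical Python helper, assuming exactly the layout described above):&lt;/p&gt;

```python
def parse_article(line):
    # each line: article title, a tab, then space-separated lowercase tokens
    title, _, body = line.rstrip("\n").partition("\t")
    return title, body.split(" ")
```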

&lt;h3 id=&quot;r-io-tricks&quot;&gt;R I/O tricks&lt;/h3&gt;

&lt;p&gt;R’s &lt;code&gt;base::readLines()&lt;/code&gt; is a very generic function for reading lines of characters from files/connections, and because of that, &lt;strong&gt;&lt;code&gt;readLines()&lt;/code&gt; is very slow&lt;/strong&gt;. So in text2vec I use &lt;code&gt;readr::read_lines()&lt;/code&gt;, which is more than 10x faster. &lt;code&gt;readr&lt;/code&gt; is a relatively new package, and it has one significant drawback - it doesn’t have a streaming API. This means you can’t read a file line by line - you can only read the whole file in a single function call. Sometimes this can become an issue, but usually it isn’t - the user can manually split a big file into chunks using command-line tools and work with those. Moreover, if you perform analysis on really large amounts of data, you probably use &lt;em&gt;Apache Spark/Hadoop&lt;/em&gt; to prepare the input. And data in &lt;code&gt;hdfs&lt;/code&gt; is usually stored in chunks of 64/128 MB, so it is very natural to work with such chunks instead of a single file.&lt;/p&gt;
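&lt;p&gt;If you do need streaming behaviour on top of a line iterator, buffering lines into fixed-size chunks is enough. A sketch in Python (hypothetical code; the same idea works in R with a connection and repeated reads):&lt;/p&gt;

```python
def iter_line_chunks(lines, chunk_size=10000):
    # lines can be any iterator, e.g. an open file handle,
    # so the whole file never sits in memory at once
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```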

&lt;p&gt;For this post, I split &lt;code&gt;title_tokens.txt.gz&lt;/code&gt; into 100 MB chunks using the &lt;code&gt;split&lt;/code&gt; command-line utility:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;gunzip -c title_tokens.txt.gz &lt;span class=&quot;p&quot;&gt;|&lt;/span&gt; split --line-bytes&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;100m --filter&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&amp;#39;gzip --fast &amp;gt; ~/Downloads/datasets/$FILE.gz&amp;#39;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you are on &lt;strong&gt;OS X&lt;/strong&gt;, install &lt;code&gt;coreutils&lt;/code&gt; first: &lt;code&gt;brew install coreutils&lt;/code&gt; and use &lt;code&gt;gsplit&lt;/code&gt; command:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;gunzip -c title_tokens.txt.gz &lt;span class=&quot;p&quot;&gt;|&lt;/span&gt; gsplit --line-bytes&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;100m --filter&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&amp;#39;gzip --fast &amp;gt; ~/Downloads/datasets/$FILE.gz&amp;#39;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In all the code below we will use the &lt;code&gt;title_tokens.txt.gz&lt;/code&gt; file as input for &lt;em&gt;gensim&lt;/em&gt; and the &lt;code&gt;title_tokens_splits/&lt;/code&gt; directory as input for &lt;em&gt;text2vec&lt;/em&gt;.&lt;/p&gt;

&lt;h1 id=&quot;word-embeddings&quot;&gt;Word embeddings&lt;/h1&gt;

&lt;p&gt;Here I want to demonstrate how to use text2vec’s &lt;a href=&quot;http://nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt; implementation and briefly compare its performance with &lt;a href=&quot;https://code.google.com/p/word2vec/&quot;&gt;word2vec&lt;/a&gt;. Originally I had planned to implement &lt;em&gt;word2vec&lt;/em&gt;, but after reviewing the &lt;a href=&quot;http://www-nlp.stanford.edu/pubs/glove.pdf&quot;&gt;GloVe paper&lt;/a&gt;, I changed my mind. If you still haven’t read it, I strongly recommend doing so.&lt;/p&gt;

&lt;p&gt;So, this post has several goals:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Demonstrate how to process large collections of documents (that don’t fit into RAM) with &lt;strong&gt;text2vec&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Provide a tutorial on &lt;em&gt;text2vec&lt;/em&gt;’s GloVe word embeddings functionality.&lt;/li&gt;
  &lt;li&gt;Compare &lt;em&gt;text2vec GloVe&lt;/em&gt; and &lt;em&gt;gensim word2vec&lt;/em&gt; in terms of:
    &lt;ol&gt;
      &lt;li&gt;accuracy&lt;/li&gt;
      &lt;li&gt;execution time&lt;/li&gt;
      &lt;li&gt;RAM consumption&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Briefly highlight the advantages and drawbacks of the current implementation. (I’ll write a separate post with more details about the technical aspects.)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;baseline&quot;&gt;Baseline&lt;/h2&gt;

&lt;p&gt;Here we will follow Radim’s excellent &lt;a href=&quot;http://rare-technologies.com/making-sense-of-word2vec/&quot;&gt;Making sense of word2vec&lt;/a&gt; post and try to replicate his results.&lt;/p&gt;

&lt;h3 id=&quot;just-to-remind-results&quot;&gt;A quick reminder of his results&lt;/h3&gt;


&lt;p&gt;&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-3-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can find corresponding original repository &lt;a href=&quot;https://github.com/piskvorky/word_embeddings&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;modifications&quot;&gt;Modifications&lt;/h2&gt;
&lt;p&gt;I made a few minor modifications to Radim’s code.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;I don’t evaluate &lt;code&gt;glove-python&lt;/code&gt; for the following reasons:
    &lt;ol&gt;
      &lt;li&gt;Radim uses a dense numpy matrix to store cooccurrences. While this is fine for a 30K vocabulary (a &lt;code&gt;float32&lt;/code&gt; dense matrix occupies ~3.6 GB and takes less time to fill), it is not appropriate for larger vocabularies (for example, a &lt;code&gt;float32&lt;/code&gt; matrix for a 100K vocabulary would occupy ~40 GB).&lt;/li&gt;
      &lt;li&gt;The original &lt;a href=&quot;https://github.com/maciejkula/glove-python&quot;&gt;glove-python&lt;/a&gt; creates a sparse cooccurrence matrix, but for some reason it has very poor performance (accuracy on the analogy task of ~1-2%). I’m not very familiar with Python, so I can’t figure out what is wrong. If somebody can fix this issue - let me know, and I would be happy to add &lt;em&gt;glove-python&lt;/em&gt; to this comparison.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Construct the vocabulary from the top 30K words produced by the &lt;em&gt;text2vec&lt;/em&gt; vocabulary builder. &lt;em&gt;gensim&lt;/em&gt; takes into account the title of the article, which can contain upper-case words, punctuation, etc. I found that models based on a vocabulary constructed only from article bodies (not including titles) are more accurate. This is true for both GloVe and word2vec.&lt;/li&gt;
&lt;/ol&gt;
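&lt;p&gt;The memory figures in the list above are simple back-of-the-envelope arithmetic - a dense &lt;code&gt;float32&lt;/code&gt; matrix stores the squared vocabulary size in cells, 4 bytes each:&lt;/p&gt;

```python
def dense_cooccurrence_bytes(vocab_size, bytes_per_cell=4):
    # float32 dense matrix: vocab_size squared cells, 4 bytes per cell
    return vocab_size ** 2 * bytes_per_cell

gb_30k = dense_cooccurrence_bytes(30_000) / 1e9    # 3.6 GB
gb_100k = dense_cooccurrence_bytes(100_000) / 1e9  # 40.0 GB
```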

&lt;h1 id=&quot;building-the-model&quot;&gt;Building the model&lt;/h1&gt;

&lt;p&gt;I will &lt;strong&gt;focus on text2vec details&lt;/strong&gt; here, because the gensim word2vec code is almost the same as in Radim’s post (again - you can find all the code in &lt;a href=&quot;https://github.com/dselivanov/word_embeddings&quot;&gt;this repo&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Install &lt;em&gt;text2vec&lt;/em&gt; from github:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;devtools&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;install_github&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;dselivanov/text2vec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id=&quot;vocabulary&quot;&gt;Vocabulary&lt;/h2&gt;

&lt;p&gt;First of all we need to build a vocabulary:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;text2vec&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# create iterator over files in directory&lt;/span&gt;
it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; idir&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/datasets/splits/&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# create iterator over tokens&lt;/span&gt;
it2 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
              preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
                str_split&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; fixed&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;\t&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
                &lt;span class=&quot;c1&quot;&gt;# select only the body of the article&lt;/span&gt;
                &lt;span class=&quot;kp&quot;&gt;sapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;.&lt;/span&gt;subset2&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
              tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; str_split&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; fixed&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;On my machine this takes about &lt;em&gt;1150 sec&lt;/em&gt;, while &lt;em&gt;gensim&lt;/em&gt;’s &lt;code&gt;gensim.corpora.Dictionary()&lt;/code&gt; takes about &lt;em&gt;2100 sec&lt;/em&gt;. Raw I/O alone takes &lt;em&gt;~ 150 sec&lt;/em&gt;.
&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-6-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;pruning-vocabulary&quot;&gt;Pruning vocabulary&lt;/h3&gt;

&lt;p&gt;Now we have all unique words and their corresponding statistics:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;str&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;vocab&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##List of 2&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;## $ vocab:&amp;#39;data.frame&amp;#39;:	8306153 obs. of  4 variables:&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##  ..$ terms          : chr [1:8306153] &amp;quot;bonnerj&amp;quot; &amp;quot;beerworthc&amp;quot; &amp;quot;danielst&amp;quot; &amp;quot;anchaka&amp;quot; ...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##  ..$ terms_counts   : int [1:8306153] 1 1 1 1 1 1 1 1 1 1 ...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##  ..$ doc_counts     : int [1:8306153] 1 1 1 1 1 1 1 1 1 1 ...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##  ..$ doc_proportions: num [1:8306153] 2.55e-07 2.55e-07 2.55e-07 2.55e-07 2.55e-07 ...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;## $ ngram: Named int [1:2] 1 1&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##  ..- attr(*, &amp;quot;names&amp;quot;)= chr [1:2] &amp;quot;ngram_min&amp;quot; &amp;quot;ngram_max&amp;quot;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;## - attr(*, &amp;quot;class&amp;quot;)= chr &amp;quot;text2vec_vocabulary&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But we are only interested in frequent words, so we should filter out rare ones. text2vec provides the &lt;code&gt;prune_vocabulary()&lt;/code&gt; function, which has many useful options for this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;TOKEN_LIMIT &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;30000L&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# filter out tokens which appear in more than 30% of documents&lt;/span&gt;
TOKEN_DOC_PROPORTION &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.3&lt;/span&gt;
pruned_vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; prune_vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;vocabulary &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; vocab&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                     doc_proportion_max &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; TOKEN_DOC_PROPORTION&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                     max_number_of_terms &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; TOKEN_LIMIT&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# save to csv to use in gensim word2vec&lt;/span&gt;
write.table&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;word&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; pruned_vocab&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;vocab&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;terms&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;id&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;TOKEN_LIMIT &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt; 
            file &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;/path/to/destination/dir/pruned_vocab.csv&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            quote &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; sep &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; row.names &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; col.names &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id=&quot;corpus-construction&quot;&gt;Corpus construction&lt;/h2&gt;

&lt;p&gt;Now we have a vocabulary and can construct the term-cooccurrence matrix (&lt;strong&gt;tcm&lt;/strong&gt;).&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;WINDOW &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10L&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# create iterator over files in directory&lt;/span&gt;
it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; idir&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/datasets/splits/&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# create iterator over tokens&lt;/span&gt;
it2 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
              preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
                str_split&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; fixed&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;\t&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
                &lt;span class=&quot;c1&quot;&gt;# select only the body of the article&lt;/span&gt;
                &lt;span class=&quot;kp&quot;&gt;sapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;.&lt;/span&gt;subset2&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
              tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; str_split&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; fixed&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# create_vocab_corpus can construct the document-term matrix and the term-cooccurrence matrix simultaneously&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# here we are not interested in the document-term matrix, so we set `grow_dtm = FALSE`&lt;/span&gt;
corpus &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_vocab_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it2&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; vocabulary &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; pruned_vocab&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; grow_dtm &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; skip_grams_window &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; WINDOW&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# in this call, we wrap std::unordered_map into R&amp;#39;s dgTMatrix&lt;/span&gt;
tcm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_tcm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The operation above takes about &lt;strong&gt;80 minutes&lt;/strong&gt; on my machine and consumes about &lt;strong&gt;11gb of RAM&lt;/strong&gt; at peak.&lt;/p&gt;

&lt;h3 id=&quot;short-note-on-memory-consumption&quot;&gt;Short note on memory consumption&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;At the moment, during corpus construction, &lt;em&gt;text2vec&lt;/em&gt; keeps the entire term-cooccurrence matrix in memory&lt;/strong&gt;. This may change in future versions (quite easily, via a simple map-reduce style algorithm - exactly the way it is done in the original Stanford implementation).&lt;/p&gt;

&lt;p&gt;As you can see, memory consumption is rather high. But some basic calculations show that it is actually reasonable:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Internally &lt;em&gt;text2vec&lt;/em&gt; stores &lt;code&gt;tcm&lt;/code&gt; as &lt;code&gt;std::unordered_map&amp;lt;std::pair&amp;lt;uint32_t, uint32_t&amp;gt;, float&amp;gt;&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;We &lt;strong&gt;store&lt;/strong&gt; only elements which are &lt;strong&gt;above main diagonal&lt;/strong&gt;, because our matrix is &lt;strong&gt;symmetric&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;The &lt;code&gt;tcm&lt;/code&gt; above consists of ~ &lt;code&gt;200e6&lt;/code&gt; elements stored above the diagonal. The matrix is &lt;strong&gt;quite dense - ~ 22% non-zero elements&lt;/strong&gt; (storing only the upper triangle; with full symmetric storage it would be ~ 44% dense).&lt;/li&gt;
&lt;/ol&gt;
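&lt;p&gt;The symmetric upper-triangle storage in point 2 can be illustrated with a tiny base R sketch (toy data, not text2vec internals): since the cooccurrence of words &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt; equals that of &lt;em&gt;b&lt;/em&gt; and &lt;em&gt;a&lt;/em&gt;, we keep only triplets whose row index does not exceed the column index, and normalize the lookup order:&lt;/p&gt;

```r
# toy upper-triangle triplet store: only pairs with row index up to column index
i = c(1L, 1L, 2L)
j = c(2L, 3L, 3L)
x = c(5, 2, 1)

# symmetric lookup: reorder the query so it always hits the stored triangle
get_cooc = function(a, b) {
  k = which(i == min(a, b) & j == max(a, b))
  if (length(k)) x[k] else 0
}

get_cooc(3, 1)  # same cell as get_cooc(1, 3)
```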

&lt;p&gt;So &lt;em&gt;200e6 * (4 + 4 + 8) bytes = ~ 3.2 gb&lt;/em&gt; is just the memory needed to store our matrix in sparse triplet form using preallocated vectors. On top of that we should add the usual &lt;strong&gt;3-4x &lt;code&gt;std::unordered_map&lt;/code&gt; overhead&lt;/strong&gt;, plus the memory allocated when wrapping the &lt;code&gt;unordered_map&lt;/code&gt; into an R sparse triplet &lt;code&gt;dgTMatrix&lt;/code&gt;.&lt;/p&gt;
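&lt;p&gt;The estimate is easy to reproduce (the 3-4x hash-map factor is an assumption - a typical figure for &lt;code&gt;std::unordered_map&lt;/code&gt;, not a measured one):&lt;/p&gt;

```r
n_nonzero = 200e6              # triplets stored above the diagonal
bytes_per_triplet = 4 + 4 + 8  # two uint32_t indices + 8-byte value
triplet_gb = n_nonzero * bytes_per_triplet / 1e9
triplet_gb                     # 3.2

# with an assumed 3-4x hash-map overhead on top:
range_gb = triplet_gb * c(3, 4)
```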

&lt;h2 id=&quot;glove-training&quot;&gt;GloVe training&lt;/h2&gt;

&lt;p&gt;We fit GloVe using AdaGrad - stochastic gradient descent with a per-feature adaptive learning rate. Fitting is done in a fully parallel and asynchronous manner (see the &lt;a href=&quot;https://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf&quot;&gt;Hogwild! paper&lt;/a&gt;), so it benefits from machines with multiple cores. In my tests I achieved an almost 8x speedup on an 8-core machine on the wikipedia dataset described above.&lt;/p&gt;
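&lt;p&gt;For intuition, a per-feature AdaGrad update looks roughly like this (a minimal R sketch of the general rule; the actual update lives in the package’s C++ code and may differ in details):&lt;/p&gt;

```r
# one AdaGrad step for a single parameter vector:
# each coordinate gets its own effective learning rate,
# shrinking as squared gradients accumulate in grad_sq_hist
adagrad_step = function(w, grad, grad_sq_hist, learning_rate = 0.15) {
  grad_sq_hist = grad_sq_hist + grad ^ 2
  w = w - learning_rate * grad / sqrt(grad_sq_hist)
  list(w = w, grad_sq_hist = grad_sq_hist)
}

# on the first step each coordinate moves by roughly learning_rate,
# regardless of the raw gradient magnitude
s = adagrad_step(w = c(0.5, 0.5), grad = c(1, 0.01),
                 grad_sq_hist = c(1e-8, 1e-8))
```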

&lt;p&gt;Now we are ready to train our GloVe model. Here we will perform at most 20 iterations, tracking the global cost and its improvement across iterations. We stop fitting when the improvement (relative to the previous epoch) becomes smaller than a given threshold - &lt;code&gt;convergence_threshold&lt;/code&gt;.&lt;/p&gt;
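&lt;p&gt;The stopping rule itself is just a relative-improvement check; a sketch (not the package’s internal code):&lt;/p&gt;

```r
# stop when the relative cost improvement over the previous
# epoch falls below convergence_threshold
should_stop = function(cost_prev, cost_cur, convergence_threshold = 0.005) {
  (cost_prev - cost_cur) / cost_prev < convergence_threshold
}

should_stop(cost_prev = 0.020,  cost_cur = 0.015)    # 25% improvement: keep going
should_stop(cost_prev = 0.0150, cost_cur = 0.01499)  # ~0.07% improvement: stop
```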

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;DIM &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;600L&lt;/span&gt;
X_MAX &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;100L&lt;/span&gt;
WORKERS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4L&lt;/span&gt;
NUM_ITERS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;20L&lt;/span&gt;
CONVERGENCE_THRESHOLD &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.005&lt;/span&gt;
LEARNING_RATE &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.15&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# explicitly set number of threads&lt;/span&gt;
RcppParallel&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;setThreadOptions&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;numThreads &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; WORKERS&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; glove&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tcm &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; tcm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             word_vectors_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; DIM&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             num_iters &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; NUM_ITERS&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
             learning_rate &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; LEARNING_RATE&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
             x_max &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; X_MAX&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             shuffle_seed &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;42L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             &lt;span class=&quot;c1&quot;&gt;# we will stop if the global cost improves by less than 0.5% over the previous SGD iteration&lt;/span&gt;
             convergence_threshold &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; CONVERGENCE_THRESHOLD&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This takes about &lt;strong&gt;431 minutes&lt;/strong&gt; on my machine and stops at iteration 20 (no early stopping):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;2015-12-01 06:37:27 - epoch 20, expected cost 0.0145&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt; on the analogy dataset is &lt;strong&gt;0.759&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;words &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tcm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
m &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;word_vectors&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;w_i &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;word_vectors&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;w_j
&lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  words

questions_file &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/datasets/questions-words.txt&amp;#39;&lt;/span&gt;
qlst &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; prepare_analogue_questions&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;questions_file&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
res &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; check_analogue_accuracy&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;questions_lst &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; qlst&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; m_word_vectors &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
  &lt;p&gt;2015-12-01 06:48:23 - capital-common-countries: correct 476 out of 506, accuracy = 0.9407&lt;br /&gt;
2015-12-01 06:48:27 - capital-world: correct 2265 out of 2359, accuracy = 0.9602&lt;br /&gt;
2015-12-01 06:48:28 - currency: correct 4 out of 86, accuracy = 0.0465&lt;br /&gt;
2015-12-01 06:48:31 - city-in-state: correct 1828 out of 2330, accuracy = 0.7845&lt;br /&gt;
2015-12-01 06:48:32 - family: correct 272 out of 306, accuracy = 0.8889&lt;br /&gt;
2015-12-01 06:48:33 - gram1-adjective-to-adverb: correct 179 out of 650, accuracy = 0.2754&lt;br /&gt;
2015-12-01 06:48:33 - gram2-opposite: correct 131 out of 272, accuracy = 0.4816&lt;br /&gt;
2015-12-01 06:48:34 - gram3-comparative: correct 806 out of 930, accuracy = 0.8667&lt;br /&gt;
2015-12-01 06:48:35 - gram4-superlative: correct 279 out of 506, accuracy = 0.5514&lt;br /&gt;
2015-12-01 06:48:37 - gram5-present-participle: correct 445 out of 870, accuracy = 0.5115&lt;br /&gt;
2015-12-01 06:48:39 - gram6-nationality-adjective: correct 1364 out of 1371, accuracy = 0.9949&lt;br /&gt;
2015-12-01 06:48:41 - gram7-past-tense: correct 836 out of 1406, accuracy = 0.5946&lt;br /&gt;
2015-12-01 06:48:42 - gram8-plural: correct 833 out of 1056, accuracy = 0.7888&lt;br /&gt;
2015-12-01 06:48:43 - gram9-plural-verbs: correct 341 out of 600, accuracy = 0.5683&lt;br /&gt;
2015-12-01 06:48:43 - OVERALL ACCURACY = 0.7593&lt;/p&gt;
&lt;/blockquote&gt;
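&lt;p&gt;For reference, a single analogy question (&lt;em&gt;a&lt;/em&gt; is to &lt;em&gt;b&lt;/em&gt; as &lt;em&gt;c&lt;/em&gt; is to ?) is answered by a nearest-neighbour search around &lt;em&gt;b - a + c&lt;/em&gt; in vector space. A toy base R sketch with made-up 2-d vectors (for illustration only - this is not text2vec’s evaluation code):&lt;/p&gt;

```r
# made-up 2-d word vectors, chosen so the analogy works out
vecs = rbind(king  = c(1, 2),
             man   = c(1, 1),
             woman = c(2, 1),
             queen = c(2, 2),
             apple = c(9, 0))
target = vecs["king", ] - vecs["man", ] + vecs["woman", ]

cosine = function(u, v) sum(u * v) / sqrt(sum(u ^ 2) * sum(v ^ 2))

# rank all words except the question words by cosine similarity to the target
candidates = setdiff(rownames(vecs), c("king", "man", "woman"))
sims = sapply(candidates, function(w) cosine(vecs[w, ], target))
names(which.max(sims))  # "queen"
```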

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; that sometimes AdaGrad converges to a poorer local minimum with a larger cost, which means the model produces less accurate predictions. For example, in some experiments while writing this post I ended up with &lt;code&gt;cost = 0.190&lt;/code&gt; and &lt;code&gt;accuracy = ~ 0.72&lt;/code&gt;. Also, fitting &lt;strong&gt;can be sensitive to the initial learning rate&lt;/strong&gt;; some experimentation is still needed.&lt;/p&gt;

&lt;p&gt;Training &lt;strong&gt;word2vec takes 401 minutes&lt;/strong&gt; and reaches &lt;strong&gt;accuracy = 0.687&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As we can see, GloVe shows significantly better accuracy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-12-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A closer look at resource usage:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-13-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;faster-training&quot;&gt;Faster training&lt;/h3&gt;

&lt;p&gt;If you care more about training time, you can do the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Introduce more zeros into &lt;code&gt;tcm&lt;/code&gt; by removing very rare cooccurrences:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;ind &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;
    tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;ind&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;i &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;i&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;ind&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;j &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;j&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;ind&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
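&lt;p&gt;On a toy triplet the effect of this filtering is easy to see (plain vectors stand in for the &lt;code&gt;dgTMatrix&lt;/code&gt; slots):&lt;/p&gt;

```r
# toy stand-ins for tcm@x, tcm@i, tcm@j
x = c(0.5, 3, 1, 0.2)
i = c(0L, 1L, 2L, 3L)
j = c(1L, 2L, 3L, 0L)

ind = x >= 1           # keep only cooccurrences with weight of at least 1
x = x[ind]; i = i[ind]; j = j[ind]
length(x)              # 2 of the 4 triplets survive
```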

&lt;ol start=&quot;2&quot;&gt;
  &lt;li&gt;Perform only 10 iterations with a lower word vector dimension (&lt;code&gt;DIM = 300&lt;/code&gt; for example).&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;DIM &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;300L&lt;/span&gt;
    NUM_ITERS &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;
    fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; glove&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tcm &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; tcm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 word_vectors_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; DIM&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 num_iters &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; NUM_ITERS&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 learning_rate &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 x_max &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; X_MAX&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 shuffle_seed &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;42L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 max_cost &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# we will stop if the global cost improves by less than 1% over the previous SGD iteration&lt;/span&gt;
                 convergence_threshold &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    words &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tcm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    m &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;word_vectors&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;w_i &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;word_vectors&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;w_j
    &lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  words
    questions_file &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/datasets/questions-words.txt&amp;#39;&lt;/span&gt;
    qlst &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; prepare_analogue_questions&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;questions_file&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    res &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; check_analogue_accuracy&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;questions_lst &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; qlst&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; m_word_vectors &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Training takes about 50 minutes on a 4-core machine and reaches ~68% accuracy:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;2015-11-30 15:13:51 - capital-common-countries: correct 482 out of 506, accuracy = 0.9526&lt;br /&gt;
2015-11-30 15:13:54 - capital-world: correct 2235 out of 2359, accuracy = 0.9474&lt;br /&gt;
2015-11-30 15:13:54 - currency: correct 1 out of 86, accuracy = 0.0116&lt;br /&gt;
2015-11-30 15:13:57 - city-in-state: correct 1540 out of 2330, accuracy = 0.6609&lt;br /&gt;
2015-11-30 15:13:57 - family: correct 247 out of 306, accuracy = 0.8072&lt;br /&gt;
2015-11-30 15:13:58 - gram1-adjective-to-adverb: correct 142 out of 650, accuracy = 0.2185&lt;br /&gt;
2015-11-30 15:13:58 - gram2-opposite: correct 87 out of 272, accuracy = 0.3199&lt;br /&gt;
2015-11-30 15:13:59 - gram3-comparative: correct 663 out of 930, accuracy = 0.7129&lt;br /&gt;
2015-11-30 15:14:00 - gram4-superlative: correct 171 out of 506, accuracy = 0.3379&lt;br /&gt;
2015-11-30 15:14:01 - gram5-present-participle: correct 421 out of 870, accuracy = 0.4839&lt;br /&gt;
2015-11-30 15:14:03 - gram6-nationality-adjective: correct 1340 out of 1371, accuracy = 0.9774&lt;br /&gt;
2015-11-30 15:14:04 - gram7-past-tense: correct 608 out of 1406, accuracy = 0.4324&lt;br /&gt;
2015-11-30 15:14:06 - gram8-plural: correct 771 out of 1056, accuracy = 0.7301&lt;br /&gt;
2015-11-30 15:14:06 - gram9-plural-verbs: correct 266 out of 600, accuracy = 0.4433&lt;br /&gt;
2015-11-30 15:14:06 - OVERALL ACCURACY = 0.6774&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-16-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-17-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;summary&quot;&gt;Summary&lt;/h1&gt;

&lt;h2 id=&quot;advantages&quot;&gt;Advantages&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;As we have seen, &lt;em&gt;text2vec&lt;/em&gt;’s &lt;em&gt;GloVe&lt;/em&gt; implementation is a good alternative to word2vec and outperforms it in terms of accuracy and running time (we can pick a set of parameters with which it is &lt;strong&gt;both faster and more accurate&lt;/strong&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Early stopping&lt;/strong&gt;. We can stop training when improvements become small.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;tcm&lt;/code&gt; is &lt;strong&gt;reusable&lt;/strong&gt;. It may be fairer to subtract the &lt;code&gt;tcm&lt;/code&gt; creation time from the benchmarks above.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Incremental fitting&lt;/strong&gt;. You can easily adjust &lt;code&gt;tcm&lt;/code&gt; with new data and continue fitting. And it will converge very quickly.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;text2vec&lt;/em&gt; works on OS X, Linux and even Windows without any tricks/adjustments/manual configuration, thanks to &lt;em&gt;Intel Threading Building Blocks&lt;/em&gt; and &lt;a href=&quot;http://rcppcore.github.io/RcppParallel/&quot;&gt;RcppParallel&lt;/a&gt;. It was a little simpler to program AdaGrad using &lt;em&gt;OpenMP&lt;/em&gt; (as I actually did in my first attempt), but that leads to installation issues, especially on OS X and Windows machines.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;drawbacks-and-what-can-be-improved&quot;&gt;Drawbacks and what can be improved&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;One &lt;strong&gt;drawback&lt;/strong&gt;: it &lt;strong&gt;uses a lot of memory&lt;/strong&gt; (in contrast to &lt;em&gt;gensim&lt;/em&gt;, which is very memory-friendly - I was very impressed). But this is natural - the fastest way to construct the &lt;code&gt;tcm&lt;/code&gt; is to keep it in RAM as a hash map and perform cooccurrence increments globally. Also note that we build the &lt;code&gt;tcm&lt;/code&gt; on the top 30000 terms, which is why it is so dense. I tried to build the model on the top 100000 terms and had no problems on a machine with 32gb of RAM; the matrix was much sparser - ~ 4% non-zero elements. In any case, one could implement a simple map-reduce style algorithm to construct the &lt;code&gt;tcm&lt;/code&gt; (using files, as it is done in the original Stanford implementation) and then fit the model in a streaming manner.&lt;/li&gt;
  &lt;li&gt;Sometimes the model is quite &lt;strong&gt;sensitive to the initial learning rate&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

</content>
 </entry>
 
 <entry>
   <title>Analyzing texts with text2vec package.</title>
   <link href="http://dsnotes.com/blog/text2vec/2015/11/09/text2vec"/>
   <updated>2015-11-09T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/text2vec/2015/11/09/text2vec</id>
   <content type="html">&lt;p&gt;In the last weeks I have actively worked on &lt;a href=&quot;https://github.com/dselivanov/text2vec&quot;&gt;text2vec&lt;/a&gt; (formerly tmlite) - R package, which provides tools for fast text vectorization and state-of-the art word embeddings.&lt;/p&gt;

&lt;p&gt;This project is an experiment for me - how much can a single person do in a particular area? After these hard weeks, I believe the answer is: quite a lot.&lt;/p&gt;

&lt;p&gt;There are a lot of changes from my previous &lt;a href=&quot;http://dsnotes.com/blog/2015/09/16/tmlite-intro/&quot;&gt;introduction post&lt;/a&gt;, and I want to highlight a few of them:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The package was renamed to &lt;strong&gt;text2vec&lt;/strong&gt; because, I believe, this name better reflects its functionality.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;New API&lt;/strong&gt;: cleaner and more concise.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;GloVe&lt;/strong&gt; word embeddings. Training is &lt;strong&gt;fully parallelized&lt;/strong&gt; - asynchronous SGD with adaptive learning rate (AdaGrad). &lt;strong&gt;Works on all platforms, including windows.&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Added &lt;strong&gt;ngram&lt;/strong&gt; feature to vectorization. Now it is very easy to build &lt;em&gt;Document-Term matrix&lt;/em&gt;, using arbitrary &lt;code&gt;ngrams&lt;/code&gt; instead of simple &lt;code&gt;unigrams&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Switched to &lt;code&gt;MurmurHash3&lt;/code&gt; for &lt;strong&gt;feature hashing&lt;/strong&gt; and add &lt;code&gt;signed_hash&lt;/code&gt; option, which can &lt;a href=&quot;https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick&quot;&gt;reduce the effect of collisions&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Now text2vec uses regular exressions engine from &lt;code&gt;stringr&lt;/code&gt; package (which is built on top of &lt;code&gt;stringi&lt;/code&gt;). Now &lt;code&gt;regexp_tokenizer&lt;/code&gt; much is more fast and robust. Simple &lt;code&gt;word_tokenizer&lt;/code&gt;is also provided.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post I’ll focus on text vectorization tools provided by &lt;a href=&quot;https://github.com/dselivanov/text2vec&quot;&gt;text2vec&lt;/a&gt;. Also, it will be a base for a &lt;code&gt;text2vec&lt;/code&gt; vignette. I’ll write another post about &lt;a href=&quot;http://nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt; next week, don’t miss it.&lt;/p&gt;

&lt;p&gt;Please don’t forget to install &lt;code&gt;text2vec&lt;/code&gt; first:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;devtools&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;install_github&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;dselivanov/text2vec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1 id=&quot;features&quot;&gt;Features&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;text2vec&lt;/strong&gt; is a package whose main goal is to provide an &lt;strong&gt;efficient framework&lt;/strong&gt; with a &lt;strong&gt;concise API&lt;/strong&gt; for &lt;strong&gt;text analysis&lt;/strong&gt; and &lt;strong&gt;natural language processing (NLP)&lt;/strong&gt; in R. It is inspired by &lt;a href=&quot;http://radimrehurek.com/gensim/&quot;&gt;gensim&lt;/a&gt;, an excellent Python library for NLP.&lt;/p&gt;

&lt;h2 id=&quot;core-functionality&quot;&gt;Core functionality&lt;/h2&gt;

&lt;p&gt;At the moment we cover the following two topics:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Fast text vectorization on arbitrary n-grams.
    &lt;ul&gt;
      &lt;li&gt;using vocabulary&lt;/li&gt;
      &lt;li&gt;using feature hashing&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;State-of-the-art &lt;a href=&quot;http://www-nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt; word embeddings.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;efficiency&quot;&gt;Efficiency&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The core of the functionality is &lt;strong&gt;carefully written in C++&lt;/strong&gt;. This also means text2vec is &lt;strong&gt;memory-friendly&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Some parts (GloVe training) are fully &lt;strong&gt;parallelized&lt;/strong&gt; using the excellent &lt;a href=&quot;http://rcppcore.github.io/RcppParallel/&quot;&gt;RcppParallel&lt;/a&gt; package. This means &lt;strong&gt;parallel features work on OS X, Linux, Windows and Solaris (x86) without any additional tuning/hacking/tricks&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Streaming API&lt;/strong&gt;: users don’t have to load all the data into RAM. &lt;strong&gt;text2vec&lt;/strong&gt; allows processing streams of chunks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;api&quot;&gt;API&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Built around &lt;a href=&quot;https://en.wikipedia.org/wiki/Iterator&quot;&gt;iterator&lt;/a&gt; abstraction.&lt;/li&gt;
  &lt;li&gt;Concise, provides only a few functions, which do their job well.&lt;/li&gt;
  &lt;li&gt;Does not (and probably will not) provide trivial, very high-level functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;terminology-and-what-is-under-the-hood&quot;&gt;Terminology and what is under the hood&lt;/h1&gt;

&lt;p&gt;As stated before, text2vec is built around a streaming API and &lt;strong&gt;iterators&lt;/strong&gt;, which allow construction of the &lt;strong&gt;corpus&lt;/strong&gt; from &lt;em&gt;iterable&lt;/em&gt; objects. Here we touch on two main concepts:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Corpus&lt;/strong&gt;. In text2vec this is an object which contains tokens and other information / metainformation used for text vectorization and other processing. We can efficiently insert documents into a corpus because, technically, &lt;strong&gt;Corpus&lt;/strong&gt; is a C++ class, wrapped with &lt;em&gt;Rcpp-modules&lt;/em&gt; as a &lt;em&gt;reference class&lt;/em&gt; (which has reference semantics!). Usually the user does not need to care about this, but should keep in mind the nature of such objects. In particular, remember that these objects cannot be saved/serialized using R’s &lt;code&gt;save*()&lt;/code&gt; methods. The good news is that the corresponding R objects can be easily and efficiently extracted from the corpus and used in the usual way.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Iterators&lt;/strong&gt;. If you are not familiar with them in the &lt;code&gt;R&lt;/code&gt; context, I highly recommend reviewing the vignettes of the &lt;a href=&quot;https://cran.r-project.org/web/packages/iterators/&quot;&gt;iterators&lt;/a&gt; package. A big advantage of this abstraction is that it allows us to be &lt;strong&gt;agnostic to the type of input&lt;/strong&gt; - we can transparently change it by just providing a correct iterator.&lt;/li&gt;
&lt;/ol&gt;
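&lt;p&gt;As a tiny illustration of the iterator abstraction, here is a sketch using the same &lt;code&gt;itoken&lt;/code&gt; call as in the examples below (the toy &lt;code&gt;docs&lt;/code&gt; vector is made up for illustration):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;library(text2vec)
# any iterable collection of documents works; here a plain character vector
docs &amp;lt;- c(&amp;#39;This movie is great&amp;#39;, &amp;#39;This movie is awful&amp;#39;)
# itoken() wraps it into an iterator over preprocessed, tokenized chunks
it &amp;lt;- itoken(docs, preprocess_function = tolower, tokenizer = word_tokenizer)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;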

&lt;h1 id=&quot;text-vectorization&quot;&gt;Text vectorization&lt;/h1&gt;

&lt;p&gt;Historically, most text-mining and NLP modelling has been based on &lt;a href=&quot;https://en.wikipedia.org/wiki/Bag-of-words_model&quot;&gt;Bag-of-words&lt;/a&gt; or &lt;a href=&quot;https://en.wikipedia.org/wiki/N-gram&quot;&gt;Bag-of-ngrams&lt;/a&gt; models. Despite their simplicity, these models usually demonstrate good performance on text categorization/classification tasks. But, in contrast to their theoretical simplicity and practical efficiency, building &lt;em&gt;bag-of-words&lt;/em&gt; models involves technical challenges, especially within the &lt;code&gt;R&lt;/code&gt; framework, because of its copy-on-modify semantics.&lt;/p&gt;

&lt;h2 id=&quot;pipeline&quot;&gt;Pipeline&lt;/h2&gt;

&lt;p&gt;Let’s briefly review some details of a typical analysis pipeline:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Usually the researcher has to construct a &lt;a href=&quot;https://en.wikipedia.org/wiki/Document-term_matrix&quot;&gt;Document-Term matrix&lt;/a&gt; (DTM) from input documents. In other words, &lt;strong&gt;vectorize text&lt;/strong&gt; - create a mapping from words/ngrams to a &lt;a href=&quot;https://en.wikipedia.org/wiki/Vector_space_model&quot;&gt;vector space&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Fit a model on this DTM. This can include:
    &lt;ul&gt;
      &lt;li&gt;text classification&lt;/li&gt;
      &lt;li&gt;topic modeling&lt;/li&gt;
      &lt;li&gt;…&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Tune and validate the model.&lt;/li&gt;
  &lt;li&gt;Apply the model to new data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here we will mostly discuss the first stage. The underlying texts can take a lot of space, but the vectorized ones usually do not, because they are stored as sparse matrices. For the reason above - copy-on-modify semantics - it is not easy in R to grow a DTM iteratively. So the construction of such objects, even for small collections of documents, can become a serious headache for analysts and researchers. It usually involves reading the whole collection of text documents into RAM and processing it as a single vector, which can easily increase memory consumption by a factor of 2 to 4 (and, to tell the truth, this is quite optimistic). Fortunately, there is a better way - the text2vec way. Let’s check how it works on a simple example.&lt;/p&gt;

&lt;h2 id=&quot;sentiment-analysis-on-imdb-moview-review-dataset&quot;&gt;Sentiment analysis on the IMDB movie review dataset&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;text2vec&lt;/strong&gt; provides the &lt;code&gt;movie_review&lt;/code&gt; dataset. It consists of 25000 movie reviews, each marked as positive or negative.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;text2vec&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Loading required package: methods&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;data&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;movie_review&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# str(movie_review, nchar.max = 20, width = 80, strict.width = &amp;#39;wrap&amp;#39;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To represent documents in vector space, first of all we have to create &lt;code&gt;term -&amp;gt; term_id&lt;/code&gt; mappings. We use the term &lt;em&gt;term&lt;/em&gt; instead of &lt;em&gt;word&lt;/em&gt; because it can actually be an arbitrary &lt;em&gt;ngram&lt;/em&gt;, not just a single word. Having a set of documents, we want to represent them as a &lt;em&gt;sparse matrix&lt;/em&gt;, where each row corresponds to a &lt;em&gt;document&lt;/em&gt; and each column corresponds to a &lt;em&gt;term&lt;/em&gt;. This can be done in two ways: using a &lt;strong&gt;vocabulary&lt;/strong&gt;, or by &lt;strong&gt;feature hashing&lt;/strong&gt; (the hashing trick).&lt;/p&gt;
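&lt;p&gt;The hashing route follows the same streaming pattern as the vocabulary route below, but maps terms to column indices with a hash function instead of a dictionary. The sketch below is only an assumption about the interface - the function name &lt;code&gt;create_hash_corpus&lt;/code&gt; and its arguments &lt;code&gt;hash_size&lt;/code&gt; and &lt;code&gt;signed_hash&lt;/code&gt; are illustrative; check the package documentation for the exact signature:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;it &amp;lt;- itoken(movie_review[[&amp;#39;review&amp;#39;]], preprocess_function = tolower,
             tokenizer = word_tokenizer)
# hypothetical call: 2^18 hash buckets; a signed hash can reduce
# the effect of collisions
corpus &amp;lt;- create_hash_corpus(it, hash_size = 2^18, signed_hash = TRUE)
dtm &amp;lt;- get_dtm(corpus)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;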

&lt;h3 id=&quot;vocabulary-based-vectorization&quot;&gt;Vocabulary based vectorization&lt;/h3&gt;
&lt;p&gt;Let’s examine the first choice. Here we collect unique terms from all documents and mark each of them with a &lt;em&gt;unique_id&lt;/em&gt;. The &lt;code&gt;vocabulary()&lt;/code&gt; function is designed specially for this purpose.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# using unigrams here&lt;/span&gt;
t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ngram &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 3.587275 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# str(vocab, nchar.max = 20, width = 80, strict.width = &amp;#39;wrap&amp;#39;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we can construct the DTM. Again, since all functions related to &lt;em&gt;corpus&lt;/em&gt; construction have a streaming API, we have to create an &lt;em&gt;iterator&lt;/em&gt; and provide it to the &lt;code&gt;create_vocab_corpus&lt;/code&gt; function:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
corpus &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_vocab_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; vocabulary &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; vocab&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We got the DTM. Let’s check its dimensions:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 25000 85752&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, it has 25000 rows (equal to the number of documents) and 85752 columns (equal to the number of unique terms).
Now we are ready to fit our first model. Here we will use the &lt;code&gt;glmnet&lt;/code&gt; package to fit a &lt;em&gt;logistic regression&lt;/em&gt; with an &lt;em&gt;L1&lt;/em&gt; penalty.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;glmnet&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                 family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;binomial&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 &lt;span class=&quot;c1&quot;&gt;# lasso penalty&lt;/span&gt;
                 alpha &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# we are interested in area under ROC curve&lt;/span&gt;
                 type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# 5-fold cross-validation&lt;/span&gt;
                 nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# high value, less accurate, but faster training&lt;/span&gt;
                 thresh &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# again lower number iterations for faster training&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# in this vignette&lt;/span&gt;
                 maxit &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 42.67177 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-11-09-text2vec/fit_1-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.9457&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that the training time is quite high. We can reduce it and also significantly improve accuracy.&lt;/p&gt;

&lt;h3 id=&quot;pruning-vocabulary&quot;&gt;Pruning vocabulary&lt;/h3&gt;

&lt;p&gt;We will prune our vocabulary. For example, we can find words like &lt;em&gt;“a”&lt;/em&gt;, &lt;em&gt;“the”&lt;/em&gt;, &lt;em&gt;“in”&lt;/em&gt; in almost all documents, but they don’t actually carry any useful information. They are usually called &lt;a href=&quot;https://en.wikipedia.org/wiki/Stop_words&quot;&gt;stop words&lt;/a&gt;. At the other extreme, the corpus also contains very &lt;em&gt;uncommon terms&lt;/em&gt;, which appear in only a few documents. These terms are also useless, because we don’t have sufficient statistics for them. Here we will filter both kinds out:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# remove very common and uncommon words&lt;/span&gt;
pruned_vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; prune_vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;vocab&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; term_count_min &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
 doc_proportion_max &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; doc_proportion_min &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
corpus &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_vocab_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; vocabulary &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; pruned_vocab&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id=&quot;tf-idf&quot;&gt;TF-IDF&lt;/h3&gt;

&lt;p&gt;Also we can (and usually should!) apply the &lt;strong&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Tf%E2%80%93idf&quot;&gt;TF-IDF&lt;/a&gt; transformation&lt;/strong&gt;, which will increase the weight of document-specific terms and decrease the weight of widely used terms:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dtm &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; tfidf_transformer&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## idf scaling matrix not provided, calculating it form input matrix&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 25000 10535&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, let’s fit our model again:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                 family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;binomial&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 &lt;span class=&quot;c1&quot;&gt;# lasso penalty&lt;/span&gt;
                 alpha &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# we are interested in area under ROC curve&lt;/span&gt;
                 type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# 5-fold cross-validation&lt;/span&gt;
                 nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# high value, less accurate, but faster training&lt;/span&gt;
                 thresh &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# again lower number iterations for faster training&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# in this vignette&lt;/span&gt;
                 maxit &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 19.19166 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-11-09-text2vec/fit_2-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.9497&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, we obtain faster training and a larger AUC.&lt;/p&gt;

&lt;h3 id=&quot;can-we-do-better&quot;&gt;Can we do better?&lt;/h3&gt;

&lt;p&gt;We can also try to use &lt;a href=&quot;https://en.wikipedia.org/wiki/N-gram&quot;&gt;ngram&lt;/a&gt;s instead of single words.
We will use ngrams up to length 3:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ngram &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 21.42234 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; vocab &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  prune_vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;term_count_min &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; doc_proportion_max &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; doc_proportion_min &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

corpus &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_vocab_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; vocabulary &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; vocab&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 32.06087 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; corpus &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  get_dtm &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  tfidf_transformer&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## idf scaling matrix not provided, calculating it form input matrix&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 25000 48462&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
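&lt;p&gt;The &lt;code&gt;tfidf_transformer&lt;/code&gt; step above down-weights terms that occur in many documents. As a language-neutral illustration, here is a minimal Python sketch of the standard tf-idf formula (toy dict-based data structures; this is not text2vec’s exact implementation):&lt;/p&gt;

```python
import math

def tfidf(dtm):
    """Toy TF-IDF: dtm is a list of per-document {term: count} dicts."""
    n_docs = len(dtm)
    df = {}  # document frequency of each term
    for doc in dtm:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    out = []
    for doc in dtm:
        total = sum(doc.values())
        # tf = count / doc length, idf = log(N / df)
        out.append({t: (c / total) * math.log(n_docs / df[t])
                    for t, c in doc.items()})
    return out

docs = [{"good": 2, "movie": 1}, {"bad": 1, "movie": 1}]
weighted = tfidf(docs)
# "movie" occurs in every document, so its weight drops to 0
```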

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                 family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;binomial&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 &lt;span class=&quot;c1&quot;&gt;# lasso penalty&lt;/span&gt;
                 alpha &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# interested in area under the ROC curve&lt;/span&gt;
                 type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# 5-fold cross-validation&lt;/span&gt;
                 nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# higher threshold value: less accurate, but faster training&lt;/span&gt;
                 thresh &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# again, a lower number of iterations for faster training&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# in this vignette&lt;/span&gt;
                 maxit &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 23.21233 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-11-09-text2vec/ngram_dtm_1-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.9566&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So we improved our model a little bit more. I leave further tuning to the reader.&lt;/p&gt;

&lt;h3 id=&quot;feature-hashing&quot;&gt;Feature hashing&lt;/h3&gt;

&lt;p&gt;If you haven’t heard about &lt;strong&gt;feature hashing&lt;/strong&gt; (also known as the &lt;strong&gt;hashing trick&lt;/strong&gt;), I recommend starting with the &lt;a href=&quot;https://en.wikipedia.org/wiki/Feature_hashing&quot;&gt;Wikipedia article&lt;/a&gt; and then reviewing the &lt;a href=&quot;http://alex.smola.org/papers/2009/Weinbergeretal09.pdf&quot;&gt;original paper&lt;/a&gt; by the Yahoo! research team. This technique is very fast because we don’t perform lookups in an associative array. Another benefit is a very low memory footprint: we can map an arbitrary number of features into a much more compact space. The method was popularized by Yahoo! and is widely used in &lt;a href=&quot;https://github.com/JohnLangford/vowpal_wabbit/&quot;&gt;Vowpal Wabbit&lt;/a&gt;.&lt;/p&gt;
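&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of the hashing trick. The &lt;code&gt;crc32&lt;/code&gt; hash and the dict-based sparse row are illustrative choices, not what text2vec actually uses:&lt;/p&gt;

```python
import zlib

HASH_SIZE = 2 ** 18  # number of columns; bounds memory regardless of vocabulary size

def hash_vectorize(tokens, hash_size=HASH_SIZE):
    """Map tokens directly to column indices - no dictionary lookup or storage."""
    row = {}
    for tok in tokens:
        # any fast deterministic hash works; crc32 is just a convenient stdlib choice
        idx = zlib.crc32(tok.encode("utf-8")) % hash_size
        row[idx] = row.get(idx, 0) + 1
    return row

row = hash_vectorize(["good", "movie", "good"])
```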

&lt;p&gt;Here I will demonstrate how to use feature hashing in &lt;strong&gt;text2vec&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

fh &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; feature_hasher&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;hash_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;18&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ngram &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

corpus &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_hash_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; feature_hasher &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; fh&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 12.53 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; corpus &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  get_dtm &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  tfidf_transformer&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## idf scaling matrix not provided, calculating it form input matrix&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1]  25000 262144&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                 family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;binomial&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 &lt;span class=&quot;c1&quot;&gt;# lasso penalty&lt;/span&gt;
                 alpha &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# interested in area under the ROC curve&lt;/span&gt;
                 type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# 5-fold cross-validation&lt;/span&gt;
                 nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# higher threshold value: less accurate, but faster training&lt;/span&gt;
                 thresh &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# again, a lower number of iterations for faster training&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# in this vignette&lt;/span&gt;
                 maxit &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 54.91197 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-11-09-text2vec/hash_dtm-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.947&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, the AUC is slightly worse, but DTM construction time was considerably lower. On large collections of documents this can be a decisive advantage.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Introducing tmlite - new framework for text mining in R</title>
   <link href="http://dsnotes.com/blog/text2vec/2015/09/16/tmlite-intro"/>
   <updated>2015-09-16T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/text2vec/2015/09/16/tmlite-intro</id>
   <content type="html">
&lt;h1 id=&quot;important-note&quot;&gt;IMPORTANT NOTE&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Code from this post is outdated (package APIs were changed).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See &lt;a href=&quot;http://dsnotes.com/blog/2015/11/09/text2vec/&quot;&gt;this post&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Today I am pleased to present &lt;a href=&quot;https://github.com/dselivanov/tmlite&quot;&gt;tmlite&lt;/a&gt; - a small but fast and robust package for text-mining tasks in R. It is not yet available on CRAN, but you can install it directly from GitHub:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;devtools&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;install_github&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;dselivanov/tmlite&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A reasonable question is: why a new package? R already has the great &lt;a href=&quot;https://cran.r-project.org/web/packages/tm/&quot;&gt;tm&lt;/a&gt; package and its companions &lt;a href=&quot;https://cran.r-project.org/web/packages/tau/&quot;&gt;tau&lt;/a&gt; and &lt;a href=&quot;https://cran.r-project.org/web/packages/NLP/&quot;&gt;NLP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I’ll try to answer these questions in the &lt;a href=&quot;#reasons-why-i-started-develop-tmlite&quot;&gt;last part of the post&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;focus&quot;&gt;Focus&lt;/h2&gt;

&lt;p&gt;As the Unix philosophy says - &lt;a href=&quot;https://en.wikipedia.org/wiki/Unix_philosophy#Do_One_Thing_and_Do_It_Well&quot;&gt;Do One Thing and Do It Well&lt;/a&gt; - we will focus on one particular problem: infrastructure for text analysis. The R ecosystem contains lots of packages that are well suited for working with sparse high-dimensional data (and thus suitable for text modeling). Here are my favourites:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/lda/index.html&quot;&gt;lda&lt;/a&gt; - a blazing fast package for topic modeling.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/glmnet/index.html&quot;&gt;glmnet&lt;/a&gt; for L1- and L2-regularized linear models.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/xgboost/&quot;&gt;xgboost&lt;/a&gt; for gradient boosting.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/LiblineaR/index.html&quot;&gt;LiblineaR&lt;/a&gt; - a wrapper for the &lt;code&gt;liblinear&lt;/code&gt; SVM library.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/irlba/index.html&quot;&gt;irlba&lt;/a&gt; - a fast and memory-efficient method for computing a few approximate singular values and singular vectors of large matrices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all excellent and very efficient packages, so &lt;strong&gt;tmlite&lt;/strong&gt; will focus (at least in the near future) not on modeling but on infrastructure - Document-Term matrix construction and manipulation, the basis for any text-mining analysis. &lt;strong&gt;tmlite&lt;/strong&gt; is partially inspired by &lt;a href=&quot;https://radimrehurek.com/gensim/&quot;&gt;gensim&lt;/a&gt; - a robust and well-designed Python library for text mining. In the near future we will try to replicate some of its functionality.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/dselivanov/tmlite&quot;&gt;tmlite&lt;/a&gt; is &lt;strong&gt;designed for practitioners&lt;/strong&gt; (and kagglers!) who:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;understand what they want and how to do it. So we will not expose trivial high-level APIs like &lt;code&gt;findAssocs&lt;/code&gt;, &lt;code&gt;findFreqTerms&lt;/code&gt;, etc.&lt;/li&gt;
  &lt;li&gt;work with medium to large collections of documents&lt;/li&gt;
  &lt;li&gt;have at least a medium level of experience in R and know the basic concepts of functional programming&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;key-features&quot;&gt;Key features&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Note that the package is at a very early (alpha) stage. This doesn’t mean the package is not robust, but it does mean the API can change at any time.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Flexible and easy functional-style API. Easy chaining.&lt;/li&gt;
  &lt;li&gt;Efficient and &lt;strong&gt;memory-friendly streaming corpus construction&lt;/strong&gt;. &lt;strong&gt;tmlite&lt;/strong&gt; provides an API for constructing corpora from &lt;code&gt;character&lt;/code&gt; vectors and, more importantly, from &lt;code&gt;connection&lt;/code&gt;s.  &lt;a href=&quot;https://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html&quot;&gt;Read more about connections here&lt;/a&gt;. So it is possible (and easy!) to construct Document-Term matrices for collections of documents that don’t fit in memory.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fast&lt;/strong&gt; - core functions are written in C++, thanks to the &lt;a href=&quot;https://cran.r-project.org/web/packages/Rcpp/index.html&quot;&gt;Rcpp&lt;/a&gt; authors.&lt;/li&gt;
  &lt;li&gt;Has two main corpus classes -
    &lt;ul&gt;
      &lt;li&gt;&lt;code&gt;DictCorpus&lt;/code&gt; - traditional dictionary-based container used for Document-Term matrix construction.&lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;&lt;code&gt;HashCorpus&lt;/code&gt; - container that implements &lt;a href=&quot;https://en.wikipedia.org/wiki/Feature_hashing&quot;&gt;feature hashing&lt;/a&gt; or &lt;strong&gt;“hashing trick”&lt;/strong&gt;. Similar to &lt;a href=&quot;http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing&quot;&gt;scikit-learn FeatureHasher&lt;/a&gt; and  &lt;a href=&quot;https://radimrehurek.com/gensim/corpora/hashdictionary.html&quot;&gt;gensim corpora.hashdictionary&lt;/a&gt;.&lt;/p&gt;

        &lt;blockquote&gt;
          &lt;p&gt;The class HashCorpus is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”. Instead of building a hash table of the features encountered in training, as the vectorizers do, instances of HashCorpus apply a hash function to the features to determine their column index in sample matrices directly.&lt;/p&gt;
        &lt;/blockquote&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The Document-Term matrix is the key object. At the moment it can be extracted from a corpus into &lt;code&gt;dgCMatrix&lt;/code&gt;, &lt;code&gt;dgTMatrix&lt;/code&gt;, or &lt;a href=&quot;https://www.cs.princeton.edu/~blei/lda-c/readme.txt&quot;&gt;LDA-C&lt;/a&gt; format, which is standard for the &lt;a href=&quot;https://cran.r-project.org/web/packages/lda/index.html&quot;&gt;lda&lt;/a&gt; package. &lt;code&gt;dgCMatrix&lt;/code&gt; is the default for sparse matrices in R, and most packages that work with sparse matrices accept &lt;code&gt;dgCMatrix&lt;/code&gt;, so it is easy to interoperate with them.&lt;/li&gt;
&lt;/ol&gt;
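&lt;p&gt;The difference between the two sparse formats above can be sketched in a few lines: &lt;code&gt;dgTMatrix&lt;/code&gt; stores (row, column, value) triplets, while &lt;code&gt;dgCMatrix&lt;/code&gt; compresses the column indices into a pointer array. A Python illustration of the conversion (the array names mirror the general CSC layout, not the Matrix package’s internals):&lt;/p&gt;

```python
def triplet_to_csc(rows, cols, vals, ncol):
    """Convert COO triplets (dgTMatrix-style) to CSC arrays (dgCMatrix-style)."""
    # sort entries column-major, as CSC requires
    order = sorted(range(len(vals)), key=lambda k: (cols[k], rows[k]))
    row_idx = [rows[k] for k in order]
    data = [vals[k] for k in order]
    # col_ptr[j] .. col_ptr[j+1] delimits the entries of column j
    col_ptr = [0] * (ncol + 1)
    for c in cols:
        col_ptr[c + 1] += 1
    for j in range(ncol):
        col_ptr[j + 1] += col_ptr[j]
    return row_idx, col_ptr, data

# a 2x3 matrix with nonzeros (0,0)=1, (1,2)=5, (0,2)=3
row_idx, col_ptr, data = triplet_to_csc([0, 1, 0], [0, 2, 2], [1, 5, 3], ncol=3)
```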

&lt;h2 id=&quot;quick-reference&quot;&gt;Quick reference&lt;/h2&gt;
&lt;p&gt;The first quick example is based on Kaggle’s &lt;a href=&quot;https://www.kaggle.com/c/word2vec-nlp-tutorial&quot;&gt;Bag of Words Meets Bags of Popcorn&lt;/a&gt; competition data - &lt;a href=&quot;https://www.kaggle.com/c/word2vec-nlp-tutorial/download/labeledTrainData.tsv.zip&quot;&gt;labeledTrainData.tsv.zip&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here I’ll demonstrate the flexibility of the corpus creation procedure and how to vectorize a large collection of documents.&lt;/p&gt;

&lt;p&gt;Suppose the text file is very large and contains 3 tab-separated columns, only one of which is relevant (the third column in the example below). We want to create a corpus but can’t read the whole file into memory. Here is how this is resolved. 
First, load the libraries:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;methods&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tmlite&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Loading required package: Matrix&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# for pipe syntax&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;magrittr&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The file contains 3 columns - &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;sentiment&lt;/code&gt;, and &lt;code&gt;review&lt;/code&gt;. Only &lt;code&gt;review&lt;/code&gt; is relevant.&lt;/p&gt;

&lt;p&gt;A simple preprocessing function will do the trick for us - we will read only the third column, the text of the review.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# function receives character vector - batch of rows.&lt;/span&gt;
preprocess_fun &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# file is tab-separated - split each row by \t&lt;/span&gt;
  rows &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;strsplit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;\t&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; fixed &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# text review is in the third column&lt;/span&gt;
  txt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;rows&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; x&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# tolower, keep only letters&lt;/span&gt;
  simple_preprocess&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;txt&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Read documents and create &lt;strong&gt;dictionary-based&lt;/strong&gt; corpus:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# we don&amp;#39;t want to read the whole file into RAM - we will read it iteratively, row by row&lt;/span&gt;
path &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/labeledTrainData.tsv&amp;#39;&lt;/span&gt;
con &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;path&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; open &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; blocking &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
corp &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_dict_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;src &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; con&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   preprocess_fun &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; preprocess_fun&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# simple_tokenizer - split string by whitespace&lt;/span&gt;
                   tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# read by batch of 1000 documents&lt;/span&gt;
                   batch_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;c1&quot;&gt;# skip first row - header&lt;/span&gt;
                   skip &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# do not show progress bar because of knitr&lt;/span&gt;
                   progress &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;
                  &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we want to try to predict sentiment based on the review text. For that we will use the &lt;strong&gt;glmnet&lt;/strong&gt; package, so we have to create a Document-Term matrix in &lt;code&gt;dgCMatrix&lt;/code&gt; format. This is easy with the &lt;code&gt;get_dtm&lt;/code&gt; function:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; corp&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;dgCMatrix&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  &lt;span class=&quot;c1&quot;&gt;# remove very common and very uncommon words&lt;/span&gt;
  dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;filter_commons_transformer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; term_freq &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;common &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; uncommon &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.975&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  &lt;span class=&quot;c1&quot;&gt;# make tf-idf transformation&lt;/span&gt;
  dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tfidf_transformer&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 25000 10067&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Cool. We have the feature matrix, but not the response variable, which still sits in the large file (one that possibly won’t fit into memory). Fortunately, reading particular columns is easy; for example, see this &lt;a href=&quot;http://stackoverflow.com/questions/2193742/ways-to-read-only-select-columns-from-a-file-into-r-a-happy-medium-between-re&quot;&gt;stackoverflow discussion&lt;/a&gt;. We will use the &lt;code&gt;fread()&lt;/code&gt; function from the &lt;strong&gt;data.table&lt;/strong&gt; package:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;data.table&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# read only second column - value of sentiment&lt;/span&gt;
dt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; fread&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;path&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; select &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now everything is ready for model fitting.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;glmnet&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Loading required package: foreach
## Loaded glmnet 2.0-2&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# I have 4 core machine, so will use parallel backend for n-fold crossvalidation&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;doParallel&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Loading required package: iterators
## Loading required package: parallel&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;registerDoParallel&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# train logistic regression with 4-fold cross-validation, maximizing AUC&lt;/span&gt;
fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dt&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                 family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;binomial&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; parallel &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-09-16-tmlite-intro/dict_dtm_fit-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.9483&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
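
&lt;p&gt;As a side note, the fitted &lt;code&gt;cv.glmnet&lt;/code&gt; object can be reused for prediction in the usual &lt;strong&gt;glmnet&lt;/strong&gt; way. A minimal sketch, assuming the &lt;code&gt;fit&lt;/code&gt; and &lt;code&gt;dtm&lt;/code&gt; objects from above:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# predicted probabilities at the lambda with the best cross-validated AUC
preds &amp;lt;- predict(fit, newx = dtm, s = &amp;quot;lambda.min&amp;quot;, type = &amp;quot;response&amp;quot;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;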

&lt;p&gt;Not bad!
Now let’s try to construct the &lt;strong&gt;dtm&lt;/strong&gt; using the &lt;code&gt;HashCorpus&lt;/code&gt; class. Our data is tiny, but for larger data or streaming environments, &lt;code&gt;HashCorpus&lt;/code&gt; is the natural choice. Read the documents and create a &lt;strong&gt;hash-based&lt;/strong&gt; corpus:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;con &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;path&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; open &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; blocking &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
hash_corp &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_hash_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;src &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; con&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                           preprocess_fun &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; preprocess_fun&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                           &lt;span class=&quot;c1&quot;&gt;# simple_tokenizer - split string by whitespace&lt;/span&gt;
                           tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                           &lt;span class=&quot;c1&quot;&gt;# read by batch of 1000 documents&lt;/span&gt;
                           batch_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                           &lt;span class=&quot;c1&quot;&gt;# skip first row - header&lt;/span&gt;
                           skip &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                           &lt;span class=&quot;c1&quot;&gt;# don&amp;#39;t show progress bar because of knitr&lt;/span&gt;
                           progress &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
hash_dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; hash_corp&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;dgCMatrix&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;filter_commons_transformer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; term_freq &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;common &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; uncommon &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.975&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tfidf_transformer&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# note that ncol(hash_dtm) &amp;gt; ncol(dtm) - an effect of collisions. We can fix this by increasing the `hash_size` parameter.&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;hash_dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 25000 10107&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;registerDoParallel&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
hash_fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; hash_dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dt&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                      family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;binomial&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                      nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; parallel &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;hash_fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-09-16-tmlite-intro/hash_dtm_fit-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# near the same result&lt;/span&gt;
print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;hash_fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.9481&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
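
&lt;p&gt;If collisions become a problem, the hash space can be enlarged. The following is a sketch, not tested code: it assumes &lt;code&gt;create_hash_corpus&lt;/code&gt; accepts the &lt;code&gt;hash_size&lt;/code&gt; parameter mentioned in the comment above, and the value &lt;code&gt;2 ^ 24&lt;/code&gt; is just an illustrative choice:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;con &amp;lt;- file(path, open = &amp;#39;r&amp;#39;, blocking = F)
# a larger hash space means fewer collisions, at the cost of a wider dtm
hash_corp_big &amp;lt;- create_hash_corpus(src = con,
                                    preprocess_fun = preprocess_fun,
                                    tokenizer = simple_tokenizer,
                                    batch_size = 1000, skip = 1, progress = F,
                                    hash_size = 2 ^ 24)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;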

&lt;h2 id=&quot;future-work&quot;&gt;Future work&lt;/h2&gt;
&lt;p&gt;The project has an &lt;a href=&quot;https://github.com/dselivanov/tmlite/issues&quot;&gt;issue tracker on github&lt;/a&gt; where I’m filing feature requests and notes for future work. Any ideas are greatly appreciated.&lt;/p&gt;

&lt;p&gt;If you like it, you can &lt;strong&gt;help&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Test and leave feedback on the &lt;a href=&quot;https://github.com/dselivanov/tmlite/issues&quot;&gt;github issue tracker&lt;/a&gt; (preferably) or directly by email.
    &lt;ul&gt;
      &lt;li&gt;the package is tested on Linux and OS X, so Windows users are especially welcome&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Fork and start contributing. Vignettes, docs, tests, use cases are very welcome.&lt;/li&gt;
  &lt;li&gt;Or just give me a star on &lt;a href=&quot;https://github.com/dselivanov/tmlite&quot;&gt;project page&lt;/a&gt; :-)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;short-term-plans&quot;&gt;Short-term plans&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;add tests&lt;/li&gt;
  &lt;li&gt;add n-gram tokenizers&lt;/li&gt;
  &lt;li&gt;add methods for tokenization in C++ (at the moment tokenization takes almost half of the runtime)&lt;/li&gt;
  &lt;li&gt;switch to the murmur3 hash and add a second hash function to reduce the probability of collisions&lt;/li&gt;
  &lt;li&gt;push dictionary and stopwords filtering into C++ code&lt;/li&gt;
&lt;/ul&gt;
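
&lt;p&gt;To illustrate why a second hash function helps: if one hash picks the bucket and another picks the sign, colliding terms tend to cancel rather than pile up. A toy illustration in plain R with made-up hash functions (not tmlite internals and not murmur3):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# naive polynomial string hash - picks one of m buckets
h1 &amp;lt;- function(s, m) {
  codes &amp;lt;- utf8ToInt(s)
  sum(codes * 31 ^ (seq_along(codes) - 1)) %% m + 1
}
# second hash - picks the sign, so colliding terms partially cancel
h2 &amp;lt;- function(s) if (sum(utf8ToInt(s)) %% 2 == 0) 1 else -1
m &amp;lt;- 16
x &amp;lt;- numeric(m)
for (w in c(&amp;quot;good&amp;quot;, &amp;quot;movie&amp;quot;, &amp;quot;good&amp;quot;)) {
  i &amp;lt;- h1(w, m)
  x[i] &amp;lt;- x[i] + h2(w)
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;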

&lt;h3 id=&quot;middle-term-plans&quot;&gt;Middle-term plans&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;add a &lt;strong&gt;&lt;a href=&quot;https://code.google.com/p/word2vec/&quot;&gt;word2vec&lt;/a&gt; wrapper&lt;/strong&gt;. It is strange that the R community still doesn’t have one.&lt;/li&gt;
  &lt;li&gt;add corpus serialization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;long-term-plans&quot;&gt;Long-term plans&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;integrate models the way it is done in &lt;a href=&quot;https://radimrehurek.com/gensim/&quot;&gt;gensim&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;try to implement out-of-core transformations like &lt;a href=&quot;https://radimrehurek.com/gensim/&quot;&gt;gensim&lt;/a&gt; does&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;reasons-why-i-started-develop-tmlite&quot;&gt;Reasons why I started to develop tmlite&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;All conclusions below are based on personal experience, so they may be heavily biased.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first time I used &lt;strong&gt;tm&lt;/strong&gt; was at the end of 2014. I tried to process a collection of text documents that was less than 1 Gb - about 10000 texts. Surprisingly, I wasn’t able to process them on a machine with 16 Gb of RAM! But what is really cool is that R and all its packages are open source, so I started to examine the source code. Unfortunately, I ended up rewriting most of the package. That first version (anyone interested can browse the commit history on github) was quite robust and could handle such tiny-to-medium collections of documents. After that I tried it on some kaggle competitions, but didn’t do any new development, since my work wasn’t related to text analysis and I had no time for it. I also noted that almost all text-mining packages in R have a &lt;strong&gt;tm&lt;/strong&gt; dependency. We will try to develop an alternative.&lt;/p&gt;

&lt;p&gt;About a month ago I started a full redesign (based on the previous experience); by now I have rewritten the core functions in C++ and want to bring an alpha version to the community.&lt;/p&gt;

&lt;p&gt;So why should you not use &lt;strong&gt;tm&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;tm&lt;/strong&gt; has a lot of functions - in fact, the reference manual contains more than 50 pages. But its &lt;strong&gt;API is very messy&lt;/strong&gt;. A lot of packages depend on it, so it is hard to redesign.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;tm&lt;/strong&gt; is not very efficient (in my experience). I found it &lt;strong&gt;very slow&lt;/strong&gt; and, more importantly, &lt;strong&gt;very RAM-unfriendly and RAM-greedy&lt;/strong&gt; (I’ll provide a few examples below). As I understand it, it is designed more for academic researchers than for data science practitioners. It handles metadata perfectly and processes different encodings. The API is very high-level, but the price for that is performance.&lt;/li&gt;
  &lt;li&gt;It can only &lt;strong&gt;handle documents that fit in RAM&lt;/strong&gt;. (To be fair, there is a &lt;code&gt;PCorpus()&lt;/code&gt; function, but it seems it cannot help with Document-Term matrix construction when the size of the documents is larger than RAM - see the examples below. &lt;code&gt;DocumentTermMatrix()&lt;/code&gt; is very RAM-greedy.)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;comparison-with-tm&quot;&gt;Comparison with tm&lt;/h2&gt;

&lt;h3 id=&quot;some-naive-benchmarks-on-document-trem-matrix-construction&quot;&gt;Some naive benchmarks on Document-Term matrix construction&lt;/h3&gt;

&lt;p&gt;Here I’ll provide a simple benchmark which can give some impression of &lt;strong&gt;tmlite&lt;/strong&gt;’s speed compared to &lt;strong&gt;tm&lt;/strong&gt;. For now we assume that the documents are already in memory, so we only need to clean and tokenize the text:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Loading required package: NLP&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;data.table&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tmlite&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
dt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; fread&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/labeledTrainData.tsv&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
txt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dt&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;object.size&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;txt&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; quote &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Mb&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 32.8 Mb&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# 32.8 Mb&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corpus_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; VCorpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;VectorSource&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;txt&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   2.081   0.011   2.095&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;object.size&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; quote &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Mb&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 121.4 Mb&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# 121.4 Mb!!!&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corpus_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tm_map&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; content_transformer&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;simple_preprocess&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  10.761   0.281   6.591&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; DocumentTermMatrix&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; control &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tokenize &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; words&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  15.002   0.740  12.227&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now let’s check the timings for &lt;strong&gt;tmlite&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corp &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_dict_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;src &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; txt&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   preprocess_fun &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_preprocess&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# simple_tokenizer - split string by whitespace&lt;/span&gt;
                   tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# read by batch of 5000 documents&lt;/span&gt;
                   batch_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# do not show progress bar because of knitr&lt;/span&gt;
                   progress &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  10.127   0.079  10.224&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# get in dgTMatrix form, because tm stores dtm matrix in triplet form&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; corp&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;dgTMatrix&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.042   0.008   0.050&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Well, &lt;strong&gt;only two times faster&lt;/strong&gt;. Is it worth the effort? Let’s check another example. Here we will use &lt;a href=&quot;https://d396qusza40orc.cloudfront.net/mmds/datasets/sentences.txt.zip&quot;&gt;data&lt;/a&gt; from the excellent &lt;a href=&quot;https://www.coursera.org/course/mmds&quot;&gt;Mining massive datasets&lt;/a&gt; course. This is quite a large collection of short texts - more than 9 million rows, 500Mb zipped and about 1.4Gb unzipped.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# we will read only small fraction - 200000 rows (~ 42Mb)&lt;/span&gt;
txt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;readLines&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/sentences.txt&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; n &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2e5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;object.size&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;txt&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; quote &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Mb&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 41.7 Mb&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# 41.7 Mb&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# VCorpus is very slow, about 20 sec on my computer&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corpus_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; VCorpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;VectorSource&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;txt&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  19.340   0.204  19.573&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;object.size&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; quote &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Mb&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 749.8 Mb&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# 749.8 Mb!!! wow!&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corpus_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tm_map&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; content_transformer&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;simple_preprocess&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  20.629   1.487  29.161&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# 26 sec to process 42 Mb of text.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But the following is truly absurd. This call forks 2 processes (because it uses mclapply internally), and each process uses 1.3 Gb of RAM: &lt;strong&gt;2.6 Gb of RAM to process a 42 Mb text chunk&lt;/strong&gt;. And it takes more than 50 seconds on my MacBook Pro with the latest Core i7 Intel chip. In fact, it is not possible to process 1 million rows (200 Mb) on my MacBook Pro with 16 Gb of RAM.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; DocumentTermMatrix&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; control &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tokenize &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; words&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  99.256   3.884  53.380&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Compare with &lt;strong&gt;tmlite&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corp &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_dict_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;src &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; txt&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   preprocess_fun &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_preprocess&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# simple_tokenizer - split string by whitespace&lt;/span&gt;
                   tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# read by batch of 5000 documents&lt;/span&gt;
                   batch_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# do not show progress bar because of knitr&lt;/span&gt;
                   progress &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  10.025   0.050  10.081&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# only around 9 sec and 120 Mb of RAM&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tmlite &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; corp&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;dgTMatrix&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.116   0.016   0.133&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# less than 1 second&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So here &lt;strong&gt;tmlite is 8 times faster&lt;/strong&gt; and, what is much more important, &lt;strong&gt;consumes 20 times less RAM&lt;/strong&gt;. On large collections of documents the speedup will be even more significant.&lt;/p&gt;

&lt;h2 id=&quot;document-term-matrix-manipulations&quot;&gt;Document-Term Matrix manipulations&lt;/h2&gt;
&lt;p&gt;In practice it can be useful to remove common and uncommon terms. Both packages provide functions for that: &lt;code&gt;removeSparseTerms()&lt;/code&gt; in &lt;strong&gt;tm&lt;/strong&gt; and &lt;code&gt;dtm_remove_common_terms&lt;/code&gt; in &lt;strong&gt;tmlite&lt;/strong&gt;. Also note that &lt;code&gt;removeSparseTerms()&lt;/code&gt; can only remove uncommon terms, so to be fair we will test only that functionality:&lt;/p&gt;
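&lt;p&gt;To make the comparison concrete, here is a hypothetical sketch of document-frequency based filtering on a toy sparse matrix. The thresholds and variable names are illustrative only and are not the actual internals of either package:&lt;/p&gt;

```r
# Hypothetical sketch: drop terms by document frequency.
# Thresholds and names are illustrative, not either package's internals.
library(Matrix)

# toy 3-document, 3-term document-term matrix
dtm = sparseMatrix(i = c(1, 1, 2, 2, 3),
                   j = c(1, 2, 1, 3, 1),
                   x = 1,
                   dims = c(3, 3),
                   dimnames = list(NULL, c("the", "cat", "sat")))

# share of documents in which each term occurs
doc_freq = colSums(dtm > 0) / nrow(dtm)

common   = 0.9  # drop terms present in more than 90% of documents
uncommon = 0.3  # drop terms present in fewer than 30% of documents

drop_term   = (doc_freq > common) | !(doc_freq >= uncommon)
dtm_reduced = dtm[, !drop_term, drop = FALSE]
colnames(dtm_reduced)  # "cat" "sat"
```

&lt;p&gt;The same idea scales to real corpora, because both &lt;code&gt;colSums&lt;/code&gt; and column subsetting stay within sparse-matrix representations.&lt;/p&gt;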

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tm_reduced &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; removeSparseTerms&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.99&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   1.422   0.104   1.535&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# common = 1 =&amp;gt; do not remove common terms&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tmlite_reduced &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dtm_tmlite &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
               dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;filter_commons_transformer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; term_freq &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;common &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; uncommon &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.975&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.350   0.081   0.431&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;3-5 times faster - not bad. 
Now compare tf-idf transformation:&lt;/p&gt;
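&lt;p&gt;As a reminder of what both functions compute: tf-idf scales each term count by the inverse of how many documents contain that term. Here is a hand-rolled sketch on a toy matrix; the exact weighting and normalization used by &lt;code&gt;weightTfIdf&lt;/code&gt; and &lt;code&gt;tfidf_transformer&lt;/code&gt; may differ in details:&lt;/p&gt;

```r
# Hand-rolled tf-idf on a toy sparse matrix; the exact normalization
# in the benchmarked packages may differ.
library(Matrix)

dtm = sparseMatrix(i = c(1, 1, 2, 2, 3),
                   j = c(1, 2, 1, 3, 2),
                   x = c(2, 1, 1, 3, 1),
                   dims = c(3, 3))

tf    = dtm / rowSums(dtm)         # term frequency, normalized per document
df    = colSums(dtm > 0)           # number of documents containing each term
idf   = log(nrow(dtm) / df)        # inverse document frequency
tfidf = tf %*% Diagonal(x = idf)   # scale column j by idf[j]
```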

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tm_tfidf &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; weightTfIdf&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; normalize &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Warning in weightTfIdf(dtm_tm, normalize = T): empty document(s): 6782
## 26135 26136 26137 26138 26139 26140 26141 26142 26143 26144 26145 27664
## 60895 60896 60897 60898 60899 60900 88953 106921 122685 141442 141443
## 141449 141454 152656&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.246   0.028   0.274&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# timings slightly greater than weightTfIdf, because all transformations are optimized for &lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# the dgCMatrix format, which is the standard for sparse matrices in R&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tmlite_tfidf &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dtm_tmlite &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
               dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tfidf_transformer&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.390   0.091   0.481&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# for dtm in dgCMatrix timings should be equal&lt;/span&gt;
dtm_tmlite_dgc&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  as&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm_tmlite&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;dgCMatrix&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tmlite_tfidf &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dtm_tmlite_dgc &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
               dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tfidf_transformer&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.252   0.049   0.302&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Equal timings. Great (and a surprise for me): within the last year the &lt;strong&gt;tm&lt;/strong&gt; authors have significantly improved its performance!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Working with MS SQL server on non-windows systems</title>
   <link href="http://dsnotes.com/blog/2015/07/16/r-and-mssql"/>
   <updated>2015-07-16T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/2015/07/16/r-and-mssql</id>
   <content type="html">
&lt;p&gt;As far as I know, there are a few options for connecting from R to MS SQL Server:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/RODBC/index.html&quot;&gt;RODBC&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/RJDBC/index.html&quot;&gt;RJDBC&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/agstudy/rsqlserver&quot;&gt;rsqlserver&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But only the second option can be used on &lt;strong&gt;mac&lt;/strong&gt; and &lt;strong&gt;linux&lt;/strong&gt; machines. Here is a nice &lt;a href=&quot;http://stackoverflow.com/questions/14513224/connecting-to-ms-sql-server-from-r-on-mac-linux&quot;&gt;stackoverflow thread&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Most people suggest using the &lt;a href=&quot;https://www.microsoft.com/en-us/download/confirmation.aspx?id=11774&quot;&gt;microsoft sql java driver&lt;/a&gt;. But there is one case where this will not help: &lt;strong&gt;windows domain authentication&lt;/strong&gt;. In this situation the only working solution I found is the excellent &lt;a href=&quot;http://jtds.sourceforge.net/&quot;&gt;jTDS&lt;/a&gt; driver. It not only solves this problem, but also &lt;a href=&quot;http://jtds.sourceforge.net/benchTest.html&quot;&gt;outperforms&lt;/a&gt; the Microsoft JDBC Driver.&lt;/p&gt;

&lt;p&gt;So to use it you have to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Install &lt;a href=&quot;https://cran.r-project.org/web/packages/rJava/&quot;&gt;rJava&lt;/a&gt;. There are a lot of manuals for different OSes on the internet.&lt;/li&gt;
  &lt;li&gt;Install &lt;a href=&quot;https://cran.r-project.org/web/packages/RJDBC/&quot;&gt;RJDBC&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Download jTDS from the &lt;a href=&quot;http://sourceforge.net/projects/jtds/files/&quot;&gt;official site&lt;/a&gt; and unpack it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now you can easily connect to your source:&lt;br /&gt;
&lt;em&gt;(assuming jtds-1.3.1, unpacked into ~/bin)&lt;/em&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;drv &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; JDBC&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;net.sourceforge.jtds.jdbc.Driver&amp;quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
            &lt;span class=&quot;s&quot;&gt;&amp;quot;~/bin/jtds-1.3.1-dist/jtds-1.3.1.jar&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
mssql_addr &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;10.0.0.1&amp;quot;&lt;/span&gt;
mssql_port &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;1433&amp;quot;&lt;/span&gt;
domain &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;YOUR_DOMAIN&amp;quot;&lt;/span&gt;
connection_string &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;paste0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;jdbc:jtds:sqlserver://&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; mssql_addr&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;:&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; mssql_port&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                            &lt;span class=&quot;s&quot;&gt;&amp;quot;;domain=&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; domain&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
conn &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dbConnect&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;drv&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                  connection_string&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                  user &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;user_name&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                  password &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;********&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
query &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;select count(*) from your_db.dbo.your_table&amp;quot;&lt;/span&gt;
cnt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dbGetQuery&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;conn &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; conn&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; statement &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; query&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

</content>
 </entry>
 
 <entry>
   <title>Installing cuda toolkit and related R packages</title>
   <link href="http://dsnotes.com/blog/2015/06/04/installing-cuda-toolkit-and-gputools"/>
   <updated>2015-06-04T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/2015/06/04/installing-cuda-toolkit-and-gputools</id>
   <content type="html">
&lt;p&gt;The main purpose of this post is to keep all the steps of installing the cuda toolkit (and related R packages) in one place. I also hope this may be useful for someone.&lt;/p&gt;

&lt;h2 id=&quot;installing-cuda-toolkit--ubuntu-&quot;&gt;Installing cuda toolkit ( Ubuntu )&lt;/h2&gt;
&lt;p&gt;First of all we need to install the &lt;strong&gt;nvidia cuda toolkit&lt;/strong&gt;. I am on the latest ubuntu 15.04, and found &lt;a href=&quot;http://www.r-tutor.com/gpu-computing/cuda-installation/cuda7.0-ubuntu&quot;&gt;this article&lt;/a&gt; well suited for me. But there are a few additions:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;It is very important to have no nvidia drivers installed beforehand (at first I corrupted my system and had to reinstall it :-( ). So I recommend switching to a real terminal (&lt;code&gt;ctrl + alt + f1&lt;/code&gt;), removing all nvidia stuff with &lt;code&gt;sudo apt-get purge nvidia-*&lt;/code&gt; and then following the steps from the article above.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This will install cuda toolkit and corresponding nvidia drivers.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1410/x86_64/cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo apt-get update
sudo apt-get install cuda&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;After installation we need to modify our &lt;code&gt;.bashrc&lt;/code&gt; file. Add the following lines:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CUDA_HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/usr/local/cuda-7.0
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CUDA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/lib64

&lt;span class=&quot;nv&quot;&gt;PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CUDA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/bin:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PATH&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CUDA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/bin/nvcc:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PATH&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;PATH&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that I added the path to the &lt;code&gt;nvcc&lt;/code&gt; compiler.&lt;/p&gt;

&lt;h2 id=&quot;installing-gputools&quot;&gt;Installing gputools&lt;/h2&gt;
&lt;p&gt;First simply try:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;install.packages&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;gputools&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; repos &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;http://cran.rstudio.com/&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After that I received:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Unsupported gpu architecture ‘compute_10’&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While solving this issue I found this &lt;a href=&quot;https://devtalk.nvidia.com/default/topic/606195/-solved-nvcc-fatal-unsupported-gpu-architecture-compute_21-/&quot;&gt;link&lt;/a&gt; useful. 
I have a gt525m card, which has compute capability 2.1. You can check your GPU’s capabilities &lt;a href=&quot;https://developer.nvidia.com/cuda-gpus&quot;&gt;here&lt;/a&gt;. 
So I downloaded the gputools source package:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; ~
wget http://cran.r-project.org/src/contrib/gputools_0.28.tar.gz
tar -zxvf gputools_0.28.tar.gz&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and replaced the following string&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;NVCC :&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;$(&lt;/span&gt;CUDA_HOME&lt;span class=&quot;k&quot;&gt;)&lt;/span&gt;/bin/nvcc -gencode &lt;span class=&quot;nv&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;compute_10,code&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sm_10 -gencode &lt;span class=&quot;nv&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;compute_13,code&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sm_13 -gencode &lt;span class=&quot;nv&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;compute_20,code&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sm_20 -gencode &lt;span class=&quot;nv&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;compute_30,code&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sm_30&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;in &lt;code&gt;gputools/src/Makefile&lt;/code&gt; with&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;NVCC :&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;$(&lt;/span&gt;CUDA_HOME&lt;span class=&quot;k&quot;&gt;)&lt;/span&gt;/bin/nvcc -gencode &lt;span class=&quot;nv&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;compute_20,code&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sm_21&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Next, gzip it back and install from source:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;install.packages&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;~/gputools.tar.gz&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; repos &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;source&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then I received:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;rinterface.cu:1:14: fatal error: R.h: No such file or directory #include &amp;lt;R.h&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We have to adjust the R header directory location. First of all, look for &lt;code&gt;R.h&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;locate &lt;span class=&quot;se&quot;&gt;\/&lt;/span&gt;R.h&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;replace the string &lt;code&gt;R_INC := $(R_HOME)/include&lt;/code&gt; in &lt;code&gt;gputools/src/config.mk&lt;/code&gt; with the path found:
&lt;code&gt;
R_INC := /usr/share/R/include
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In case we receive an error regarding the shared &lt;code&gt;libcublas.so&lt;/code&gt; library, we also need to create a symlink for &lt;code&gt;libcublas&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;sudo ln -s /usr/local/cuda/lib64/libcublas.so.7.0 /usr/lib/libcublas.so.7.0&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;thanks to this &lt;a href=&quot;http://stackoverflow.com/questions/10808958/why-cant-libcudart-so-4-be-found-when-compiling-the-cuda-samples-under-ubuntu&quot;&gt;thread&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;testing-performance&quot;&gt;Testing performance&lt;/h2&gt;
&lt;p&gt;Here is a simple benchmark:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;gputools&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
N &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e3&lt;/span&gt;
m &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; N&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;N&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; replace &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; nrow &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; N&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dist&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   4.864   0.008   4.874&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;gpuDist&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.640   0.168   0.809&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

</content>
 </entry>
 
 <entry>
   <title>Locality Sensitive Hashing In R Part 1</title>
   <link href="http://dsnotes.com/blog/2015/01/02/locality-sensitive-hashing-in-r-part-1"/>
   <updated>2015-01-02T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/2015/01/02/locality-sensitive-hashing-in-r-part-1</id>
   <content type="html">
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In the next series of posts I will try to explain the basic concepts of the &lt;strong&gt;Locality Sensitive Hashing&lt;/strong&gt; technique.&lt;/p&gt;

&lt;p&gt;Note that I will try to follow a generally functional programming style, so I will use R’s &lt;a href=&quot;https://stat.ethz.ch/R-manual/R-devel/library/base/html/funprog.html&quot;&gt;Higher-Order Functions&lt;/a&gt; instead of the traditional &lt;strong&gt;R’s &lt;em&gt;*apply&lt;/em&gt;&lt;/strong&gt; functions family (I suppose this makes the post more readable for non-R users). I will also use the &lt;strong&gt;brilliant pipe operator&lt;/strong&gt; &lt;code&gt;%&amp;gt;%&lt;/code&gt; from the &lt;a href=&quot;http://cran.r-project.org/web/packages/magrittr/&quot;&gt;magrittr&lt;/a&gt; package. We will start with basic concepts, but end with a very efficient implementation in R (about 100 times faster than the python implementations I found).&lt;/p&gt;
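&lt;p&gt;A tiny illustration of this style, before any LSH-specific code (the toy data is purely illustrative):&lt;/p&gt;

```r
# Higher-order functions plus magrittr's pipe, instead of the *apply family
library(magrittr)

squares = Map(function(x) x * x, 1:5)            # a list: 1, 4, 9, 16, 25
evens   = Filter(function(x) x %% 2 == 0, 1:10)  # 2 4 6 8 10
total   = 1:10 %>% Filter(function(x) x > 5, .) %>% Reduce(`+`, .)
total                                            # 40
```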

&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;
&lt;p&gt;Imagine the following interesting problem. We have two &lt;strong&gt;very large&lt;/strong&gt; social networks (for example &lt;strong&gt;facebook and google+&lt;/strong&gt;), each with hundreds of millions of profiles, and we want to determine which profiles are owned by the same person. One reasonable approach is to assume that such a person has nearly the same, or at least highly overlapping, sets of friends in both networks. One well-known measure of the similarity of two sets is the &lt;a href=&quot;http://en.wikipedia.org/wiki/Jaccard_index&quot;&gt;Jaccard Index&lt;/a&gt;:&lt;br /&gt;
&lt;script type=&quot;math/tex&quot;&gt;J(SET_1, SET_2) = {|SET_1 \cap SET_2|\over |SET_1 \cup SET_2| }&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Set operations are computationally cheap, so this straightforward solution seems quite good. But let’s try to estimate the computation time for duplicate detection among only the people named “John Smith”. Imagine that on average each person has 100 friends:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# for reproducible results&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;set.seed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;seed &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;17&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;microbenchmark&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# we will use brilliant pipe operator %&amp;gt;%&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;magrittr&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
jaccard &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  set_intersection &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;intersect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
  set_union &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;union&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
  &lt;span class=&quot;kr&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set_intersection &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; set_union&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# generate &amp;quot;lastnames&amp;quot;&lt;/span&gt;
lastnames &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;letters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; collapse &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1e5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;unique&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;head&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;lastnames&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [[1]]
## [1] &amp;quot;eyl&amp;quot;
## 
## [[2]]
## [1] &amp;quot;ukm&amp;quot;
## 
## [[3]]
## [1] &amp;quot;fes&amp;quot;
## 
## [[4]]
## [1] &amp;quot;fka&amp;quot;
## 
## [[5]]
## [1] &amp;quot;vuw&amp;quot;
## 
## [[6]]
## [1] &amp;quot;ypg&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;friends_set_1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;lastnames&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; replace &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
friends_set_2 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;lastnames&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; replace &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
microbenchmark&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;jaccard&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;friends_set_1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; friends_set_2&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Unit: microseconds
##                                   expr    min     lq     mean  median
##  jaccard(friends_set_1, friends_set_2) 45.646 47.417 50.72362 48.4045
##       uq     max neval
##  49.9435 150.343   100&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;One operation takes about 50 microseconds on average (on my machine). If we have 100,000 people named &lt;em&gt;John Smith&lt;/em&gt; and we have to compare all pairs, the total computation &lt;strong&gt;will take more than 100 hours&lt;/strong&gt;!&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;hours &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;50&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e-6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;60&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;60&lt;/span&gt;
hours&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 138.8889&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Of course this is unacceptable, because of the &lt;script type=&quot;math/tex&quot;&gt;O(n^2)&lt;/script&gt; complexity of our brute-force algorithm.&lt;/p&gt;

&lt;h2 id=&quot;minhashing&quot;&gt;Minhashing&lt;/h2&gt;
&lt;p&gt;To solve this kind of problem we will use &lt;a href=&quot;http://en.wikipedia.org/wiki/Locality-sensitive_hashing&quot;&gt;Locality-sensitive hashing&lt;/a&gt; - a method of probabilistic dimensionality reduction for high-dimensional data. It provides a good tradeoff between accuracy and computation time and, roughly speaking, has &lt;script type=&quot;math/tex&quot;&gt;O(n)&lt;/script&gt; complexity.&lt;br /&gt;
I will explain one scheme of &lt;strong&gt;LSH&lt;/strong&gt;, called &lt;a href=&quot;http://en.wikipedia.org/wiki/MinHash&quot;&gt;MinHash&lt;/a&gt;.&lt;br /&gt;
The intuition of the method is the following: we will try to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).&lt;br /&gt;
Let’s construct a simple example:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;set1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;SMITH&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;JOHNSON&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;WILLIAMS&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;BROWN&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
set2 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;SMITH&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;JOHNSON&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;BROWN&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
set3 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;THOMAS&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;MARTINEZ&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;DAVIS&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
set_list &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; set2&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; set3&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we have 3 sets to compare and need to identify the profiles related to the same “John Smith”. From these sets we will construct a matrix which encodes the relations between them:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;sets_dict &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;unlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set_list&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;unique&lt;/span&gt;

m &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;f &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; dict&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;as.integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dict &lt;span class=&quot;o&quot;&gt;%in%&lt;/span&gt; set&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
         set_list&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
         MoreArgs &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dict &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; sets_dict&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  &lt;span class=&quot;kp&quot;&gt;do.call&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;what &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# This is equal to more traditional R&amp;#39;s sapply call:&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# m &amp;lt;- sapply(set_list, FUN = function(set, dict) as.integer(dict %in% set), dict = sets_dict, simplify = T)&lt;/span&gt;

&lt;span class=&quot;kp&quot;&gt;dimnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;sets_dict&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;set&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set_list&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; sep &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;_&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##          set_1 set_2 set_3
## SMITH        1     1     0
## JOHNSON      1     1     0
## WILLIAMS     1     0     0
## BROWN        1     1     0
## THOMAS       0     0     1
## MARTINEZ     0     0     1
## DAVIS        0     0     1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let’s call this matrix the &lt;strong&gt;input-matrix&lt;/strong&gt;.
In this representation, the similarity of two sets from the source list is equal to the similarity of the two corresponding columns, computed over their non-zero rows:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;name&lt;/th&gt;
      &lt;th&gt;set_1&lt;/th&gt;
      &lt;th&gt;set_2&lt;/th&gt;
&lt;th&gt;intersection&lt;/th&gt;
      &lt;th&gt;union&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;SMITH&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;JOHNSON&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;WILLIAMS&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;BROWN&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;THOMAS&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MARTINEZ&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DAVIS&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;From the table above we can conclude that the &lt;strong&gt;Jaccard index between set_1 and set_2 is 0.75&lt;/strong&gt;.&lt;br /&gt;
Let’s check:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;jaccard&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; set2&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 0.75&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;column_jaccard &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; c2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  non_zero &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;which&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1 &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; c2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  column_intersect &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;non_zero&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; c2&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;non_zero&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
  column_union &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;non_zero&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;kr&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;column_intersect &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; column_union&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;isTRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;jaccard&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; set2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; column_jaccard&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; m&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;All the magic starts here. Take a random permutation of the rows of the &lt;strong&gt;input-matrix&lt;/strong&gt; &lt;code&gt;m&lt;/code&gt;, and define the &lt;strong&gt;minhash function&lt;/strong&gt; &lt;script type=&quot;math/tex&quot;&gt;h(c)&lt;/script&gt; = the number of the first row in which column &lt;script type=&quot;math/tex&quot;&gt;c == 1&lt;/script&gt;. If we use &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt; &lt;strong&gt;independent&lt;/strong&gt; permutations, we end up with &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt; minhash functions, and we can use them to construct a &lt;strong&gt;signature-matrix&lt;/strong&gt; from the &lt;strong&gt;input-matrix&lt;/strong&gt;. Below we do it not very efficiently, with 2 nested &lt;code&gt;for&lt;/code&gt; loops, but the logic should be very clear.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# for our toy example we will pick N = 4&lt;/span&gt;
N &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;
sm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;NA_integer_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; nrow &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; N&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ncol &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;ncol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
perms &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;NA_integer_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; nrow &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; ncol &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; N&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# calculate indexes for non-zero entries for each column&lt;/span&gt;
non_zero_row_indexes &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; MARGIN &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; FUN &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; which &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kr&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;kr&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; N&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# calculate permutations&lt;/span&gt;
  perm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
  perms&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; i&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; perm
  &lt;span class=&quot;c1&quot;&gt;# fill row of signature matrix&lt;/span&gt;
  &lt;span class=&quot;kr&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;j &lt;span class=&quot;kr&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;ncol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    sm&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;i&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; j&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  &lt;span class=&quot;kp&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;perm&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;non_zero_row_indexes&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;j&lt;span class=&quot;p&quot;&gt;]]])&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;sm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##      [,1] [,2] [,3]
## [1,]    3    3    1
## [2,]    1    1    3
## [3,]    1    1    2
## [4,]    1    1    4&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
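The two nested `for` loops above can also be collapsed into a vectorized sketch. The `minhash_signatures` helper below is hypothetical (it is not part of the post's code), but it computes the same kind of signature matrix from the same toy input-matrix:

```r
# toy input-matrix from above: rows = last names, columns = sets
m <- cbind(set_1 = c(1, 1, 1, 1, 0, 0, 0),
           set_2 = c(1, 1, 0, 1, 0, 0, 0),
           set_3 = c(0, 0, 0, 0, 1, 1, 1))

# hypothetical helper: build an N-row signature matrix from N random permutations
minhash_signatures <- function(m, N) {
  # non-zero row indexes for each column, computed once
  non_zero <- apply(m, MARGIN = 2, FUN = function(x) which(x != 0))
  # each iteration yields one row of the signature matrix
  t(sapply(1:N, function(i) {
    perm <- sample(nrow(m))
    vapply(non_zero, function(idx) min(perm[idx]), integer(1))
  }))
}

set.seed(17)
sm2 <- minhash_signatures(m, N = 4)
```

The result has one row per permutation and one column per set, just like `sm` above.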

&lt;p&gt;You can see the &lt;strong&gt;signature-matrix&lt;/strong&gt; we obtain after the “minhash transformation”. Permutations and corresponding signatures are marked with the same colors:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;perm_1&lt;/th&gt;
      &lt;th&gt;perm_2&lt;/th&gt;
      &lt;th&gt;perm_3&lt;/th&gt;
      &lt;th&gt;perm_4&lt;/th&gt;
      &lt;th&gt;set_1&lt;/th&gt;
      &lt;th&gt;set_2&lt;/th&gt;
      &lt;th&gt;set_3&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;4 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;1 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;4 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;6 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;3 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;4 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;1 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;1 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;7 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;6 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;6 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;2 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;6 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;2 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;7 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;3 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;5 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;3 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;2 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;5 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;2 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;5 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;3 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;7 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;1 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;7 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;5 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;4 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;set_1&lt;/th&gt;
      &lt;th&gt;set_2&lt;/th&gt;
      &lt;th&gt;set_3&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;3&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;3&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;3&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;2&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;4&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;You can notice that the signatures of set_1 and set_2 are very similar, while the signature of set_3 is dissimilar from both.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;jaccard_signatures &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; c2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  column_intersect &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1 &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; c2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  column_union &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;kr&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;column_intersect &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; column_union&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;jaccard_signatures&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;sm&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; sm&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;jaccard_signatures&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;sm&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; sm&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 0&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The intuition is very straightforward. Let’s look down the permuted columns &lt;script type=&quot;math/tex&quot;&gt;c_1&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;c_2&lt;/script&gt; until we detect the first &lt;strong&gt;1&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If we find ones in both columns, i.e. (1, 1), then &lt;script type=&quot;math/tex&quot;&gt;h(c_1) = h(c_2)&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;In the case of (0, 1) or (1, 0), &lt;script type=&quot;math/tex&quot;&gt;h(c_1) \neq h(c_2)&lt;/script&gt;. So the probability over all permutations of rows that &lt;script type=&quot;math/tex&quot;&gt;h(c_1) = h(c_2)&lt;/script&gt; is the same as &lt;script type=&quot;math/tex&quot;&gt;J(c_1, c_2)&lt;/script&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moreover, there exist theoretical guarantees for the estimation of Jaccard similarity: for any constant &lt;script type=&quot;math/tex&quot;&gt;\varepsilon &gt; 0&lt;/script&gt; there is a constant &lt;script type=&quot;math/tex&quot;&gt;k = O(1/\varepsilon^2)&lt;/script&gt;
such that the expected error of the estimate is at most &lt;script type=&quot;math/tex&quot;&gt;\varepsilon&lt;/script&gt;.&lt;/p&gt;
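We can check this guarantee empirically: the fraction of permutations on which two sets receive the same minhash converges to their true Jaccard index as the number of permutations grows. A rough simulation (the `estimate` helper is illustrative, not from the post):

```r
set.seed(17)
# two sets with true Jaccard similarity 3/4 (3 shared elements out of 4 total)
x <- c(1, 2, 3, 4)
y <- c(1, 2, 3)
universe <- union(x, y)

# illustrative helper: share of N random permutations with equal minhashes
estimate <- function(N) {
  mean(replicate(N, {
    perm <- sample(universe)                      # a random row ordering
    # minhash = position of the first element of the set in this ordering
    min(match(x, perm)) == min(match(y, perm))
  }))
}

estimate(10000)  # should be close to the true value 0.75
```

More permutations shrink the expected error, at the cost of a larger signature.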

&lt;h3 id=&quot;implementation-and-bottlenecks&quot;&gt;Implementation and bottlenecks&lt;/h3&gt;
&lt;p&gt;Suppose the &lt;strong&gt;input-matrix&lt;/strong&gt; is very big, say &lt;code&gt;1e9&lt;/code&gt; rows. Permuting 1 billion rows is computationally quite hard. Plus, you need to store and access all these entries. It is common to use the following scheme instead:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Pick &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt; independent hash functions &lt;script type=&quot;math/tex&quot;&gt;h_i(c)&lt;/script&gt; instead of &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt; permutations, &lt;script type=&quot;math/tex&quot;&gt;i = 1..N&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;For each column &lt;script type=&quot;math/tex&quot;&gt;c&lt;/script&gt; and each hash function &lt;script type=&quot;math/tex&quot;&gt;h_i&lt;/script&gt;, keep a “slot” &lt;script type=&quot;math/tex&quot;&gt;M(i, c)&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;script type=&quot;math/tex&quot;&gt;M(i, c)&lt;/script&gt; will become the smallest value of &lt;script type=&quot;math/tex&quot;&gt;h_i(r)&lt;/script&gt; for which column &lt;script type=&quot;math/tex&quot;&gt;c&lt;/script&gt; has 1 in row &lt;script type=&quot;math/tex&quot;&gt;r&lt;/script&gt;. I.e., &lt;script type=&quot;math/tex&quot;&gt;h_i(r)&lt;/script&gt; gives the order of rows for the &lt;script type=&quot;math/tex&quot;&gt;i^{th}&lt;/script&gt; permutation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we end up with the following &lt;strong&gt;ALGORITHM(1)&lt;/strong&gt; from the excellent &lt;a href=&quot;http://www.mmds.org&quot;&gt;Mining of Massive Datasets&lt;/a&gt; book:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;for each row r do begin
  for each hash function hi do
    compute hi(r);
  for each column c
    if c has 1 in row r
      for each hash function hi do
        if hi(r) is smaller than M(i, c) then
          M(i, c) := hi(r);
end;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
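&lt;p&gt;As a sanity check, &lt;strong&gt;ALGORITHM(1)&lt;/strong&gt; can be sketched in a few lines of plain R. This is a toy dense implementation for illustration only; the hash functions of the form h_i(r) = (a_i * r + b_i) mod p and all the sizes below are my own choices, not the book’s:&lt;/p&gt;

```r
set.seed(1)
n_rows = 7L; n_cols = 4L; n_hash = 10L
# toy 0/1 input-matrix: columns are sets, rows are universe elements
m = matrix(rbinom(n_rows * n_cols, 1L, 0.5), nrow = n_rows)

p = 13L                              # a prime not smaller than n_rows
a = sample.int(p - 1L, n_hash)       # random coefficients for h_i(r) = (a*r + b) mod p
b = sample.int(p, n_hash) - 1L

# signature matrix M(i, c); every slot starts at "infinity"
M = matrix(Inf, nrow = n_hash, ncol = n_cols)

for (r in seq_len(n_rows)) {
  h = (a * r + b) %% p               # compute h_i(r) for all i at once
  for (cc in seq_len(n_cols)) {
    if (m[r, cc] == 1L) {
      M[, cc] = pmin(M[, cc], h)     # keep the smallest hash value seen so far
    }
  }
}
M                                    # dense signature-matrix
```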

&lt;p&gt;I &lt;strong&gt;highly recommend&lt;/strong&gt; watching the video about minhashing from the Stanford &lt;a href=&quot;https://class.coursera.org/mmds-001&quot;&gt;Mining Massive Datasets&lt;/a&gt; course.&lt;/p&gt;

&lt;div align=&quot;center&quot;&gt;&lt;iframe width=&quot;854&quot; height=&quot;510&quot; src=&quot;http://www.youtube.com/embed/pqZh-Uu9VSk&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;&lt;/div&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;Let’s summarize what we have learned in the first part of the tutorial:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We can construct the &lt;strong&gt;input-matrix&lt;/strong&gt; from a given list of sets. But we didn’t actually exploit the fact that the &lt;strong&gt;input-matrix&lt;/strong&gt; is &lt;strong&gt;very sparse&lt;/strong&gt;, and constructed it as a regular dense R matrix. This is very inefficient in both computation and RAM.&lt;/li&gt;
  &lt;li&gt;We can construct a &lt;strong&gt;dense&lt;/strong&gt; signature-matrix from the &lt;strong&gt;input-matrix&lt;/strong&gt;. But we only implemented an algorithm based on permutations, which is also not very efficient.&lt;/li&gt;
  &lt;li&gt;We understand the &lt;strong&gt;theoretical guarantees&lt;/strong&gt; of our algorithm. They are proportional to the number of &lt;strong&gt;independent&lt;/strong&gt; hash functions we pick. But how will we actually construct this family of functions? How can we efficiently increase the number of functions in our family when needed?&lt;/li&gt;
  &lt;li&gt;Our &lt;strong&gt;signature-matrix&lt;/strong&gt; has a small &lt;strong&gt;fixed&lt;/strong&gt; number of rows. Each column represents an input set and &lt;script type=&quot;math/tex&quot;&gt;J(c_1, c_2)&lt;/script&gt; ~ &lt;script type=&quot;math/tex&quot;&gt;J(set_1, set_2)&lt;/script&gt;. But we &lt;strong&gt;still have &lt;script type=&quot;math/tex&quot;&gt;O(n^2)&lt;/script&gt; complexity&lt;/strong&gt;, because we need to compare each pair to find duplicate candidates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next posts I will describe how to efficiently construct and store the &lt;strong&gt;input-matrix&lt;/strong&gt; in &lt;strong&gt;sparse&lt;/strong&gt; format.
Then we will discuss how to &lt;strong&gt;construct a family of hash functions&lt;/strong&gt;. After that we will implement a &lt;strong&gt;fast vectorized&lt;/strong&gt; version of &lt;strong&gt;ALGORITHM(1)&lt;/strong&gt;. And finally we will see how to use &lt;strong&gt;Locality Sensitive Hashing&lt;/strong&gt; to determine candidate pairs for similar sets in &lt;script type=&quot;math/tex&quot;&gt;O(n)&lt;/script&gt; time. Stay tuned!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Rmongodb 1.8.0</title>
   <link href="http://dsnotes.com/blog/2014/11/02/rmongodb-1.8.0"/>
   <updated>2014-11-02T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/2014/11/02/rmongodb-1.8.0</id>
   <content type="html">
&lt;p&gt;Today I’m introducing a new version of rmongodb (which I have started to maintain) – v1.8.0. Install it from github:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;devtools&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
install_github&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;mongosoup/rmongodb@v1.8.0&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The release version will be uploaded to CRAN shortly.
This release brings a lot of improvements to rmongodb:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Now rmongodb correctly handles arrays.
    &lt;ul&gt;
      &lt;li&gt;&lt;code&gt;mongo.bson.to.list()&lt;/code&gt; rewritten from scratch. R’s &lt;em&gt;unnamed lists&lt;/em&gt; are treated as arrays, &lt;em&gt;named lists&lt;/em&gt; as objects. It also has an option controlling whether to try to simplify vanilla lists to arrays or not.&lt;/li&gt;
      &lt;li&gt;&lt;code&gt;mongo.bson.from.list()&lt;/code&gt;  updated.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;code&gt;mongo.cursor.to.list()&lt;/code&gt; rewritten, with slightly &lt;strong&gt;changed behavior&lt;/strong&gt; – it doesn’t produce any type coercions while fetching data from the cursor.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;mongo.aggregation()&lt;/code&gt; has new options to match MongoDB 2.6+ features. Also, the second argument is now called &lt;em&gt;pipeline&lt;/em&gt; (as it is called in the MongoDB command).&lt;/li&gt;
  &lt;li&gt;new function &lt;code&gt;mongo.index.TTLcreate()&lt;/code&gt; – creates indexes with a “time to live” property.&lt;/li&gt;
  &lt;li&gt;R’s &lt;code&gt;NA&lt;/code&gt; values are now converted into MongoDB &lt;code&gt;null&lt;/code&gt; values.&lt;/li&gt;
  &lt;li&gt;many bug fixes (including trouble with installation on Windows) – see the &lt;a href=&quot;https://github.com/mongosoup/rmongodb/issues?q=milestone%3A1.8.0+is%3Aclosed&quot;&gt;full list&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
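&lt;p&gt;To make the array-vs-object distinction from item 1 concrete, here is a tiny plain-R illustration (it needs no MongoDB connection; the variable names are mine). As described above, rmongodb decides between a BSON array and a BSON object by whether the list has names:&lt;/p&gt;

```r
arr_like = list("string", 3.14, 42L)   # unnamed list: treated as a BSON array
obj_like = list(mol = 42, pi = 3.14)   # named list: treated as a BSON object

is.null(names(arr_like))    # TRUE  - would map to an array
is.null(names(obj_like))    # FALSE - would map to an object
```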

&lt;p&gt;I want to highlight some of the changes.&lt;br /&gt;
The &lt;strong&gt;first and most important&lt;/strong&gt; is that rmongodb now correctly handles arrays. This issue was very annoying for many users (including me :-). Moreover, about half of the rmongodb-related questions on &lt;a href=&quot;http://stackoverflow.com/questions/tagged/rmongodb&quot;&gt;stackoverflow&lt;/a&gt; were caused by it. In the new version of the package, &lt;code&gt;mongo.bson.to.list()&lt;/code&gt; is rewritten from scratch and &lt;code&gt;mongo.bson.from.list()&lt;/code&gt; is fixed. I have tested the new behaviour heavily and everything works smoothly. Still, it’s quite a big internal change, because these functions are workhorses for many other high-level rmongodb functions. Please test it; your &lt;em&gt;feedback is very welcome&lt;/em&gt;. For example, here is the conversion of a complex JSON document into BSON using &lt;code&gt;mongo.bson.from.JSON()&lt;/code&gt; (which internally calls &lt;code&gt;mongo.bson.from.list()&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;rmongodb&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
json_string &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;{&amp;quot;_id&amp;quot;: &amp;quot;dummyID&amp;quot;, &amp;quot;arr&amp;quot;:[&amp;quot;string&amp;quot;,3.14,[1,&amp;quot;2&amp;quot;,[3],{&amp;quot;four&amp;quot;:4}],{&amp;quot;mol&amp;quot;:42}]}&amp;#39;&lt;/span&gt;
bson &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; mongo.bson.from.JSON &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;json_string&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will produce the following MongoDB document:
&lt;code&gt;
{&quot;_id&quot;: &quot;dummyID&quot;, &quot;arr&quot;:[&quot;string&quot;,3.14,[1,&quot;2&quot;,[3],{&quot;four&quot;:4}],{&quot;mol&quot;:42}]}  
&lt;/code&gt;&lt;br /&gt;
The &lt;strong&gt;second one&lt;/strong&gt; is that &lt;code&gt;mongo.cursor.to.list()&lt;/code&gt; has new behaviour: it returns a plain list of objects without any coercion. Each element of the list corresponds to a document of the underlying query result. An additional improvement is that &lt;code&gt;mongo.cursor.to.list()&lt;/code&gt; uses R’s &lt;em&gt;environments&lt;/em&gt; to avoid extra copying, so it is now much more efficient than the previous version (especially when fetching a lot of records from MongoDB).&lt;/p&gt;

&lt;p&gt;In the next few releases I plan to upgrade the underlying &lt;a href=&quot;https://github.com/mongodb/mongo-c-driver-legacy&quot;&gt;mongo-c-driver-legacy&lt;/a&gt; to the latest version &lt;strong&gt;0.8.1&lt;/strong&gt;.&lt;/p&gt;
</content>
 </entry>
 
 
</feed>