Today I will start to publish series of posts about experiments on english wikipedia. As I said before, text2vec is inspired by gensim – well designed and quite efficient python library for topic modeling and related NLP tasks. Also I found very useful Radim’s posts, where he tried to evaluate some algorithms on english wikipedia dump. This dataset is rather big. For example, dump for 2015-10 (which will be used below) is 12gb bzip2 compressed file. In uncompressed form it takes about 50gb. So I can’t call it a “toy” dataset 🙂 You can download original files here. We are interested in file which ends with “pages-articles.xml.bz2”.

All evaluation and timings were done on my macbook laptop with intel core i7 cpu and 16gb of ram.

You can find all the code in the post repository.

Preparation

After getting enwiki dump we should clean it – remove wiki xml markup. I didn’t implement this stage in text2vec, so we will use gensim’s scripts – and especially file prepare_shootout.py. It is not very hard to implement it in R, but this is not top priority for me at the moment. So if anybody is willing to help – please see this issue.

After cleaning we will have “title_tokens.txt.gz” file, which represents wikipedia articles – one article per line. Also each line consists of two tab-separated("t") parts – title of the article and text of the article. Texts consists of space-separated (" ") words in lowercase.

R I/O tricks

R’s base::readLines() is very generic function to read lines of characters from files/connections. And because ot that, readLines() is very slow. So in text2vec I use readr::read_lines() which more then 10x faster. readr is a relatively new package and it has one significant drawback – it doesn’t have streaming API. This means you can’t read file line-by-line – you can only read whole file in a single function call. Sometimes this can become an issue, but usually not – user can manually split big file into chunks using command line tools and work with them. Moreover, if your perform analysis on really large amounts of data, you probably use Apache Spark/Hadoop to prepare input. And usually data is stored in chunks of 64/128Mb in hdfs, so it is very natural to work with such chunks instead of single file.

For this post, I splitted title_tokens.txt.gz into 100mb chunks using split command utility:

gunzip -c title_tokens.txt.gz | split --line-bytes=100m --filter='gzip --fast > ~/Downloads/datasets/$FILE.gz'

If you are on OS X, install coreutils first: brew install coreutils and use gsplit command:

gunzip -c title_tokens.txt.gz | gsplit --line-bytes=100m --filter='gzip --fast > ~/Downloads/datasets/$FILE.gz'

In all the code below we will use title_tokens.txt.gz file as input for gesnim and title_tokens_splits/ directory as input for text2vec.

Word embeddings

Here I want to demonstrate how to use text2vec’s GloVe implementation and briefly compare its performance with word2vec. Originally I had plans to implement word2vec, but after reviewing GloVe paper, I changed my mind. If you still haven’t read it, I strongly recommend to do that.

So, this post has several goals:

Demonstrate how to process large collections of documents (that don’t fit into RAM) with text2vec.
Provide tutorial on text2vec GloVe word embeddings functionality.
Compare text2vec GloVe and gensim word2vec in terms of:
1. accuracy
2. execution time
3. RAM consumption
Briefly highlight advantages and drawbacks of current implementation. (I’ll write separate post with more details about technical aspects.)

Baseline

Here we will follow excellent Radim’s Making sense of word2vec post and try to replicate his results.

Just to remind results

Warning: package 'ggplot2' was built under R version 3.2.3

plot of chunk unnamed-chunk-3

You can find corresponding original repository here.

Modifications

I made a few minor modifications in Radim’s code.

I don’t evaluate glove-python for the following reasons:
1. Radim uses dense numpy matrix to store cooccurencies. While it is great for 30K vocabulary (float32 dense matrix occupies ~ 3.6gb and it takes less time to fill it), it is not appropriate for larger vocabularies (for example float32 matrix for 100K vocabulary will occupy ~ 40gb).
2. Orginal glove-python creates sparse cooccurence matrix, but for some reason it has very poor performance (accuracy on analogue task ~1-2%). I’m not very familiar with python, so can’t figure out what is wrong. If somebody can fix this issue – let me know, I would be happy to add glove-python to this comparison.
Construct vocabulary from top 30k words produced by text2vec vocabulary builder. gensim takes into account title of the article, which can contain upper-case words, punctuation, etc. I found that models which are based on vocabulary constructed from only articles body (not incuding title) are more accurate. This is true for both, GloVe and word2vec.

Building the model

I will focus on text2vec details here, because gensim word2vec code is almost the same as in Radim’s post (again – all code you can find in this repo).

Install test2vec from github:

devtools::install_github('dselivanov/text2vec')

Vocabulary

First of all we need to build a vocabulary:

library(text2vec)
# create iterator over files in directory
it <- idir('~/Downloads/datasets/splits/')
# create iterator over tokens
it2 <- itoken(it, 
              preprocess_function = function(x) 
                str_split(x, fixed("t")) %>% 
                # select only the body of the article
                sapply(.subset2, 2), 
              tokenizer = function(x) str_split(x, fixed(" ")))
vocab <- vocabulary(it2)

On my machine it takes about 1150 sec, while gensim gensim.corpora.Dictionary() takes about 2100 sec. RAW I/O is about ~ 150 sec. plot of chunk unnamed-chunk-6

Pruning vocabulary

We received all unique words and corresponding statistics:

str(vocab)
##List of 2
## $ vocab:'data.frame':	8306153 obs. of  4 variables:
##  ..$ terms          : chr [1:8306153] "bonnerj" "beerworthc" "danielst" "anchaka" ...
##  ..$ terms_counts   : int [1:8306153] 1 1 1 1 1 1 1 1 1 1 ...
##  ..$ doc_counts     : int [1:8306153] 1 1 1 1 1 1 1 1 1 1 ...
##  ..$ doc_proportions: num [1:8306153] 2.55e-07 2.55e-07 2.55e-07 2.55e-07 2.55e-07 ...
## $ ngram: Named int [1:2] 1 1
##  ..- attr(*, "names")= chr [1:2] "ngram_min" "ngram_max"
## - attr(*, "class")= chr "text2vec_vocabulary"

But we are interested in only frequent words, so we should filter out rare words. text2vec provides prune_vocabulary() function which has many useful options for that:

TOKEN_LIMIT = 30000L
# filter out tokens which are represented at least in 30% of documents
TOKEN_DOC_PROPORTION = 0.3
pruned_vocab <- prune_vocabulary(vocabulary = vocab, 
                                     doc_proportion_max = TOKEN_DOC_PROPORTION, 
                                     max_number_of_terms = TOKEN_LIMIT)
# save to csv to use in gensim word2vec
write.table(x = data.frame("word" = pruned_vocab$vocab$terms, "id" = 0:(TOKEN_LIMIT - 1)), 
            file = "/path/to/destination/dir/pruned_vocab.csv",
            quote = F, sep = ',', row.names = F, col.names = F)

Corpus construction

Now we have vocabulary and can construct term-cooccurence matrix (tcm).

WINDOW = 10L
# create iterator over files in directory
it <- idir('~/Downloads/datasets/splits/')
# create iterator over tokens
it2 <- itoken(it, 
              preprocess_function = function(x) 
                str_split(x, fixed("t")) %>% 
                # select only the body of the article
                sapply(.subset2, 2), 
              tokenizer = function(x) str_split(x, fixed(" ")))
# create_vocab_corpus can construct documen-term matrix and term-cooccurence matrix simultaneously
# here we are not interesting in documen-term matrix, so set `grow_dtm = FALSE`
corpus <- create_vocab_corpus(it2, vocabulary = pruned_vocab, grow_dtm = FALSE, skip_grams_window = WINDOW)
# in this call, we wrap std::unordered_map into R's dgTMatrix
tcm <- get_tcm(corpus)

Operation above takes about 80 minutes on my machine and at peak consumes about 11gb of RAM.

Short note on memory consumption

At the moment, during corpus construction, text2vec keeps entire term-cooccurence matrix in memory. In future versions it can be changed (Quite easily via simple map-reduce style algorithm. Exactly the same way it done in original Stanford implementation).

As you can see memory consumption is rather high. But if we do some basic calculations we will realize that memory consumption is not so high:

Internally text2vec stores tcm as std::unordered_map<std::pair<uint32_t, uint32_t>, float>.
We store only elements which are above main diagonal, because our matrix is symmetric.
tcm above consists of ~ 200e6 elements. Matrix is quite dense – ~ 22% non-zero elements (above diagonal! in case of full storage, matrix will be ~ 44% dense).

So, 200e6 * (4 + 4 + 8) = ~ 3.2 gb – only memory to store our matrix in sparse triplet form using preallocated vectors. Also we should add usual 3-4x std::unordered_map overhead and memory allocated for wrapping unordered_map into R sparse triplet dgTMatrix.

GloVe training

We perform GloVe fitting using AdaGrad – stochastic gradient descend with per-feature adaptive learning rate. Also, fitting is done in fully parallel and asynchronous manner ( see Hogwild! paper ), so it can benefit from machines with multiple cores. In my tests I achieved almost 8x speedup on 8 core machine on the discribed above wikipedia dataset.

Now we are ready to train our GloVe model. Here we will perform maximum 20 iterations. Also we will track our global cost and its improvement over iterations. We will stop fitting when improvement (in relation to previous epoch) will become smaller than given threshold – convergence_threshold.

DIM = 600L
X_MAX = 100L
WORKERS = 4L
NUM_ITERS = 20L
CONVERGENCE_THRESHOLD = 0.005
LEARNING_RATE = 0.15
# explicitly set number of threads
RcppParallel::setThreadOptions(numThreads = WORKERS)
fit <- glove(tcm = tcm, 
             word_vectors_size = DIM, 
             num_iters = NUM_ITERS,
             learning_rate = LEARNING_RATE,
             x_max = X_MAX, 
             shuffle_seed = 42L, 
             # we will stop if global cost will be reduced less then 0.5% then previous SGD iteration
             convergence_threshold = CONVERGENCE_THRESHOLD)

This takes about 431 minutes on my machine and stops on 20 iteration (no early stopping):

2015-12-01 06:37:27 – epoch 20, expected cost 0.0145

Accuracy on analogue dataset is 0.759:

words <- rownames(tcm)
m <- fit$word_vectors$w_i + fit$word_vectors$w_j
rownames(m) <-  words
questions_file <- '~/Downloads/datasets/questions-words.txt'
qlst <- prepare_analogue_questions(questions_file, rownames(m))
res <- check_analogue_accuracy(questions_lst = qlst, m_word_vectors = m)

2015-12-01 06:48:23 – capital-common-countries: correct 476 out of 506, accuracy = 0.9407
2015-12-01 06:48:27 – capital-world: correct 2265 out of 2359, accuracy = 0.9602
2015-12-01 06:48:28 – currency: correct 4 out of 86, accuracy = 0.0465
2015-12-01 06:48:31 – city-in-state: correct 1828 out of 2330, accuracy = 0.7845
2015-12-01 06:48:32 – family: correct 272 out of 306, accuracy = 0.8889
2015-12-01 06:48:33 – gram1-adjective-to-adverb: correct 179 out of 650, accuracy = 0.2754
2015-12-01 06:48:33 – gram2-opposite: correct 131 out of 272, accuracy = 0.4816
2015-12-01 06:48:34 – gram3-comparative: correct 806 out of 930, accuracy = 0.8667
2015-12-01 06:48:35 – gram4-superlative: correct 279 out of 506, accuracy = 0.5514
2015-12-01 06:48:37 – gram5-present-participle: correct 445 out of 870, accuracy = 0.5115
2015-12-01 06:48:39 – gram6-nationality-adjective: correct 1364 out of 1371, accuracy = 0.9949
2015-12-01 06:48:41 – gram7-past-tense: correct 836 out of 1406, accuracy = 0.5946
2015-12-01 06:48:42 – gram8-plural: correct 833 out of 1056, accuracy = 0.7888
2015-12-01 06:48:43 – gram9-plural-verbs: correct 341 out of 600, accuracy = 0.5683
2015-12-01 06:48:43 – OVERALL ACCURACY = 0.7593

Note, that sometimes AdaGrad converges to poorer local minima with larger cost. This means model will produce less accurate predictions. For example in some experiments while writing this post I stopped with cost = 0.190 and accuracy = ~ 0.72. Also fitting can be sensitive to initial learning rate, some experiments still needed.

Training word2vec takes 401 minutes and accuracy = 0.687.

As we can see, GloVe shows significantly better accuaracy.

plot of chunk unnamed-chunk-12

Closer look to resources usage:

plot of chunk unnamed-chunk-13

Faster training

If you are more interested in training time you can do the following:

Make more zeros in tcm by removing too rare cooccurences:

ind <- tcm@x >= 1
tcm@x <- tcm@x[ind]
tcm@i <- tcm@i[ind]
tcm@j <- tcm@j[ind]

10 iterations with lower word vector dimensions (DIM = 300 for example).

    DIM <- 300L
    NUM_ITERS <- 10
    fit <- glove(tcm = tcm, 
                 word_vectors_size = DIM, 
                 num_iters = NUM_ITERS,
                 learning_rate = 0.1,
                 x_max = X_MAX, 
                 shuffle_seed = 42L, 
                 max_cost = 10,
                 # we will stop if global cost will be reduced less then 1% then previous SGD iteration
                 convergence_threshold = 0.01)
    words <- rownames(tcm)
    m <- fit$word_vectors$w_i + fit$word_vectors$w_j
    rownames(m) <-  words
    questions_file <- '~/Downloads/datasets/questions-words.txt'
    qlst <- prepare_analogue_questions(questions_file, rownames(m))
    res <- check_analogue_accuracy(questions_lst = qlst, m_word_vectors = m)

Training takes 50 minutes on 4-core machine and get ~68% accuracy:

2015-11-30 15:13:51 – capital-common-countries: correct 482 out of 506, accuracy = 0.9526
2015-11-30 15:13:54 – capital-world: correct 2235 out of 2359, accuracy = 0.9474
2015-11-30 15:13:54 – currency: correct 1 out of 86, accuracy = 0.0116
2015-11-30 15:13:57 – city-in-state: correct 1540 out of 2330, accuracy = 0.6609
2015-11-30 15:13:57 – family: correct 247 out of 306, accuracy = 0.8072
2015-11-30 15:13:58 – gram1-adjective-to-adverb: correct 142 out of 650, accuracy = 0.2185
2015-11-30 15:13:58 – gram2-opposite: correct 87 out of 272, accuracy = 0.3199
2015-11-30 15:13:59 – gram3-comparative: correct 663 out of 930, accuracy = 0.7129
2015-11-30 15:14:00 – gram4-superlative: correct 171 out of 506, accuracy = 0.3379
2015-11-30 15:14:01 – gram5-present-participle: correct 421 out of 870, accuracy = 0.4839
2015-11-30 15:14:03 – gram6-nationality-adjective: correct 1340 out of 1371, accuracy = 0.9774
2015-11-30 15:14:04 – gram7-past-tense: correct 608 out of 1406, accuracy = 0.4324
2015-11-30 15:14:06 – gram8-plural: correct 771 out of 1056, accuracy = 0.7301
2015-11-30 15:14:06 – gram9-plural-verbs: correct 266 out of 600, accuracy = 0.4433
2015-11-30 15:14:06 – OVERALL ACCURACY = 0.6774

plot of chunk unnamed-chunk-16

plot of chunk unnamed-chunk-17

Summary

Advantages

As we see text2vec’s GloVe implementation looks like a good alternative to word2vec and outperforms it in terms of accuracy and running time (we can pick a set of parameters on which it will be both faster and more accurate).
Early stopping. We can stop training when improvements become small.
tcm is reusable. May be it is more fair to subtract timings for tcm_creation from benchmarks above.
Incremental fitting. You can easily adjust tcm with new data and continue fitting. And it will converge very quickly.
text2vec works on OS X, Linux and even Windows without any tricks/adjustments/manual configuration. Thanks to Intel Thread Building Blocks and RcppParallel. It was a little bit simpler to program AdaGrad using OpenMP (as I actually did in my first attempt), but this leads to issues at installation time, especially on OS X and Windows machines.

Drawbacks and what can be improved

One drawback – it uses a lot of memory (in contrast to gensim which is very memory-friendly. I was very impressed.). But this is natural – the fastest way to costruct tcm is to keep it in RAM as hash map and perform cooccurence increments in a global manner. Also note that we build tcm on top 30000 terms. Because of that it is very dense. I tried to build model on top 100000 terms and had no problems on machine with 32gb RAM. Matrix was much more sparse – ~ 4% of non zero elements. Anyway, one can implement simple map-reduce style algorithm to construct tcm. (using files as it done in original Stanford implemention) and then fit model in streaming manner.
Sometimes model is quite sensitive to initial learning rate.

GloVe vs word2vec revisited.