updated 2016-03-31 – a few functions renamed
updated 2016-10-07 – see updated tutorial for text2vec 0.4
Today I’m pleased to announce a preview of the new version of text2vec. It is located in the 0.3 development branch, but very soon (probably in about a week) it will be merged into master.
To reproduce the examples below, please install text2vec@0.3 from GitHub:
devtools::install_github('dselivanov/text2vec@0.3')
I’m also waiting for feedback from text2vec users. Please spend a few minutes on these questions:
- What APIs are not clear / not intuitive?
- What functionality is missing?
- Do you have any problems with speed / RAM usage?
Overview
In two words: text2vec became faster and more user-friendly. While working on this version I barely touched the underlying core C++ code and focused on high-level features and usability. First I will briefly describe the main improvements and then provide a full-featured example.
In this post I would like to highlight the following improvements:
- important bugfix
- dtm keeps document ids as rownames
- several API breaks – some functions removed, some renamed and some have different default arguments
- performance improvements – all core functions have parallel mode
The full list of features and changes is available on GitHub, marked with the 0.3 tag.
Bugfix
There was one significant bug: when the last document had no terms (at least none from the vocabulary), i.e. the last row of dtm was all zeros, the get_dtm() function omitted this last row. As a result, dtm had fewer rows than the number of documents in corpus. Now fixed.
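The post does not say what caused the bug, so the following base R sketch is only an assumption about how this kind of bug typically arises: if a matrix is built from (row, column, value) triplets and the row count is inferred from the largest row index seen, a trailing document without terms silently disappears. The helper `build_dense` and the toy data are hypothetical and are not text2vec code.

```r
# Hypothetical illustration, not text2vec's actual code.
# Triplets (row, column, count) for 3 documents; document 3 has no known terms.
n_docs <- 3L
rows <- c(1L, 1L, 2L)
cols <- c(1L, 2L, 2L)
vals <- c(2, 1, 4)

# Build a small dense matrix from triplets with explicit dimensions
build_dense <- function(rows, cols, vals, n_rows, n_cols) {
  m <- matrix(0, nrow = n_rows, ncol = n_cols)
  m[cbind(rows, cols)] <- vals
  m
}

# Buggy variant: the row count is inferred from the largest row index
dtm_buggy <- build_dense(rows, cols, vals, max(rows), max(cols))
# Fixed variant: the row count comes from the number of documents
dtm_fixed <- build_dense(rows, cols, vals, n_docs, max(cols))

nrow(dtm_buggy)
nrow(dtm_fixed)
```

Passing the document count explicitly keeps the empty last row, which is the behaviour described for the fixed get_dtm().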
Preserving document ids in corpus and dtm
I’m not only the developer of text2vec, but also probably its most active user. Since the first public release I felt that I needed to smooth some rough edges. One of the most obvious gaps was the lack of a mechanism for keeping document ids during corpus (and dtm) construction. Now it is straightforward: if the input of the itoken function has names, these names will be used as document ids.
New high-level API
In 0.2 corpus was the central object. We can think of it as a container with reference semantics, which allows us to perform vectorization and collect term cooccurrence statistics simultaneously. After the corpus is created, only two functions are useful in 99% of cases: get_dtm and get_tcm. After that, users usually work with matrices. This means that corpus is actually an intermediate object and should mainly be used internally. In practice users usually need a Document-Term matrix (dtm) or a Term-Cooccurrence matrix (tcm), so creating these directly simplifies the transition from raw text to a vector space.
In 0.3 I introduce a new higher-level API for direct dtm and tcm creation: the create_dtm() and create_tcm() functions. This simplification also allows me to implement efficient concurrent growing of dtm and tcm. create_dtm() and create_tcm() internally use create_corpus(), but hide all the gory details and take care of parallel execution. Experienced users who need to vectorize a corpus and collect cooccurrence statistics simultaneously can still use create_corpus() and the corresponding get_dtm() and get_tcm() functions.
Another refinement is the introduction of the vectorizer concept. A vectorizer is a function which maps raw text space to vector space. There are 2 kinds of vectorizers:
- vocab_vectorizer, which uses a vocabulary to perform bag-of-ngrams vectorization;
- hash_vectorizer, which uses feature hashing (the hashing trick);
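To make the difference between the two kinds concrete, here is a minimal base R sketch. It is not how text2vec implements either vectorizer: the toy documents, the helper `hash_term` and its simple polynomial hash are assumptions chosen only for illustration.

```r
docs <- list(d1 = c("good", "movie", "good"), d2 = c("bad", "movie"))

# Vocabulary vectorization: one column per known term.
# This needs a first pass over the data to collect the vocabulary.
vocab <- sort(unique(unlist(docs)))
vocab_dtm <- t(vapply(docs, function(tokens) {
  tabulate(match(tokens, vocab), nbins = length(vocab))
}, integer(length(vocab))))
colnames(vocab_dtm) <- vocab

# Feature hashing: the column index is computed from the term itself,
# so no vocabulary (and no first pass) is needed. Collisions are possible.
hash_term <- function(term, hash_size) {
  h <- 0
  for (code in utf8ToInt(term)) h <- (h * 31 + code) %% hash_size
  h + 1  # shift to a 1-based column index
}
hash_size <- 8
hash_dtm <- t(vapply(docs, function(tokens) {
  idx <- vapply(tokens, hash_term, numeric(1), hash_size = hash_size)
  tabulate(idx, nbins = hash_size)
}, integer(hash_size)))

vocab_dtm
hash_dtm
```

The vocabulary version has exactly as many columns as there are distinct terms, while the hashed version always has `hash_size` columns, whatever the data contains.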
Iterators
As was pointed out here, in the case of vocabulary vectorization we perform 2 passes over the input source. This means we read, preprocess and tokenize twice. While I/O is usually not an issue (if you use an efficient reader like data.table::fread or functions from the readr package), preprocessing can take a significant amount of time. For this reason I created an itoken S3 method which works with a list of character vectors, i.e. a list of tokens. Now the user can tokenize the input once and then reuse the list of tokens in create_vocabulary, dtm and tcm construction. See the examples below.
Vocabulary
There were several improvements to vocabulary construction:
- stopwords filtering during vocabulary construction (especially useful for ngrams with n > 1);
- with create_vocabulary the vocabulary can be built in parallel using all your CPU cores;
- prune_vocabulary() became slightly more efficient – it performs fewer unnecessary computations;
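As a rough picture of what stopword filtering and pruning do to a vocabulary, here is a base R sketch on toy data. It does not call text2vec: the thresholds mirror the term_count_min and doc_proportion_max arguments used later in this post, but treating them as inclusive bounds is an assumption.

```r
docs <- list(c("the", "cat", "sat"), c("the", "dog"), c("the", "cat", "cat"))
stopwords <- c("the")

# Drop stopwords before counting
kept <- lapply(docs, function(tokens) tokens[!tokens %in% stopwords])

# Total occurrences per term, and number of documents containing each term
term_counts <- table(unlist(kept))
doc_counts <- table(unlist(lapply(kept, unique)))

vocab <- data.frame(
  term = names(term_counts),
  term_count = as.vector(term_counts),
  doc_count = as.vector(doc_counts[names(term_counts)]),
  stringsAsFactors = FALSE
)

# Prune: keep terms that are frequent enough but not present in too many documents
term_count_min <- 2
doc_proportion_max <- 0.7
pruned <- vocab[vocab$term_count >= term_count_min &
                  vocab$doc_count / length(docs) <= doc_proportion_max, ]
pruned
```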
Transformers
All transformers have been renamed and now start with transform_* (this makes working with autocompletion more convenient):
- transform_binary
- transform_tfidf
- transform_tf
- transform_filter_commons – still useful, even though it overlaps somewhat with prune_vocabulary
The following example demonstrates the new pipeline with many text2vec features (note how flexible text2vec can be, thanks to its functional style):
library(text2vec)
# for stemming
library(SnowballC)
data("movie_review")
stem_tokenizer <- function(x, tokenizer = word_tokenizer) {
  x %>%
    tokenizer %>%
    # Porter stemmer
    lapply(wordStem, 'en')
}
# create list of stemmed tokens
# each element of the list represents one original document
tokens <- movie_review$review %>%
  tolower %>%
  stem_tokenizer
# keep document ids in dtm and corpus!
names(tokens) <- movie_review$id
stopwords <- c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours") %>%
  # stem the stopwords too, because stopword filtering is performed after tokenization
  wordStem('en')
it <- itoken(tokens)
vocab <- create_vocabulary(it, ngram = c(1L, 1L), stopwords = stopwords)
# remove common and uncommon words
pruned_vocab <- prune_vocabulary(vocab, term_count_min = 5, doc_proportion_max = 0.5)
str(pruned_vocab)
List of 4
 $ vocab         :Classes 'data.table' and 'data.frame': 9595 obs. of 3 variables:
  ..$ terms       : chr [1:9595] "fiorentino" "bfg" "tadashi" "kabei" ...
  ..$ terms_counts: int [1:9595] 5 8 5 5 11 5 6 10 6 8 ...
  ..$ doc_counts  : int [1:9595] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, ".internal.selfref")=<externalptr>
 $ ngram         : Named int [1:2] 1 1
  ..- attr(*, "names")= chr [1:2] "ngram_min" "ngram_max"
 $ document_count: int 5000
 $ stopwords     : chr [1:11] "i" "me" "my" "myself" ...
 - attr(*, "class")= chr "text2vec_vocabulary"
One important note: in the current R implementation, iterators are mutable. So at this point our iterator is empty:
try(iterators::nextElem(it))
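The same effect can be reproduced without any packages. The sketch below uses a hypothetical closure-based iterator (`make_iterator` and `consume` are names made up for this illustration, not part of text2vec or the iterators package) to show why a second pass over an already consumed iterator sees nothing.

```r
# Hypothetical minimal iterator: a closure that remembers its position
make_iterator <- function(x) {
  i <- 0L
  list(
    has_next = function() i < length(x),
    next_elem = function() {
      if (i >= length(x)) stop("StopIteration")
      i <<- i + 1L  # advancing mutates the iterator's internal state
      x[[i]]
    }
  )
}

# Walk the iterator to the end and report how many elements were visited
consume <- function(it) {
  n <- 0L
  while (it$has_next()) {
    it$next_elem()
    n <- n + 1L
  }
  n
}

it <- make_iterator(list("a", "b", "c"))
first_pass <- consume(it)   # visits all elements
second_pass <- consume(it)  # visits nothing: the iterator is exhausted

it <- make_iterator(list("a", "b", "c"))  # re-create before the next pass
third_pass <- consume(it)
```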
So before corpus / dtm / tcm construction we need to reinitialise it. Here we create dtm directly:
it <- itoken(tokens)
v_vectorizer <- vocab_vectorizer(pruned_vocab)
dtm <- create_dtm(it, v_vectorizer)
# check that dtm keep documents names/ids as rownames
head(rownames(dtm))
[1] "5814_8" "2381_9" "7759_3" "3630_4" "9495_8" "8196_8"
identical(rownames(dtm), movie_review$id)
[1] TRUE
Or tcm:
it <- itoken(tokens)
cooccurence_vectorizer <- vocab_vectorizer(pruned_vocab, grow_dtm = FALSE, skip_grams_window = 5L)
tcm <- create_tcm(it, cooccurence_vectorizer)
Old-style simultaneous vectorization and collection of cooccurrence statistics:
it <- itoken(tokens)
v_vectorizer <- vocab_vectorizer(pruned_vocab, grow_dtm = TRUE, skip_grams_window = 5L)
corpus <- create_corpus(it, v_vectorizer)
dtm <- get_dtm(corpus)
tcm <- get_tcm(corpus)
Another option is to use hash_vectorizer. The procedure is the same:
# create hash vectorizer for unigrams and bigrams
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 16, ngram = c(1L, 2L))
it <- itoken(tokens)
dtm <- create_dtm(it, h_vectorizer)
Parallel mode
Now create_dtm, create_tcm and create_vocabulary take advantage of multicore machines and do so transparently. In contrast to GloVe fitting, which uses low-level thread parallelism via RcppParallel, the other functions use standard R high-level parallelism on top of the foreach package. They are flexible and can use different parallel backends: doParallel, doRedis, etc. But the user should remember that such high-level parallelism can involve significant overhead.
The user only has to do two things manually to take advantage of a multicore machine:
- prepare splits of the input data in the form of a list of itoken iterators;
- register a parallel backend.
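The idea behind the splits can be sketched in base R: each chunk is processed independently and the partial results are combined at the end. This is only a conceptual illustration on toy data, with `lapply` standing in for the parallel workers; it assumes nothing about how text2vec distributes or merges work internally.

```r
tokens <- list(c("a", "b", "a"), c("b", "c"), c("a"), c("c", "c", "d"))
n_splits <- 2L

# Assign documents to contiguous chunks of roughly equal size
chunk_id <- ceiling(seq_along(tokens) * n_splits / length(tokens))
jobs <- split(tokens, chunk_id)

# Work done per job: count term occurrences in that chunk
count_terms <- function(docs) table(unlist(docs))

# Each job could run in its own worker; lapply stands in for the parallel map
partial <- lapply(jobs, count_terms)

# Combine the partial results by summing counts per term
combined <- tapply(
  unlist(lapply(partial, as.vector)),
  unlist(lapply(partial, names)),
  sum
)

# The combined result matches a single serial pass over all documents
serial <- count_terms(tokens)
all(as.vector(combined) == as.vector(serial))
```

The combine step is where the overhead mentioned above comes from: on small inputs, splitting and merging can cost as much as the counting itself.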
Here is a simple example with timings:
N_WORKERS <- 2
library(doParallel)
library(microbenchmark)
registerDoParallel(N_WORKERS)
# "jobs" is a list of itoken iterators!
N_SPLITS <- 2
jobs <- tokens %>%
  split_into(N_SPLITS) %>%
  lapply(itoken)
# performance comparison between serial and parallel versions
# vocabulary creation
microbenchmark(
  vocab_serial <- create_vocabulary(itoken(tokens), stopwords = stopwords),
  vocab_parallel <- create_vocabulary(jobs, stopwords = stopwords),
  times = 1
)
Unit: milliseconds
 expr                                                                          min       lq     mean   median       uq      max neval
 vocab_serial <- create_vocabulary(itoken(tokens), stopwords = stopwords) 382.0348 382.0348 382.0348 382.0348 382.0348 382.0348     1
 vocab_parallel <- create_vocabulary(jobs, stopwords = stopwords)         254.0068 254.0068 254.0068 254.0068 254.0068 254.0068     1
# dtm vocabulary vectorization
v_vectorizer <- vocab_vectorizer(vocab_serial)
# dtm feature hashing
h_vectorizer <- hash_vectorizer()
# tcm vectorization
tcm_vectorizer <- vocab_vectorizer(vocab_serial, grow_dtm = T, skip_grams_window = 5)
microbenchmark(
  vocab_dtm_serial <- create_dtm(itoken(tokens), vectorizer = v_vectorizer),
  vocab_dtm_parallel <- create_dtm(jobs, vectorizer = v_vectorizer),
  hash_dtm_serial <- create_dtm(itoken(tokens), vectorizer = h_vectorizer),
  hash_dtm_parallel <- create_dtm(jobs, vectorizer = h_vectorizer),
  tcm_serial <- create_dtm(itoken(tokens), vectorizer = tcm_vectorizer),
  tcm_parallel <- create_dtm(jobs, vectorizer = tcm_vectorizer),
  times = 1
)
Unit: milliseconds
 expr                                                                            min        lq      mean    median        uq       max neval
 vocab_dtm_serial <- create_dtm(itoken(tokens), vectorizer = v_vectorizer) 1054.9643 1054.9643 1054.9643 1054.9643 1054.9643 1054.9643     1
 vocab_dtm_parallel <- create_dtm(jobs, vectorizer = v_vectorizer)          697.1996  697.1996  697.1996  697.1996  697.1996  697.1996     1
 hash_dtm_serial <- create_dtm(itoken(tokens), vectorizer = h_vectorizer)  1234.3570 1234.3570 1234.3570 1234.3570 1234.3570 1234.3570     1
 hash_dtm_parallel <- create_dtm(jobs, vectorizer = h_vectorizer)           592.7327  592.7327  592.7327  592.7327  592.7327  592.7327     1
 tcm_serial <- create_dtm(itoken(tokens), vectorizer = tcm_vectorizer)     3136.1603 3136.1603 3136.1603 3136.1603 3136.1603 3136.1603     1
 tcm_parallel <- create_dtm(jobs, vectorizer = tcm_vectorizer)             1780.9763 1780.9763 1780.9763 1780.9763 1780.9763 1780.9763     1
As you can see, the speedup is not perfect. This is because R’s high-level parallelism has significant overhead on small tasks. On larger tasks you can expect almost linear speedup!
Bonus: how fast is fast?
On a 16-core machine I was able to vectorize (unigrams) the English Wikipedia (13 GB of text, 4M documents) in 2.5 minutes using the hash vectorizer and in 6 minutes using the vocabulary vectorizer. Timings include the time spent reading from disk! The resulting dtm was about 13 GB, and at peak the R processes consumed about 30 GB of RAM. (Try to do it with any other R package or Python module.)
Here is the code:
library(text2vec)
library(data.table)
library(doParallel)
registerDoParallel(16)
start <- Sys.time()
# tab-separated wikipedia: "article_title \t article_body"
# article_body is single-space separated
reader <- function(x) {
  fread(x, sep = '\t', header = F, select = 2, colClasses = rep('character', 2))[[1]]
}
# each file is roughly 100mb
fls <- list.files("~/datasets/enwiki_splits/", full.names = T)
# jobs are simply a list of itoken iterators. Each element is a separate job in a separate process.
# after finishing, the results will be efficiently combined (especially efficiently in case of `dgTMatrix`)
jobs <- fls %>%
  # combine files into 64 groups, so we will have 64 jobs
  split_into(64) %>%
  lapply(function(x) x %>% ifiles(reader_function = reader) %>% itoken)
# alternatively, each file can be processed as a separate job
# jobs <- lapply(fls, function(x) x %>% ifiles(reader_function = reader) %>% itoken)
v <- create_vocabulary(jobs) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
dtm <- create_dtm(jobs, vocab_vectorizer(v), type = 'dgTMatrix')
finish <- Sys.time()
Updates
- updated 2016-03-31: a few syntax improvements, to be consistent with Hadley’s style guide – all function names are verbs:
  - vocabulary -> create_vocabulary
  - transformer_* -> transform_*