<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Data Science Notes</title>
 <link href="http://dsnotes.com/" rel="self"/>
 <updated>2016-01-09T21:44:43+00:00</updated>
 <id>http://dsnotes.com</id>
 <author>
   <name>Dmitriy Selivanov</name>
   <email>selivanov.dmitriy@gmail.com</email>
 </author>

 
 <entry>
   <title>text2vec implementation details. Writing fast parallel asynchronous SGD/AdaGrad.</title>
   <link href="http://dsnotes.com/blog/text2vec/2016/01/09/fast-parallel-async-adagrad"/>
   <updated>2016-01-09T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/text2vec/2016/01/09/fast-parallel-async-adagrad</id>
   <content type="html">&lt;p&gt;Before reading this post, I very recommend to read:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The original &lt;a href=&quot;http://www-nlp.stanford.edu/projects/glove/glove.pdf&quot;&gt;GloVe paper&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.foldl.me/2014/glove-python/&quot;&gt;Jon Gauthier’s post&lt;/a&gt;, which provides a detailed explanation of a Python implementation. This post helped me a lot with the C++ implementation.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;word-embedding&quot;&gt;Word embedding&lt;/h1&gt;

&lt;p&gt;After Tomas Mikolov et al. released the &lt;a href=&quot;https://code.google.com/p/word2vec/&quot;&gt;word2vec&lt;/a&gt; tool, there was a boom of articles about word vector representations. One of the greatest is &lt;a href=&quot;http://nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt;, which made a big contribution by explaining how such algorithms work and by reformulating the word2vec optimization as a special kind of factorization of the word cooccurrence matrix.&lt;/p&gt;

&lt;p&gt;This post consists of two main parts:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A very brief introduction to the GloVe algorithm.&lt;/li&gt;
  &lt;li&gt;Implementation details. I will show how to write fast parallel asynchronous SGD with an adaptive learning rate in C++ using Intel TBB and &lt;a href=&quot;http://rcppcore.github.io/RcppParallel/&quot;&gt;RcppParallel&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;introduction-to-glove-algorithm&quot;&gt;Introduction to GloVe algorithm&lt;/h1&gt;

&lt;p&gt;The GloVe algorithm consists of the following steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Collect word cooccurrence statistics in the form of a word cooccurrence matrix &lt;script type=&quot;math/tex&quot;&gt;X&lt;/script&gt;. Each element &lt;script type=&quot;math/tex&quot;&gt;X_{ij}&lt;/script&gt; of such a matrix represents how often &lt;em&gt;word i&lt;/em&gt; appears in the context of &lt;em&gt;word j&lt;/em&gt;. Usually we scan our corpus in the following manner: for each term we look for context terms within some area - &lt;em&gt;window_size&lt;/em&gt; before and &lt;em&gt;window_size&lt;/em&gt; after. Also, we give less weight to more distant words (usually &lt;script type=&quot;math/tex&quot;&gt;decay = 1/offset&lt;/script&gt;).&lt;/li&gt;
  &lt;li&gt;Define a soft constraint for each word pair: &lt;script type=&quot;math/tex&quot;&gt;w_i^Tw_j + b_i + b_j = \log X_{ij}&lt;/script&gt;. Here &lt;script type=&quot;math/tex&quot;&gt;w_i&lt;/script&gt; is the vector for the main word, &lt;script type=&quot;math/tex&quot;&gt;w_j&lt;/script&gt; is the vector for the context word, and &lt;script type=&quot;math/tex&quot;&gt;b_i&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;b_j&lt;/script&gt; are scalar biases for the main and context words.&lt;/li&gt;
  &lt;li&gt;Define the cost function &lt;script type=&quot;math/tex&quot;&gt;J = \sum_{i=1}^V \sum_{j=1}^V \; f(X_{ij}) ( w_i^T w_j + b_i + b_j - \log X_{ij})^2&lt;/script&gt;. Here &lt;script type=&quot;math/tex&quot;&gt;f&lt;/script&gt; is a weighting function which helps us prevent learning only from extremely common word pairs. The GloVe authors chose the following function:&lt;/li&gt;
&lt;/ol&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;% &lt;![CDATA[
f(X_{ij}) = 
\begin{cases}
(\frac{X_{ij}}{x_{max}})^\alpha &amp; \text{if } X_{ij} &lt; x_{max} \\
1 &amp; \text{otherwise}
\end{cases} %]]&gt;&lt;/script&gt;
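&lt;p&gt;To make this concrete, here is a tiny Python sketch (illustrative only, not the text2vec code; the names &lt;code&gt;f_weight&lt;/code&gt; and &lt;code&gt;glove_cost&lt;/code&gt; are made up, and the defaults follow the paper’s suggested values) that evaluates the weighting function and the cost for a cooccurrence matrix in triplet form:&lt;/p&gt;

```python
import math

def f_weight(x, x_max=100.0, alpha=0.75):
    # (x / x_max)^alpha while x stays below x_max, capped at 1 afterwards
    return min(x / x_max, 1.0) ** alpha

def glove_cost(triplets, w_main, w_ctx, b_main, b_ctx):
    # triplets: iterable of (i, j, X_ij); vectors and biases are plain lists
    J = 0.0
    for i, j, x in triplets:
        dot = sum(a * b for a, b in zip(w_main[i], w_ctx[j]))
        err = dot + b_main[i] + b_ctx[j] - math.log(x)
        J += f_weight(x) * err * err
    return J
```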

&lt;h1 id=&quot;implementation&quot;&gt;Implementation&lt;/h1&gt;
&lt;p&gt;The main challenges I faced during the implementation were:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Efficient cooccurrence matrix creation.&lt;/li&gt;
  &lt;li&gt;Implementation of an efficient SGD for minimizing the cost function.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;cooccurence-matrix-creation&quot;&gt;Cooccurrence matrix creation&lt;/h2&gt;
&lt;p&gt;There are two main requirements for the term cooccurrence matrix (&lt;em&gt;tcm&lt;/em&gt;):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;em&gt;tcm&lt;/em&gt; should be sparse. We should be able to construct a &lt;em&gt;tcm&lt;/em&gt; for large vocabularies ( &amp;gt; 100k words).&lt;/li&gt;
  &lt;li&gt;Fast lookups/inserts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To meet the sparsity requirement we need to store the data in an associative array. &lt;code&gt;unordered_map&lt;/code&gt; is a good candidate because of its &lt;script type=&quot;math/tex&quot;&gt;O(1)&lt;/script&gt; average lookup/insert complexity. I ended up with &lt;code&gt;std::unordered_map&amp;lt; std::pair&amp;lt;uint32_t, uint32_t&amp;gt;, T &amp;gt;&lt;/code&gt; as the container for the sparse matrix in triplet form. The performance of &lt;code&gt;unordered_map&lt;/code&gt; heavily depends on the underlying hash function. Fortunately, we can pack a &lt;code&gt;pair&amp;lt;uint32_t, uint32_t&amp;gt;&lt;/code&gt; into a single &lt;code&gt;uint64_t&lt;/code&gt; in a deterministic way without any collisions.&lt;br /&gt;
A hash function for &lt;code&gt;std::pair&amp;lt;uint32_t, uint32_t&amp;gt;&lt;/code&gt; then looks like:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;k&quot;&gt;namespace&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;template&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hash&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;operator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;second&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For details see &lt;a href=&quot;http://stackoverflow.com/a/24693169/1069256&quot;&gt;this&lt;/a&gt; and &lt;a href=&quot;http://stackoverflow.com/questions/2768890&quot;&gt;this&lt;/a&gt; Stack Overflow question.&lt;/p&gt;
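&lt;p&gt;The packing trick itself is easy to check in any language. Here is a quick Python sketch (names are illustrative; the arithmetic form &lt;code&gt;i * 2^32 + j&lt;/code&gt; is equivalent to the shift-and-or in the C++ code above):&lt;/p&gt;

```python
TWO_32 = 2 ** 32  # one more than the largest uint32_t value

def pack(i, j):
    # i occupies the high 32 bits, j the low 32 bits
    return i * TWO_32 + j

def unpack(key):
    # inverse mapping: the quotient is i, the remainder is j
    return divmod(key, TWO_32)
```

&lt;p&gt;Since the mapping is a bijection between pairs of 32-bit ids and 64-bit keys, distinct pairs can never collide at this stage; any remaining collisions come only from the hash table itself.&lt;/p&gt;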

&lt;p&gt;Also note that our cooccurrence matrix is symmetric, so internally we will store only the elements above the main diagonal.&lt;/p&gt;
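&lt;p&gt;Putting the pieces of step 1 together, here is a toy Python sketch of cooccurrence counting (illustrative only, not the text2vec code; &lt;code&gt;build_tcm&lt;/code&gt; and &lt;code&gt;window_size&lt;/code&gt; are made-up names) with 1/offset decay that stores each unordered pair exactly once:&lt;/p&gt;

```python
from collections import defaultdict

def build_tcm(token_ids, window_size=5):
    # token_ids: a document as a list of integer word ids
    tcm = defaultdict(float)
    n = len(token_ids)
    for pos in range(n):
        # scan only forward: this visits each unordered pair exactly once
        for offset in range(1, window_size + 1):
            ctx = pos + offset
            if ctx == n:
                break
            a, b = token_ids[pos], token_ids[ctx]
            # canonical order (smaller id first) keeps the upper triangle
            key = (min(a, b), max(a, b))
            tcm[key] += 1.0 / offset
    return dict(tcm)
```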

&lt;h2 id=&quot;stochastic-gradient-descent&quot;&gt;Stochastic gradient descent&lt;/h2&gt;

&lt;p&gt;Now we should implement efficient parallel asynchronous stochastic gradient descent for the word cooccurrence matrix factorization proposed in the &lt;a href=&quot;http://nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt; paper. Interestingly, SGD is an inherently serial algorithm, but when your problem is sparse, you can perform asynchronous updates without any locks and achieve a speedup proportional to the number of cores on your machine! If you haven’t read &lt;a href=&quot;https://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf&quot;&gt;HOGWILD!&lt;/a&gt;, I recommend doing so.&lt;/p&gt;

&lt;p&gt;Let me recall the formulation of SGD. We try to move the parameters &lt;script type=&quot;math/tex&quot;&gt;x_t&lt;/script&gt; in a minimizing direction, given by &lt;script type=&quot;math/tex&quot;&gt;−g_t&lt;/script&gt;, with a learning rate &lt;script type=&quot;math/tex&quot;&gt;\alpha&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;x_{t+1} = x_t − \alpha g_t&lt;/script&gt;

&lt;p&gt;So, we have to calculate gradients for our cost function:&lt;/p&gt;

&lt;p&gt;&lt;script type=&quot;math/tex&quot;&gt;J = \sum_{i=1}^V \sum_{j=1}^V f(X_{ij}) ( w_i^T w_j + b_i + b_j - \log X_{ij} )^2&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\frac{\partial J}{\partial w_i} = f(X_{ij}) w_j ( w_i^T w_j + b_i + b_j - \log X_{ij})&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\frac{\partial J}{\partial w_j} = f(X_{ij}) w_i ( w_i^T w_j + b_i + b_j - \log X_{ij})&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\frac{\partial J}{\partial b_i} = f(X_{ij}) (w_i^T w_j + b_i + b_j - \log X_{ij})&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\frac{\partial J}{\partial b_j} = f(X_{ij}) (w_i^T w_j + b_i + b_j - \log X_{ij})&lt;/script&gt;
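&lt;p&gt;These gradients drop a constant factor of 2 from differentiating the square; it is simply absorbed into the learning rate. A quick way to gain confidence in such derivations is a finite-difference check. Here is an illustrative Python sketch for the bias gradient (all names are made up, and the factor 2 is kept explicit so the numeric and analytic values match exactly):&lt;/p&gt;

```python
import math

def cost_one_pair(w_i, w_j, b_i, b_j, x, f_x):
    # contribution of a single (i, j) pair to J
    dot = sum(a * b for a, b in zip(w_i, w_j))
    err = dot + b_i + b_j - math.log(x)
    return f_x * err * err

def analytic_grad_b_i(w_i, w_j, b_i, b_j, x, f_x):
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return 2.0 * f_x * (dot + b_i + b_j - math.log(x))

def numeric_grad_b_i(w_i, w_j, b_i, b_j, x, f_x, eps=1e-6):
    # central finite difference with respect to b_i
    up = cost_one_pair(w_i, w_j, b_i + eps, b_j, x, f_x)
    down = cost_one_pair(w_i, w_j, b_i - eps, b_j, x, f_x)
    return (up - down) / (2.0 * eps)
```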

&lt;h2 id=&quot;adagrad&quot;&gt;AdaGrad&lt;/h2&gt;

&lt;p&gt;We will use a modification of SGD - the &lt;a href=&quot;http://www.magicbroom.info/Papers/DuchiHaSi10.pdf&quot;&gt;AdaGrad&lt;/a&gt; algorithm. It automatically determines a per-feature learning rate by tracking historical gradients, so that frequently occurring features
in the gradients get small learning rates and infrequent features get higher ones. For AdaGrad implementation details see the excellent &lt;a href=&quot;http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf&quot;&gt;Notes on AdaGrad&lt;/a&gt; by Chris Dyer.&lt;/p&gt;

&lt;p&gt;The formulation of AdaGrad for step &lt;script type=&quot;math/tex&quot;&gt;t&lt;/script&gt; and feature &lt;script type=&quot;math/tex&quot;&gt;i&lt;/script&gt; is the following:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;x_{t+1, i} = x_{t, i} − \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t-1} g_{\tau,i}^2}} g_{t,i}&lt;/script&gt;

&lt;p&gt;As we can see, at each iteration &lt;script type=&quot;math/tex&quot;&gt;t&lt;/script&gt; we need to keep track of the sum of squares of all historical gradients.&lt;/p&gt;
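&lt;p&gt;As a sanity check of the update rule, here is a toy Python sketch of AdaGrad on a one-dimensional quadratic (illustrative only; &lt;code&gt;adagrad_minimize&lt;/code&gt; and its defaults are assumptions, and &lt;code&gt;eps&lt;/code&gt; guards against division by zero on the first step):&lt;/p&gt;

```python
import math

def adagrad_minimize(grad, x0, learning_rate=0.5, steps=500, eps=1e-8):
    x = x0
    g_sq_sum = 0.0  # running sum of squared gradients for this feature
    for _ in range(steps):
        g = grad(x)
        g_sq_sum += g * g
        # per-feature step: the learning rate shrinks as gradients accumulate
        x -= learning_rate * g / (math.sqrt(g_sq_sum) + eps)
    return x
```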

&lt;h2 id=&quot;parallel-asynchronous-adagrad&quot;&gt;Parallel asynchronous AdaGrad&lt;/h2&gt;

&lt;p&gt;Actually, we will use a modification of AdaGrad - &lt;em&gt;HOGWILD-style&lt;/em&gt; asynchronous AdaGrad :-) The main idea of the &lt;em&gt;HOGWILD!&lt;/em&gt; algorithm is very simple - don’t use any synchronization. If your problem is sparse, allow the threads to overwrite each other! This works, and works well. Again, see the &lt;a href=&quot;http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf&quot;&gt;HOGWILD!&lt;/a&gt; paper for details and a theoretical proof.&lt;/p&gt;

&lt;h2 id=&quot;code&quot;&gt;Code&lt;/h2&gt;

&lt;p&gt;Now let’s put it all into code.&lt;/p&gt;

&lt;p&gt;As seen from the analysis above, the &lt;code&gt;GloveFit&lt;/code&gt; class should contain the following parameters:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;word vectors &lt;code&gt;w_i&lt;/code&gt;, &lt;code&gt;w_j&lt;/code&gt; (for main and context words).&lt;/li&gt;
  &lt;li&gt;biases &lt;code&gt;b_i&lt;/code&gt;, &lt;code&gt;b_j&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;word vectors square gradients &lt;code&gt;grad_sq_w_i&lt;/code&gt;, &lt;code&gt;grad_sq_w_j&lt;/code&gt; for adaptive learning rates.&lt;/li&gt;
  &lt;li&gt;word biases square gradients &lt;code&gt;grad_sq_b_i&lt;/code&gt;, &lt;code&gt;grad_sq_b_j&lt;/code&gt; for adaptive learning rates.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;learning_rate&lt;/code&gt;, &lt;code&gt;max_cost&lt;/code&gt; and other scalar model parameters.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;private&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word_vec_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// see https://github.com/maciejkula/glove-python/pull/9#issuecomment-68058795&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// clips the cost for numerical stability&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// initial learning rate&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// word vectors&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w_j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// word biases&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b_j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// word vectors square gradients&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_sq_w_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_sq_w_j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// word biases square gradients&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_sq_b_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_sq_b_j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id=&quot;single-iteration&quot;&gt;Single iteration&lt;/h3&gt;

&lt;p&gt;Now we need to &lt;a href=&quot;https://github.com/dselivanov/text2vec/blob/master/src/GloveFit.h#L8-L41&quot;&gt;initialize&lt;/a&gt; the parameters and perform an iteration of SGD:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;//init cost
&lt;span class=&quot;nv&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 0
// assume tcm is sparse matrix in triplet form - &amp;lt;i, j, x&amp;gt;
for_each &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&amp;lt;i, j, x&amp;gt; &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  //compute cost &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt; and add it to global cost
  global_cost +&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; J&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  //Compute gradients &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; bias terms and perform adaptive updates &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; bias terms
  //Compute gradients &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; word vector terms and perform adaptive updates &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; word vectors
  //Update squared gradient sums &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; word vectors
  //Update squared gradient sums &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; biases
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; global_cost&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For actual text2vec code (with a few tricks) check &lt;a href=&quot;https://github.com/dselivanov/text2vec/blob/master/src/GloveFit.h#L52-L134&quot;&gt;this loop&lt;/a&gt;.&lt;/p&gt;
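&lt;p&gt;The pseudocode above can also be sketched in Python. This illustrative toy (not the text2vec code; all names and defaults are assumptions) runs one AdaGrad pass over a tcm in triplet form, updating only the bias terms to keep it short. The squared-gradient accumulators are assumed to be initialized to 1 to avoid division by zero:&lt;/p&gt;

```python
import math

def one_epoch(triplets, b_main, b_ctx, g_sq_b_main, g_sq_b_ctx,
              learning_rate=0.05, x_max=10.0, alpha=0.75):
    global_cost = 0.0
    for i, j, x in triplets:
        f = min(x / x_max, 1.0) ** alpha
        # word vectors omitted: the toy model here is b_i + b_j ~ log(X_ij)
        err = b_main[i] + b_ctx[j] - math.log(x)
        # compute the cost and add it to the global cost
        global_cost += f * err * err
        grad = 2.0 * f * err
        # adaptive updates for the bias terms
        b_main[i] -= learning_rate * grad / math.sqrt(g_sq_b_main[i])
        b_ctx[j] -= learning_rate * grad / math.sqrt(g_sq_b_ctx[j])
        # update the squared gradient sums
        g_sq_b_main[i] += grad * grad
        g_sq_b_ctx[j] += grad * grad
    return global_cost
```

&lt;p&gt;Repeated calls drive the returned cost down, mirroring what the real per-iteration loop does for the full set of parameters.&lt;/p&gt;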

&lt;h3 id=&quot;openmp&quot;&gt;OpenMP&lt;/h3&gt;

&lt;p&gt;As discussed above, all these steps can be performed in a parallel loop (over all non-zero word cooccurrence scores). This can be easily done via OpenMP &lt;code&gt;parallel for&lt;/code&gt; and a reduction: &lt;code&gt;#pragma omp parallel for reduction(+:global_cost)&lt;/code&gt;. &lt;strong&gt;But there is one significant issue&lt;/strong&gt; with this approach - it is very hard to make a portable R package with OpenMP support. By default it will work only on Linux distributions, because:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;the default &lt;code&gt;clang&lt;/code&gt; on OS X doesn’t support OpenMP (of course you can install &lt;code&gt;clang-omp&lt;/code&gt; or &lt;code&gt;gcc&lt;/code&gt; from brew, but this can also be tricky).&lt;/li&gt;
  &lt;li&gt;Rtools only began to support OpenMP on Windows in 2015, and even the modern implementation has substantial overhead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For more details see the &lt;a href=&quot;https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support&quot;&gt;OpenMP support&lt;/a&gt; section of the Writing R Extensions manual.&lt;/p&gt;

&lt;h3 id=&quot;intel-tbb&quot;&gt;Intel TBB&lt;/h3&gt;

&lt;p&gt;Luckily we have a better alternative - the &lt;a href=&quot;https://www.threadingbuildingblocks.org/&quot;&gt;Intel Thread Building Blocks&lt;/a&gt; library and the &lt;a href=&quot;http://rcppcore.github.io/RcppParallel/&quot;&gt;RcppParallel&lt;/a&gt; package, which provides the &lt;code&gt;RVector&lt;/code&gt; and &lt;code&gt;RMatrix&lt;/code&gt; wrapper classes for safe and convenient access to R data structures in a multi-threaded environment! Moreover, &lt;strong&gt;it “just works” on the main platforms - OS X, Windows, Linux&lt;/strong&gt;. I have had a very positive experience with this library; thanks to the Rcpp Core team and especially to JJ Allaire.&lt;/p&gt;

&lt;p&gt;Using TBB is a little bit trickier than writing simple OpenMP &lt;code&gt;#pragma&lt;/code&gt; directives. You should implement a &lt;em&gt;functor&lt;/em&gt; which operates on a chunk of data and call &lt;code&gt;parallelReduce&lt;/code&gt; or &lt;code&gt;parallelFor&lt;/code&gt; on the entire data collection. You can find useful (and simple) examples in the &lt;a href=&quot;http://rcppcore.github.io/RcppParallel/#examples&quot;&gt;RcppParallel examples&lt;/a&gt; section.&lt;/p&gt;

&lt;h3 id=&quot;putting-all-together&quot;&gt;Putting all together&lt;/h3&gt;

&lt;p&gt;For now, suppose we have a &lt;code&gt;partial_fit&lt;/code&gt; method in the &lt;code&gt;GloveFit&lt;/code&gt; class with the following signature (&lt;a href=&quot;https://github.com/dselivanov/text2vec/blob/master/src/GloveFit.h#L52-L134&quot;&gt;see the actual code here&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;partial_fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                    &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It takes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;em&gt;tcm&lt;/em&gt; in sparse triplet form &lt;code&gt;&amp;lt;x_irow, x_icol, x_val&amp;gt;&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;begin&lt;/code&gt; and &lt;code&gt;end&lt;/code&gt; indices of the range on which we want to perform our SGD.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It then performs SGD steps over this range - &lt;a href=&quot;#single-iteration&quot;&gt;updates word vectors, gradients, etc&lt;/a&gt;. At the end it returns the value of the accumulated cost function. Note that internally this method modifies members of the class.&lt;/p&gt;

&lt;p&gt;Also note that the signature of &lt;code&gt;partial_fit&lt;/code&gt; is very similar to what we have to implement in our TBB functor. Now we are ready to write it:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nl&quot;&gt;AdaGradIter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Worker&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter_order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word_vec_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_iters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// accumulated value&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// function to set global_cost = 0 between iterations&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;set_cost_zero&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;//init function to use between iterations&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irowR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icolR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NumericVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_valR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iter_orderR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irowR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icolR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_valR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;iter_order&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RVector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iter_orderR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// fit is a reference member, already bound in the constructor&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// dummy constructor&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// used at first initialization of GloveFitter&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NumericVector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;iter_order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{};&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// constructors&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irowR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
              &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icolR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
              &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NumericVector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_valR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
              &lt;span class=&quot;n&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irowR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icolR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_valR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
    
  &lt;span class=&quot;c1&quot;&gt;// constructor called at split&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
    &lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// process just the elements of the range&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;operator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partial_fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// join my value with that of another global_cost&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, it is very similar to the example from the RcppParallel site. One difference: it has side effects. By calling &lt;code&gt;partial_fit&lt;/code&gt;, it modifies the internal state of the input instance of the &lt;code&gt;GloveFit&lt;/code&gt; class (which actually contains our GloVe model).&lt;/p&gt;
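&lt;p&gt;The split/join mechanics of &lt;code&gt;parallelReduce&lt;/code&gt; can be illustrated outside of C++. The sketch below is hypothetical code, not part of text2vec, and its &lt;code&gt;partial_fit&lt;/code&gt; is a toy stand-in: the work is split into ranges, each worker accumulates its own partial cost, and the partial costs are then joined by summation - the role played by the &lt;code&gt;join&lt;/code&gt; method above:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def partial_fit(begin, end, x_val):
    # toy stand-in for GloveFit::partial_fit: returns the cost
    # accumulated over one range of cooccurrence values
    return sum(v * v for v in x_val[begin:end])

def parallel_reduce(x_val, n_workers=4):
    n = len(x_val)
    if n == 0:
        return 0.0
    # split the index range into roughly equal chunks, one per worker
    step = (n + n_workers - 1) // n_workers
    ranges = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        costs = list(pool.map(lambda r: partial_fit(r[0], r[1], x_val), ranges))
    # the "join" step: combine per-worker partial costs
    return sum(costs)
```

&lt;p&gt;In the real implementation each worker additionally mutates the shared model state, which is what makes the SGD asynchronous rather than a pure reduction.&lt;/p&gt;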

&lt;p&gt;Now let’s write the &lt;code&gt;GloveFitter&lt;/code&gt; class, which will be callable from R via &lt;code&gt;Rcpp-modules&lt;/code&gt;. It acts as an interface for fitting our model and takes all of the input model parameters, such as the vocabulary size, the desired word vector size, the initial AdaGrad learning rate, etc. We also want to track the cost between iterations and to be able to apply an early stopping strategy between SGD iterations. For that purpose we keep our model in a C++ class, so we can modify it “in place” at each SGD iteration (which can be problematic in R).&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GloveFitter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word_vec_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;gloveFit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;word_vec_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gloveFit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// function to set cost to zero from R (used between SGD iterations)&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;set_cost_zero&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_cost_zero&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();};&lt;/span&gt;

  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit_chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NumericVector&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerVector&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter_order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_icol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iter_order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gloveFit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// &lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;parallelReduce&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_irow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// export word vectors to R&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;List&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_word_vectors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;word_vectors&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_word_vectors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;private&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;GloveFit&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gloveFit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;AdaGradIter&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adaGradIter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
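&lt;p&gt;To make the constructor parameters concrete, here is a rough sketch of the per-pair update that the GloVe paper prescribes - hypothetical Python, not the actual &lt;code&gt;partial_fit&lt;/code&gt; code: &lt;code&gt;x_max&lt;/code&gt; and &lt;code&gt;alpha&lt;/code&gt; define the weighting function, &lt;code&gt;max_cost&lt;/code&gt; clips the error term, and AdaGrad shrinks &lt;code&gt;learning_rate&lt;/code&gt; per coordinate:&lt;/p&gt;

```python
import math

def glove_pair_update(w_i, w_j, b_i, b_j, x_ij, grad_sq_i, grad_sq_j,
                      learning_rate=0.05, x_max=100.0, alpha=0.75,
                      max_cost=10.0):
    # weighting function f(x) from the GloVe paper
    weight = min(1.0, (x_ij / x_max) ** alpha)
    # error term: w_i . w_j + b_i + b_j - log(x_ij)
    inner = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j - math.log(x_ij)
    # clip, so a single noisy pair cannot blow up the gradients
    inner = max(-max_cost, min(max_cost, inner))
    cost = weight * inner * inner
    for k in range(len(w_i)):
        g_i = weight * inner * w_j[k]
        g_j = weight * inner * w_i[k]
        grad_sq_i[k] += g_i * g_i
        grad_sq_j[k] += g_j * g_j
        # AdaGrad: per-coordinate learning rate shrinks with gradient history
        w_i[k] -= learning_rate * g_i / math.sqrt(grad_sq_i[k])
        w_j[k] -= learning_rate * g_j / math.sqrt(grad_sq_j[k])
    return cost
```

&lt;p&gt;The returned cost is what gets summed into &lt;code&gt;global_cost&lt;/code&gt; across all pairs and workers.&lt;/p&gt;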

&lt;p&gt;And create a wrapper with &lt;code&gt;Rcpp-Modules&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cpp&quot; data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;n&quot;&gt;RCPP_MODULE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;class_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;GloveFitter&amp;quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;//&amp;lt;vocab_size, word_vec_size, x_max, learning_rate, grain_size, max_cost, alpha&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;constructor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;get_word_vectors&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_word_vectors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;returns word vectors&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;set_cost_zero&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_cost_zero&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;sets cost to zero&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;fit_chunk&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GloveFitter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;process TCM data chunk&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we can use the &lt;code&gt;GloveFitter&lt;/code&gt; class from R:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; new&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; GloveFitter&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; vocabulary_size&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; word_vectors_size&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; x_max&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
            learning_rate&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; grain_size&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; max_cost&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; alpha&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
NUM_ITER &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;
&lt;span class=&quot;kr&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;kr&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;NUM_ITER&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  cost &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;fit_chunk&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;i&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;j&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; iter_order&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;cost&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;set_cost_zero&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
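&lt;p&gt;Since &lt;code&gt;fit_chunk&lt;/code&gt; returns the accumulated cost, an early stopping rule is easy to bolt onto this loop. A minimal convergence check - a hypothetical helper, shown in Python for brevity - could look like this:&lt;/p&gt;

```python
def should_stop(cost_history, tol=1e-3):
    # stop once the relative improvement over the previous epoch
    # is non-negative but no larger than tol
    if len(cost_history) >= 2:
        prev, cur = cost_history[-2], cost_history[-1]
        improvement = (prev - cur) / max(prev, 1e-12)
        return improvement >= 0 and tol >= improvement
    return False
```

&lt;p&gt;Append each epoch’s cost to a history list and break out of the loop when the helper returns true.&lt;/p&gt;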

</content>
 </entry>
 
 <entry>
   <title>Experiments on english wikipedia. GloVe and word2vec.</title>
   <link href="http://dsnotes.com/blog/text2vec/2015/12/01/glove-enwiki"/>
   <updated>2015-12-01T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/text2vec/2015/12/01/glove-enwiki</id>
   <content type="html">&lt;p&gt;Today I will start to publish series of posts about experiments on english wikipedia. As I said before, &lt;a href=&quot;https://github.com/dselivanov/text2vec&quot;&gt;text2vec&lt;/a&gt; is inspired by &lt;a href=&quot;https://github.com/piskvorky/gensim&quot;&gt;gensim&lt;/a&gt; - well designed and quite efficient python library for topic modeling and related NLP tasks. Also I found very useful Radim’s posts, where he tried to evaluate some algorithms on &lt;a href=&quot;http://dumps.wikimedia.org/enwiki/&quot;&gt;english wikipedia dump&lt;/a&gt;. This dataset is rather big. For example, dump for &lt;em&gt;2015-10&lt;/em&gt; (which will be used below) is &lt;strong&gt;12gb bzip2 compressed file&lt;/strong&gt;. In uncompressed form it takes about 50gb. So I can’t call it a “toy” dataset :-) You can download original files &lt;a href=&quot;http://dumps.wikimedia.org/enwiki/&quot;&gt;here&lt;/a&gt;. We are interested in file which ends with &lt;em&gt;“pages-articles.xml.bz2”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All evaluation and timings were done on my MacBook laptop with an Intel Core i7 CPU and 16 GB of RAM.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can find all the code in the &lt;a href=&quot;https://github.com/dselivanov/word_embeddings&quot;&gt;post repository&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h1 id=&quot;preparation&quot;&gt;Preparation&lt;/h1&gt;

&lt;p&gt;After getting the enwiki dump, we need to clean it - that is, remove the wiki XML markup. I didn’t implement this stage in &lt;em&gt;text2vec&lt;/em&gt;, so we will use &lt;em&gt;gensim&lt;/em&gt;’s &lt;a href=&quot;https://github.com/piskvorky/sim-shootout&quot;&gt;scripts&lt;/a&gt; - specifically the file &lt;a href=&quot;https://github.com/piskvorky/sim-shootout/blob/master/prepare_shootout.py&quot;&gt;prepare_shootout.py&lt;/a&gt;. It would not be very hard to implement this in R, but it is not a top priority for me at the moment, so if anybody is willing to help - please see &lt;a href=&quot;https://github.com/dselivanov/text2vec/issues/32&quot;&gt;this issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After cleaning we will have the &lt;em&gt;“title_tokens.txt.gz”&lt;/em&gt; file, which contains the Wikipedia articles - one article per line. Each line consists of two &lt;em&gt;tab-separated&lt;/em&gt; (&lt;code&gt;&quot;\t&quot;&lt;/code&gt;) parts - the title of the article and its text. Texts consist of &lt;em&gt;space-separated&lt;/em&gt; (&lt;code&gt;&quot; &quot;&lt;/code&gt;) words in &lt;em&gt;lowercase&lt;/em&gt;.&lt;/p&gt;
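&lt;p&gt;Parsing this format is straightforward. A minimal sketch (hypothetical Python helper, assuming exactly the layout described above):&lt;/p&gt;

```python
def parse_article(line):
    # each line: article title, a tab, then space-separated lowercase tokens
    title, _, body = line.rstrip("\n").partition("\t")
    return title, body.split(" ")
```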

&lt;h3 id=&quot;r-io-tricks&quot;&gt;R I/O tricks&lt;/h3&gt;

&lt;p&gt;R’s &lt;code&gt;base::readLines()&lt;/code&gt; is a very generic function for reading lines of characters from files/connections, and because of that, &lt;strong&gt;&lt;code&gt;readLines()&lt;/code&gt; is very slow&lt;/strong&gt;. So in text2vec I use &lt;code&gt;readr::read_lines()&lt;/code&gt;, which is more than 10x faster. &lt;code&gt;readr&lt;/code&gt; is a relatively new package, and it has one significant drawback - it doesn’t have a streaming API. This means you can’t read a file line by line - you can only read the whole file in a single function call. Sometimes this can become an issue, but usually it isn’t - the user can manually split a big file into chunks using command-line tools and work with those. Moreover, if you perform analysis on really large amounts of data, you probably use &lt;em&gt;Apache Spark/Hadoop&lt;/em&gt; to prepare the input. And data in &lt;code&gt;hdfs&lt;/code&gt; is usually stored in chunks of 64/128 MB, so it is very natural to work with such chunks instead of a single file.&lt;/p&gt;
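&lt;p&gt;If you do need streaming behaviour on top of a line iterator, buffering lines into fixed-size chunks is enough. A sketch in Python (hypothetical code; the same idea works in R with a connection and repeated reads):&lt;/p&gt;

```python
def iter_line_chunks(lines, chunk_size=10000):
    # lines can be any iterator, e.g. an open file handle,
    # so the whole file never sits in memory at once
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```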

&lt;p&gt;For this post, I split &lt;code&gt;title_tokens.txt.gz&lt;/code&gt; into 100 MB chunks using the &lt;code&gt;split&lt;/code&gt; command-line utility:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;gunzip -c title_tokens.txt.gz &lt;span class=&quot;p&quot;&gt;|&lt;/span&gt; split --line-bytes&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;100m --filter&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&amp;#39;gzip --fast &amp;gt; ~/Downloads/datasets/$FILE.gz&amp;#39;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you are on &lt;strong&gt;OS X&lt;/strong&gt;, install &lt;code&gt;coreutils&lt;/code&gt; first: &lt;code&gt;brew install coreutils&lt;/code&gt; and use &lt;code&gt;gsplit&lt;/code&gt; command:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;gunzip -c title_tokens.txt.gz &lt;span class=&quot;p&quot;&gt;|&lt;/span&gt; gsplit --line-bytes&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;100m --filter&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&amp;#39;gzip --fast &amp;gt; ~/Downloads/datasets/$FILE.gz&amp;#39;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In all the code below we will use the &lt;code&gt;title_tokens.txt.gz&lt;/code&gt; file as input for &lt;em&gt;gensim&lt;/em&gt; and the &lt;code&gt;title_tokens_splits/&lt;/code&gt; directory as input for &lt;em&gt;text2vec&lt;/em&gt;.&lt;/p&gt;

&lt;h1 id=&quot;word-embeddings&quot;&gt;Word embeddings&lt;/h1&gt;

&lt;p&gt;Here I want to demonstrate how to use text2vec’s &lt;a href=&quot;http://nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt; implementation and briefly compare its performance with &lt;a href=&quot;https://code.google.com/p/word2vec/&quot;&gt;word2vec&lt;/a&gt;. Originally I had planned to implement &lt;em&gt;word2vec&lt;/em&gt;, but after reviewing the &lt;a href=&quot;http://www-nlp.stanford.edu/pubs/glove.pdf&quot;&gt;GloVe paper&lt;/a&gt;, I changed my mind. If you still haven’t read it, I strongly recommend doing so.&lt;/p&gt;

&lt;p&gt;So, this post has several goals:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Demonstrate how to process large collections of documents (that don’t fit into RAM) with &lt;strong&gt;text2vec&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Provide a tutorial on &lt;em&gt;text2vec&lt;/em&gt;’s GloVe word embeddings functionality.&lt;/li&gt;
  &lt;li&gt;Compare &lt;em&gt;text2vec GloVe&lt;/em&gt; and &lt;em&gt;gensim word2vec&lt;/em&gt; in terms of:
    &lt;ol&gt;
      &lt;li&gt;accuracy&lt;/li&gt;
      &lt;li&gt;execution time&lt;/li&gt;
      &lt;li&gt;RAM consumption&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Briefly highlight the advantages and drawbacks of the current implementation. (I’ll write a separate post with more details about the technical aspects.)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;baseline&quot;&gt;Baseline&lt;/h2&gt;

&lt;p&gt;Here we will follow Radim’s excellent &lt;a href=&quot;http://rare-technologies.com/making-sense-of-word2vec/&quot;&gt;Making sense of word2vec&lt;/a&gt; post and try to replicate his results.&lt;/p&gt;

&lt;h3 id=&quot;just-to-remind-results&quot;&gt;A quick reminder of his results&lt;/h3&gt;


&lt;p&gt;&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-3-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can find corresponding original repository &lt;a href=&quot;https://github.com/piskvorky/word_embeddings&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;modifications&quot;&gt;Modifications&lt;/h2&gt;
&lt;p&gt;I made a few minor modifications to Radim’s code.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;I don’t evaluate &lt;code&gt;glove-python&lt;/code&gt; for the following reasons:
    &lt;ol&gt;
      &lt;li&gt;Radim uses a dense numpy matrix to store cooccurrences. While this is fine for a 30K vocabulary (a &lt;code&gt;float32&lt;/code&gt; dense matrix occupies ~3.6 GB and takes less time to fill), it is not appropriate for larger vocabularies (for example, a &lt;code&gt;float32&lt;/code&gt; matrix for a 100K vocabulary would occupy ~40 GB).&lt;/li&gt;
      &lt;li&gt;The original &lt;a href=&quot;https://github.com/maciejkula/glove-python&quot;&gt;glove-python&lt;/a&gt; creates a sparse cooccurrence matrix, but for some reason it has very poor performance (accuracy on the analogy task of ~1-2%). I’m not very familiar with Python, so I can’t figure out what is wrong. If somebody can fix this issue - let me know, and I would be happy to add &lt;em&gt;glove-python&lt;/em&gt; to this comparison.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Construct the vocabulary from the top 30K words produced by the &lt;em&gt;text2vec&lt;/em&gt; vocabulary builder. &lt;em&gt;gensim&lt;/em&gt; takes into account the title of the article, which can contain upper-case words, punctuation, etc. I found that models based on a vocabulary constructed only from article bodies (not including titles) are more accurate. This is true for both GloVe and word2vec.&lt;/li&gt;
&lt;/ol&gt;
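&lt;p&gt;The memory figures in the list above are simple back-of-the-envelope arithmetic - a dense &lt;code&gt;float32&lt;/code&gt; matrix stores the squared vocabulary size in cells, 4 bytes each:&lt;/p&gt;

```python
def dense_cooccurrence_bytes(vocab_size, bytes_per_cell=4):
    # float32 dense matrix: vocab_size squared cells, 4 bytes per cell
    return vocab_size ** 2 * bytes_per_cell

gb_30k = dense_cooccurrence_bytes(30_000) / 1e9    # 3.6 GB
gb_100k = dense_cooccurrence_bytes(100_000) / 1e9  # 40.0 GB
```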

&lt;h1 id=&quot;building-the-model&quot;&gt;Building the model&lt;/h1&gt;

&lt;p&gt;I will &lt;strong&gt;focus on text2vec details&lt;/strong&gt; here, because the gensim word2vec code is almost the same as in Radim’s post (again - you can find all the code in &lt;a href=&quot;https://github.com/dselivanov/word_embeddings&quot;&gt;this repo&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Install &lt;em&gt;text2vec&lt;/em&gt; from github:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;devtools&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;install_github&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;dselivanov/text2vec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id=&quot;vocabulary&quot;&gt;Vocabulary&lt;/h2&gt;

&lt;p&gt;First of all we need to build a vocabulary:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;text2vec&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# create iterator over files in directory&lt;/span&gt;
it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; idir&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/datasets/splits/&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# create iterator over tokens&lt;/span&gt;
it2 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
              preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
                str_split&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; fixed&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;\t&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
                &lt;span class=&quot;c1&quot;&gt;# select only the body of the article&lt;/span&gt;
                &lt;span class=&quot;kp&quot;&gt;sapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;.&lt;/span&gt;subset2&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
              tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; str_split&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; fixed&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;On my machine this takes about &lt;em&gt;1150 sec&lt;/em&gt;, while &lt;em&gt;gensim&lt;/em&gt;’s &lt;code&gt;gensim.corpora.Dictionary()&lt;/code&gt; takes about &lt;em&gt;2100 sec&lt;/em&gt;. Raw I/O alone takes &lt;em&gt;~ 150 sec&lt;/em&gt;.
&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-6-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;pruning-vocabulary&quot;&gt;Pruning vocabulary&lt;/h3&gt;

&lt;p&gt;Now we have all unique words and their corresponding statistics:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;str&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;vocab&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##List of 2&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;## $ vocab:&amp;#39;data.frame&amp;#39;:	8306153 obs. of  4 variables:&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##  ..$ terms          : chr [1:8306153] &amp;quot;bonnerj&amp;quot; &amp;quot;beerworthc&amp;quot; &amp;quot;danielst&amp;quot; &amp;quot;anchaka&amp;quot; ...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##  ..$ terms_counts   : int [1:8306153] 1 1 1 1 1 1 1 1 1 1 ...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##  ..$ doc_counts     : int [1:8306153] 1 1 1 1 1 1 1 1 1 1 ...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##  ..$ doc_proportions: num [1:8306153] 2.55e-07 2.55e-07 2.55e-07 2.55e-07 2.55e-07 ...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;## $ ngram: Named int [1:2] 1 1&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##  ..- attr(*, &amp;quot;names&amp;quot;)= chr [1:2] &amp;quot;ngram_min&amp;quot; &amp;quot;ngram_max&amp;quot;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;## - attr(*, &amp;quot;class&amp;quot;)= chr &amp;quot;text2vec_vocabulary&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But we are only interested in frequent words, so we should filter out rare ones. text2vec provides the &lt;code&gt;prune_vocabulary()&lt;/code&gt; function, which has many useful options for this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;TOKEN_LIMIT &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;30000L&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# filter out tokens which appear in more than 30% of documents&lt;/span&gt;
TOKEN_DOC_PROPORTION &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.3&lt;/span&gt;
pruned_vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; prune_vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;vocabulary &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; vocab&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                     doc_proportion_max &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; TOKEN_DOC_PROPORTION&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                     max_number_of_terms &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; TOKEN_LIMIT&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# save to csv to use in gensim word2vec&lt;/span&gt;
write.table&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;data.frame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;word&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; pruned_vocab&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;vocab&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;terms&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;id&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;TOKEN_LIMIT &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt; 
            file &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;/path/to/destination/dir/pruned_vocab.csv&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            quote &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; sep &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; row.names &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; col.names &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id=&quot;corpus-construction&quot;&gt;Corpus construction&lt;/h2&gt;

&lt;p&gt;Now we have a vocabulary and can construct the term-cooccurrence matrix (&lt;strong&gt;tcm&lt;/strong&gt;).&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;WINDOW &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10L&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# create iterator over files in directory&lt;/span&gt;
it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; idir&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/datasets/splits/&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# create iterator over tokens&lt;/span&gt;
it2 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
              preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
                str_split&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; fixed&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;\t&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
                &lt;span class=&quot;c1&quot;&gt;# select only the body of the article&lt;/span&gt;
                &lt;span class=&quot;kp&quot;&gt;sapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;.&lt;/span&gt;subset2&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
              tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; str_split&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; fixed&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# create_vocab_corpus can construct the document-term matrix and the term-cooccurrence matrix simultaneously&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# here we are not interested in the document-term matrix, so we set `grow_dtm = FALSE`&lt;/span&gt;
corpus &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_vocab_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it2&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; vocabulary &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; pruned_vocab&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; grow_dtm &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; skip_grams_window &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; WINDOW&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# in this call, we wrap std::unordered_map into R&amp;#39;s dgTMatrix&lt;/span&gt;
tcm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_tcm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The operation above takes about &lt;strong&gt;80 minutes&lt;/strong&gt; on my machine and consumes about &lt;strong&gt;11gb of RAM&lt;/strong&gt; at peak.&lt;/p&gt;

&lt;h3 id=&quot;short-note-on-memory-consumption&quot;&gt;Short note on memory consumption&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;At the moment, during corpus construction, &lt;em&gt;text2vec&lt;/em&gt; keeps the entire term-cooccurrence matrix in memory&lt;/strong&gt;. This may change in future versions (quite easily, via a simple map-reduce style algorithm - exactly the way it is done in the original Stanford implementation).&lt;/p&gt;

&lt;p&gt;As you can see, memory consumption is rather high. But some basic calculations show that it is actually reasonable:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Internally &lt;em&gt;text2vec&lt;/em&gt; stores &lt;code&gt;tcm&lt;/code&gt; as &lt;code&gt;std::unordered_map&amp;lt;std::pair&amp;lt;uint32_t, uint32_t&amp;gt;, float&amp;gt;&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;We &lt;strong&gt;store&lt;/strong&gt; only elements which are &lt;strong&gt;above main diagonal&lt;/strong&gt;, because our matrix is &lt;strong&gt;symmetric&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;The &lt;code&gt;tcm&lt;/code&gt; above consists of ~ &lt;code&gt;200e6&lt;/code&gt; elements stored above the diagonal. The matrix is &lt;strong&gt;quite dense - ~ 22% non-zero elements&lt;/strong&gt; (storing only the upper triangle; with full symmetric storage it would be ~ 44% dense).&lt;/li&gt;
&lt;/ol&gt;
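&lt;p&gt;The symmetric upper-triangle storage in point 2 can be illustrated with a tiny base R sketch (toy data, not text2vec internals): since the cooccurrence of words &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt; equals that of &lt;em&gt;b&lt;/em&gt; and &lt;em&gt;a&lt;/em&gt;, we keep only triplets whose row index does not exceed the column index, and normalize the lookup order:&lt;/p&gt;

```r
# toy upper-triangle triplet store: only pairs with row index up to column index
i = c(1L, 1L, 2L)
j = c(2L, 3L, 3L)
x = c(5, 2, 1)

# symmetric lookup: reorder the query so it always hits the stored triangle
get_cooc = function(a, b) {
  k = which(i == min(a, b) & j == max(a, b))
  if (length(k)) x[k] else 0
}

get_cooc(3, 1)  # same cell as get_cooc(1, 3)
```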

&lt;p&gt;So &lt;em&gt;200e6 * (4 + 4 + 8) bytes = ~ 3.2 gb&lt;/em&gt; is just the memory needed to store our matrix in sparse triplet form using preallocated vectors. On top of that we should add the usual &lt;strong&gt;3-4x &lt;code&gt;std::unordered_map&lt;/code&gt; overhead&lt;/strong&gt;, plus the memory allocated when wrapping the &lt;code&gt;unordered_map&lt;/code&gt; into an R sparse triplet &lt;code&gt;dgTMatrix&lt;/code&gt;.&lt;/p&gt;
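&lt;p&gt;The estimate is easy to reproduce (the 3-4x hash-map factor is an assumption - a typical figure for &lt;code&gt;std::unordered_map&lt;/code&gt;, not a measured one):&lt;/p&gt;

```r
n_nonzero = 200e6              # triplets stored above the diagonal
bytes_per_triplet = 4 + 4 + 8  # two uint32_t indices + 8-byte value
triplet_gb = n_nonzero * bytes_per_triplet / 1e9
triplet_gb                     # 3.2

# with an assumed 3-4x hash-map overhead on top:
range_gb = triplet_gb * c(3, 4)
```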

&lt;h2 id=&quot;glove-training&quot;&gt;GloVe training&lt;/h2&gt;

&lt;p&gt;We fit GloVe using AdaGrad - stochastic gradient descent with a per-feature adaptive learning rate. Fitting is done in a fully parallel and asynchronous manner (see the &lt;a href=&quot;https://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf&quot;&gt;Hogwild! paper&lt;/a&gt;), so it benefits from machines with multiple cores. In my tests I achieved an almost 8x speedup on an 8-core machine on the wikipedia dataset described above.&lt;/p&gt;
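&lt;p&gt;For intuition, a per-feature AdaGrad update looks roughly like this (a minimal R sketch of the general rule; the actual update lives in the package’s C++ code and may differ in details):&lt;/p&gt;

```r
# one AdaGrad step for a single parameter vector:
# each coordinate gets its own effective learning rate,
# shrinking as squared gradients accumulate in grad_sq_hist
adagrad_step = function(w, grad, grad_sq_hist, learning_rate = 0.15) {
  grad_sq_hist = grad_sq_hist + grad ^ 2
  w = w - learning_rate * grad / sqrt(grad_sq_hist)
  list(w = w, grad_sq_hist = grad_sq_hist)
}

# on the first step each coordinate moves by roughly learning_rate,
# regardless of the raw gradient magnitude
s = adagrad_step(w = c(0.5, 0.5), grad = c(1, 0.01),
                 grad_sq_hist = c(1e-8, 1e-8))
```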

&lt;p&gt;Now we are ready to train our GloVe model. Here we will perform at most 20 iterations, tracking the global cost and its improvement across iterations. We stop fitting when the improvement (relative to the previous epoch) becomes smaller than a given threshold - &lt;code&gt;convergence_threshold&lt;/code&gt;.&lt;/p&gt;
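&lt;p&gt;The stopping rule itself is just a relative-improvement check; a sketch (not the package’s internal code):&lt;/p&gt;

```r
# stop when the relative cost improvement over the previous
# epoch falls below convergence_threshold
should_stop = function(cost_prev, cost_cur, convergence_threshold = 0.005) {
  (cost_prev - cost_cur) / cost_prev < convergence_threshold
}

should_stop(cost_prev = 0.020,  cost_cur = 0.015)    # 25% improvement: keep going
should_stop(cost_prev = 0.0150, cost_cur = 0.01499)  # ~0.07% improvement: stop
```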

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;DIM &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;600L&lt;/span&gt;
X_MAX &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;100L&lt;/span&gt;
WORKERS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4L&lt;/span&gt;
NUM_ITERS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;20L&lt;/span&gt;
CONVERGENCE_THRESHOLD &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.005&lt;/span&gt;
LEARNING_RATE &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.15&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# explicitly set number of threads&lt;/span&gt;
RcppParallel&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;setThreadOptions&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;numThreads &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; WORKERS&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; glove&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tcm &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; tcm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             word_vectors_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; DIM&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             num_iters &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; NUM_ITERS&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
             learning_rate &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; LEARNING_RATE&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
             x_max &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; X_MAX&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             shuffle_seed &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;42L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             &lt;span class=&quot;c1&quot;&gt;# we will stop if the global cost improves by less than 0.5% over the previous SGD iteration&lt;/span&gt;
             convergence_threshold &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; CONVERGENCE_THRESHOLD&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This takes about &lt;strong&gt;431 minutes&lt;/strong&gt; on my machine and stops at iteration 20 (no early stopping):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;2015-12-01 06:37:27 - epoch 20, expected cost 0.0145&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt; on the analogy dataset is &lt;strong&gt;0.759&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;words &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tcm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
m &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;word_vectors&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;w_i &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;word_vectors&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;w_j
&lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  words

questions_file &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/datasets/questions-words.txt&amp;#39;&lt;/span&gt;
qlst &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; prepare_analogue_questions&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;questions_file&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
res &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; check_analogue_accuracy&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;questions_lst &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; qlst&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; m_word_vectors &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
  &lt;p&gt;2015-12-01 06:48:23 - capital-common-countries: correct 476 out of 506, accuracy = 0.9407&lt;br /&gt;
2015-12-01 06:48:27 - capital-world: correct 2265 out of 2359, accuracy = 0.9602&lt;br /&gt;
2015-12-01 06:48:28 - currency: correct 4 out of 86, accuracy = 0.0465&lt;br /&gt;
2015-12-01 06:48:31 - city-in-state: correct 1828 out of 2330, accuracy = 0.7845&lt;br /&gt;
2015-12-01 06:48:32 - family: correct 272 out of 306, accuracy = 0.8889&lt;br /&gt;
2015-12-01 06:48:33 - gram1-adjective-to-adverb: correct 179 out of 650, accuracy = 0.2754&lt;br /&gt;
2015-12-01 06:48:33 - gram2-opposite: correct 131 out of 272, accuracy = 0.4816&lt;br /&gt;
2015-12-01 06:48:34 - gram3-comparative: correct 806 out of 930, accuracy = 0.8667&lt;br /&gt;
2015-12-01 06:48:35 - gram4-superlative: correct 279 out of 506, accuracy = 0.5514&lt;br /&gt;
2015-12-01 06:48:37 - gram5-present-participle: correct 445 out of 870, accuracy = 0.5115&lt;br /&gt;
2015-12-01 06:48:39 - gram6-nationality-adjective: correct 1364 out of 1371, accuracy = 0.9949&lt;br /&gt;
2015-12-01 06:48:41 - gram7-past-tense: correct 836 out of 1406, accuracy = 0.5946&lt;br /&gt;
2015-12-01 06:48:42 - gram8-plural: correct 833 out of 1056, accuracy = 0.7888&lt;br /&gt;
2015-12-01 06:48:43 - gram9-plural-verbs: correct 341 out of 600, accuracy = 0.5683&lt;br /&gt;
2015-12-01 06:48:43 - OVERALL ACCURACY = 0.7593&lt;/p&gt;
&lt;/blockquote&gt;
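&lt;p&gt;For reference, a single analogy question (&lt;em&gt;a&lt;/em&gt; is to &lt;em&gt;b&lt;/em&gt; as &lt;em&gt;c&lt;/em&gt; is to ?) is answered by a nearest-neighbour search around &lt;em&gt;b - a + c&lt;/em&gt; in vector space. A toy base R sketch with made-up 2-d vectors (for illustration only - this is not text2vec’s evaluation code):&lt;/p&gt;

```r
# made-up 2-d word vectors, chosen so the analogy works out
vecs = rbind(king  = c(1, 2),
             man   = c(1, 1),
             woman = c(2, 1),
             queen = c(2, 2),
             apple = c(9, 0))
target = vecs["king", ] - vecs["man", ] + vecs["woman", ]

cosine = function(u, v) sum(u * v) / sqrt(sum(u ^ 2) * sum(v ^ 2))

# rank all words except the question words by cosine similarity to the target
candidates = setdiff(rownames(vecs), c("king", "man", "woman"))
sims = sapply(candidates, function(w) cosine(vecs[w, ], target))
names(which.max(sims))  # "queen"
```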

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; that sometimes AdaGrad converges to a poorer local minimum with a larger cost, which means the model produces less accurate predictions. For example, in some experiments while writing this post I ended up with &lt;code&gt;cost = 0.190&lt;/code&gt; and &lt;code&gt;accuracy = ~ 0.72&lt;/code&gt;. Also, fitting &lt;strong&gt;can be sensitive to the initial learning rate&lt;/strong&gt;; some experimentation is still needed.&lt;/p&gt;

&lt;p&gt;Training &lt;strong&gt;word2vec takes 401 minutes&lt;/strong&gt; and reaches &lt;strong&gt;accuracy = 0.687&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As we can see, GloVe shows significantly better accuracy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-12-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A closer look at resource usage:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-13-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;faster-training&quot;&gt;Faster training&lt;/h3&gt;

&lt;p&gt;If you care more about training time, you can do the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Introduce more zeros into &lt;code&gt;tcm&lt;/code&gt; by removing very rare cooccurrences:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;ind &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;
    tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;ind&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;i &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;i&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;ind&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;j &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tcm&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;j&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;ind&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
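&lt;p&gt;On a toy triplet the effect of this filtering is easy to see (plain vectors stand in for the &lt;code&gt;dgTMatrix&lt;/code&gt; slots):&lt;/p&gt;

```r
# toy stand-ins for tcm@x, tcm@i, tcm@j
x = c(0.5, 3, 1, 0.2)
i = c(0L, 1L, 2L, 3L)
j = c(1L, 2L, 3L, 0L)

ind = x >= 1           # keep only cooccurrences with weight of at least 1
x = x[ind]; i = i[ind]; j = j[ind]
length(x)              # 2 of the 4 triplets survive
```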

&lt;ol start=&quot;2&quot;&gt;
  &lt;li&gt;Perform only 10 iterations with a lower word vector dimension (&lt;code&gt;DIM = 300&lt;/code&gt; for example).&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;DIM &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;300L&lt;/span&gt;
    NUM_ITERS &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;
    fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; glove&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tcm &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; tcm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 word_vectors_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; DIM&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 num_iters &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; NUM_ITERS&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 learning_rate &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 x_max &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; X_MAX&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 shuffle_seed &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;42L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 max_cost &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# we will stop if the global cost improves by less than 1% over the previous SGD iteration&lt;/span&gt;
                 convergence_threshold &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    words &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tcm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    m &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;word_vectors&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;w_i &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;word_vectors&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;w_j
    &lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  words
    questions_file &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/datasets/questions-words.txt&amp;#39;&lt;/span&gt;
    qlst &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; prepare_analogue_questions&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;questions_file&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;rownames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    res &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; check_analogue_accuracy&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;questions_lst &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; qlst&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; m_word_vectors &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Training takes about 50 minutes on a 4-core machine and reaches ~68% accuracy:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;2015-11-30 15:13:51 - capital-common-countries: correct 482 out of 506, accuracy = 0.9526&lt;br /&gt;
2015-11-30 15:13:54 - capital-world: correct 2235 out of 2359, accuracy = 0.9474&lt;br /&gt;
2015-11-30 15:13:54 - currency: correct 1 out of 86, accuracy = 0.0116&lt;br /&gt;
2015-11-30 15:13:57 - city-in-state: correct 1540 out of 2330, accuracy = 0.6609&lt;br /&gt;
2015-11-30 15:13:57 - family: correct 247 out of 306, accuracy = 0.8072&lt;br /&gt;
2015-11-30 15:13:58 - gram1-adjective-to-adverb: correct 142 out of 650, accuracy = 0.2185&lt;br /&gt;
2015-11-30 15:13:58 - gram2-opposite: correct 87 out of 272, accuracy = 0.3199&lt;br /&gt;
2015-11-30 15:13:59 - gram3-comparative: correct 663 out of 930, accuracy = 0.7129&lt;br /&gt;
2015-11-30 15:14:00 - gram4-superlative: correct 171 out of 506, accuracy = 0.3379&lt;br /&gt;
2015-11-30 15:14:01 - gram5-present-participle: correct 421 out of 870, accuracy = 0.4839&lt;br /&gt;
2015-11-30 15:14:03 - gram6-nationality-adjective: correct 1340 out of 1371, accuracy = 0.9774&lt;br /&gt;
2015-11-30 15:14:04 - gram7-past-tense: correct 608 out of 1406, accuracy = 0.4324&lt;br /&gt;
2015-11-30 15:14:06 - gram8-plural: correct 771 out of 1056, accuracy = 0.7301&lt;br /&gt;
2015-11-30 15:14:06 - gram9-plural-verbs: correct 266 out of 600, accuracy = 0.4433&lt;br /&gt;
2015-11-30 15:14:06 - OVERALL ACCURACY = 0.6774&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-16-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-12-01-glove-enwiki/unnamed-chunk-17-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;summary&quot;&gt;Summary&lt;/h1&gt;

&lt;h2 id=&quot;advantages&quot;&gt;Advantages&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;As we have seen, &lt;em&gt;text2vec&lt;/em&gt;’s &lt;em&gt;GloVe&lt;/em&gt; implementation is a good alternative to word2vec and outperforms it in terms of accuracy and running time (we can pick a set of parameters with which it is &lt;strong&gt;both faster and more accurate&lt;/strong&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Early stopping&lt;/strong&gt;. We can stop training when improvements become small.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;tcm&lt;/code&gt; is &lt;strong&gt;reusable&lt;/strong&gt;. It may be fairer to subtract the &lt;code&gt;tcm&lt;/code&gt; creation time from the benchmarks above.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Incremental fitting&lt;/strong&gt;. You can easily adjust &lt;code&gt;tcm&lt;/code&gt; with new data and continue fitting. And it will converge very quickly.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;text2vec&lt;/em&gt; works on OS X, Linux and even Windows without any tricks/adjustments/manual configuration, thanks to &lt;em&gt;Intel Threading Building Blocks&lt;/em&gt; and &lt;a href=&quot;http://rcppcore.github.io/RcppParallel/&quot;&gt;RcppParallel&lt;/a&gt;. It was a little simpler to program AdaGrad using &lt;em&gt;OpenMP&lt;/em&gt; (as I actually did in my first attempt), but that leads to installation issues, especially on OS X and Windows machines.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;drawbacks-and-what-can-be-improved&quot;&gt;Drawbacks and what can be improved&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;One &lt;strong&gt;drawback&lt;/strong&gt;: it &lt;strong&gt;uses a lot of memory&lt;/strong&gt; (in contrast to &lt;em&gt;gensim&lt;/em&gt;, which is very memory-friendly - I was very impressed). But this is natural - the fastest way to construct the &lt;code&gt;tcm&lt;/code&gt; is to keep it in RAM as a hash map and perform cooccurrence increments globally. Also note that we build the &lt;code&gt;tcm&lt;/code&gt; on the top 30000 terms, which is why it is so dense. I tried to build the model on the top 100000 terms and had no problems on a machine with 32gb of RAM; the matrix was much sparser - ~ 4% non-zero elements. In any case, one could implement a simple map-reduce style algorithm to construct the &lt;code&gt;tcm&lt;/code&gt; (using files, as it is done in the original Stanford implementation) and then fit the model in a streaming manner.&lt;/li&gt;
  &lt;li&gt;Sometimes the model is quite &lt;strong&gt;sensitive to the initial learning rate&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

</content>
 </entry>
 
 <entry>
   <title>Analyzing texts with text2vec package.</title>
   <link href="http://dsnotes.com/blog/text2vec/2015/11/09/text2vec"/>
   <updated>2015-11-09T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/text2vec/2015/11/09/text2vec</id>
   <content type="html">&lt;p&gt;In the last weeks I have actively worked on &lt;a href=&quot;https://github.com/dselivanov/text2vec&quot;&gt;text2vec&lt;/a&gt; (formerly tmlite) - R package, which provides tools for fast text vectorization and state-of-the art word embeddings.&lt;/p&gt;

&lt;p&gt;This project is an experiment for me - how much can a single person do in a particular area? After these hard weeks, I believe the answer is: quite a lot.&lt;/p&gt;

&lt;p&gt;There are a lot of changes from my previous &lt;a href=&quot;http://dsnotes.com/blog/2015/09/16/tmlite-intro/&quot;&gt;introduction post&lt;/a&gt;, and I want to highlight a few of them:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The package was renamed to &lt;strong&gt;text2vec&lt;/strong&gt; because, I believe, this name better reflects its functionality.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;New API&lt;/strong&gt;: cleaner and more concise.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;GloVe&lt;/strong&gt; word embeddings. Training is &lt;strong&gt;fully parallelized&lt;/strong&gt; - asynchronous SGD with adaptive learning rate (AdaGrad). &lt;strong&gt;Works on all platforms, including windows.&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Added &lt;strong&gt;ngram&lt;/strong&gt; feature to vectorization. Now it is very easy to build &lt;em&gt;Document-Term matrix&lt;/em&gt;, using arbitrary &lt;code&gt;ngrams&lt;/code&gt; instead of simple &lt;code&gt;unigrams&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Switched to &lt;code&gt;MurmurHash3&lt;/code&gt; for &lt;strong&gt;feature hashing&lt;/strong&gt; and add &lt;code&gt;signed_hash&lt;/code&gt; option, which can &lt;a href=&quot;https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick&quot;&gt;reduce the effect of collisions&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Now text2vec uses regular exressions engine from &lt;code&gt;stringr&lt;/code&gt; package (which is built on top of &lt;code&gt;stringi&lt;/code&gt;). Now &lt;code&gt;regexp_tokenizer&lt;/code&gt; much is more fast and robust. Simple &lt;code&gt;word_tokenizer&lt;/code&gt;is also provided.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post I’ll focus on text vectorization tools provided by &lt;a href=&quot;https://github.com/dselivanov/text2vec&quot;&gt;text2vec&lt;/a&gt;. Also, it will be a base for a &lt;code&gt;text2vec&lt;/code&gt; vignette. I’ll write another post about &lt;a href=&quot;http://nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt; next week, don’t miss it.&lt;/p&gt;

&lt;p&gt;Please don’t forget to install &lt;code&gt;text2vec&lt;/code&gt; first:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;devtools&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;install_github&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;dselivanov/text2vec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1 id=&quot;features&quot;&gt;Features&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;text2vec&lt;/strong&gt; is a package whose main goal is to provide an &lt;strong&gt;efficient framework&lt;/strong&gt; with a &lt;strong&gt;concise API&lt;/strong&gt; for &lt;strong&gt;text analysis&lt;/strong&gt; and &lt;strong&gt;natural language processing (NLP)&lt;/strong&gt; in R. It is inspired by &lt;a href=&quot;http://radimrehurek.com/gensim/&quot;&gt;gensim&lt;/a&gt;, an excellent Python library for NLP.&lt;/p&gt;

&lt;h2 id=&quot;core-functionality&quot;&gt;Core functionality&lt;/h2&gt;

&lt;p&gt;At the moment we cover the following two topics:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Fast text vectorization on arbitrary n-grams.
    &lt;ul&gt;
      &lt;li&gt;using vocabulary&lt;/li&gt;
      &lt;li&gt;using feature hashing&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;State-of-the-art &lt;a href=&quot;http://www-nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt; word embeddings.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;efficiency&quot;&gt;Efficiency&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The core of the functionality is &lt;strong&gt;carefully written in C++&lt;/strong&gt;. This also means text2vec is &lt;strong&gt;memory-friendly&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Some parts (GloVe training) are fully &lt;strong&gt;parallelized&lt;/strong&gt; using the excellent &lt;a href=&quot;http://rcppcore.github.io/RcppParallel/&quot;&gt;RcppParallel&lt;/a&gt; package. This means &lt;strong&gt;parallel features work on OS X, Linux, Windows and Solaris (x86) without any additional tuning/hacking/tricks&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Streaming API&lt;/strong&gt;: users don’t have to load all the data into RAM. &lt;strong&gt;text2vec&lt;/strong&gt; allows processing streams of chunks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;api&quot;&gt;API&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Built around &lt;a href=&quot;https://en.wikipedia.org/wiki/Iterator&quot;&gt;iterator&lt;/a&gt; abstraction.&lt;/li&gt;
  &lt;li&gt;Concise, provides only a few functions, which do their job well.&lt;/li&gt;
  &lt;li&gt;Does not (and probably will not) provide trivial, very high-level functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;terminology-and-what-is-under-the-hood&quot;&gt;Terminology and what is under the hood&lt;/h1&gt;

&lt;p&gt;As stated before, text2vec is built around a streaming API and &lt;strong&gt;iterators&lt;/strong&gt;, which allow construction of the &lt;strong&gt;corpus&lt;/strong&gt; from &lt;em&gt;iterable&lt;/em&gt; objects. Here we touch on two main concepts:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Corpus&lt;/strong&gt;. In text2vec this is an object which contains tokens and other information / metainformation used for text vectorization and other processing. We can efficiently insert documents into a corpus because, technically, &lt;strong&gt;Corpus&lt;/strong&gt; is a C++ class, wrapped with &lt;em&gt;Rcpp-modules&lt;/em&gt; as a &lt;em&gt;reference class&lt;/em&gt; (which has reference semantics!). Usually the user does not need to care about this, but should keep in mind the nature of such objects. In particular, remember that these objects cannot be saved/serialized using R’s &lt;code&gt;save*()&lt;/code&gt; methods. The good news is that the corresponding R objects can be easily and efficiently extracted from the corpus and used in the usual way.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Iterators&lt;/strong&gt;. If you are not familiar with them in the &lt;code&gt;R&lt;/code&gt; context, I highly recommend reviewing the vignettes of the &lt;a href=&quot;https://cran.r-project.org/web/packages/iterators/&quot;&gt;iterators&lt;/a&gt; package. A big advantage of this abstraction is that it allows us to be &lt;strong&gt;agnostic to the type of input&lt;/strong&gt; - we can transparently change it by just providing a correct iterator.&lt;/li&gt;
&lt;/ol&gt;
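&lt;p&gt;As a tiny illustration of the iterator abstraction, here is a sketch using the same &lt;code&gt;itoken&lt;/code&gt; call as in the examples below (the toy &lt;code&gt;docs&lt;/code&gt; vector is made up for illustration):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;library(text2vec)
# any iterable collection of documents works; here a plain character vector
docs &amp;lt;- c(&amp;#39;This movie is great&amp;#39;, &amp;#39;This movie is awful&amp;#39;)
# itoken() wraps it into an iterator over preprocessed, tokenized chunks
it &amp;lt;- itoken(docs, preprocess_function = tolower, tokenizer = word_tokenizer)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;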

&lt;h1 id=&quot;text-vectorization&quot;&gt;Text vectorization&lt;/h1&gt;

&lt;p&gt;Historically, most text-mining and NLP modelling has been based on &lt;a href=&quot;https://en.wikipedia.org/wiki/Bag-of-words_model&quot;&gt;Bag-of-words&lt;/a&gt; or &lt;a href=&quot;https://en.wikipedia.org/wiki/N-gram&quot;&gt;Bag-of-ngrams&lt;/a&gt; models. Despite their simplicity, these models usually demonstrate good performance on text categorization/classification tasks. But, in contrast to their theoretical simplicity and practical efficiency, building &lt;em&gt;bag-of-words&lt;/em&gt; models involves technical challenges, especially within the &lt;code&gt;R&lt;/code&gt; framework, because of its copy-on-modify semantics.&lt;/p&gt;

&lt;h2 id=&quot;pipeline&quot;&gt;Pipeline&lt;/h2&gt;

&lt;p&gt;Let’s briefly review some details of a typical analysis pipeline:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Usually the researcher has to construct a &lt;a href=&quot;https://en.wikipedia.org/wiki/Document-term_matrix&quot;&gt;Document-Term matrix&lt;/a&gt; (DTM) from input documents. In other words, &lt;strong&gt;vectorize text&lt;/strong&gt; - create a mapping from words/ngrams to a &lt;a href=&quot;https://en.wikipedia.org/wiki/Vector_space_model&quot;&gt;vector space&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Fit a model on this DTM. This can include:
    &lt;ul&gt;
      &lt;li&gt;text classification&lt;/li&gt;
      &lt;li&gt;topic modeling&lt;/li&gt;
      &lt;li&gt;…&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Tune and validate the model.&lt;/li&gt;
  &lt;li&gt;Apply the model to new data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here we will mostly discuss the first stage. The underlying texts can take a lot of space, but the vectorized ones usually do not, because they are stored as sparse matrices. For the reason above - copy-on-modify semantics - it is not easy in R to grow a DTM iteratively. So the construction of such objects, even for small collections of documents, can become a serious headache for analysts and researchers. It usually involves reading the whole collection of text documents into RAM and processing it as a single vector, which can easily increase memory consumption by a factor of 2 to 4 (and, to tell the truth, this is quite optimistic). Fortunately, there is a better way - the text2vec way. Let’s check how it works on a simple example.&lt;/p&gt;

&lt;h2 id=&quot;sentiment-analysis-on-imdb-moview-review-dataset&quot;&gt;Sentiment analysis on the IMDB movie review dataset&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;text2vec&lt;/strong&gt; provides the &lt;code&gt;movie_review&lt;/code&gt; dataset. It consists of 25000 movie reviews, each marked as positive or negative.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;text2vec&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Loading required package: methods&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;data&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;movie_review&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# str(movie_review, nchar.max = 20, width = 80, strict.width = &amp;#39;wrap&amp;#39;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To represent documents in vector space, first of all we have to create &lt;code&gt;term -&amp;gt; term_id&lt;/code&gt; mappings. We use the term &lt;em&gt;term&lt;/em&gt; instead of &lt;em&gt;word&lt;/em&gt; because it can actually be an arbitrary &lt;em&gt;ngram&lt;/em&gt;, not just a single word. Having a set of documents, we want to represent them as a &lt;em&gt;sparse matrix&lt;/em&gt;, where each row corresponds to a &lt;em&gt;document&lt;/em&gt; and each column corresponds to a &lt;em&gt;term&lt;/em&gt;. This can be done in two ways: using a &lt;strong&gt;vocabulary&lt;/strong&gt;, or by &lt;strong&gt;feature hashing&lt;/strong&gt; (the hashing trick).&lt;/p&gt;
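&lt;p&gt;The hashing route follows the same streaming pattern as the vocabulary route below, but maps terms to column indices with a hash function instead of a dictionary. The sketch below is only an assumption about the interface - the function name &lt;code&gt;create_hash_corpus&lt;/code&gt; and its arguments &lt;code&gt;hash_size&lt;/code&gt; and &lt;code&gt;signed_hash&lt;/code&gt; are illustrative; check the package documentation for the exact signature:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;it &amp;lt;- itoken(movie_review[[&amp;#39;review&amp;#39;]], preprocess_function = tolower,
             tokenizer = word_tokenizer)
# hypothetical call: 2^18 hash buckets; a signed hash can reduce
# the effect of collisions
corpus &amp;lt;- create_hash_corpus(it, hash_size = 2^18, signed_hash = TRUE)
dtm &amp;lt;- get_dtm(corpus)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;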

&lt;h3 id=&quot;vocabulary-based-vectorization&quot;&gt;Vocabulary based vectorization&lt;/h3&gt;
&lt;p&gt;Let’s examine the first choice. Here we collect unique terms from all documents and mark each of them with a &lt;em&gt;unique_id&lt;/em&gt;. The &lt;code&gt;vocabulary()&lt;/code&gt; function is designed specially for this purpose.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# using unigrams here&lt;/span&gt;
t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ngram &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 3.587275 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# str(vocab, nchar.max = 20, width = 80, strict.width = &amp;#39;wrap&amp;#39;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we can construct the DTM. Again, since all functions related to &lt;em&gt;corpus&lt;/em&gt; construction have a streaming API, we have to create an &lt;em&gt;iterator&lt;/em&gt; and provide it to the &lt;code&gt;create_vocab_corpus&lt;/code&gt; function:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
corpus &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_vocab_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; vocabulary &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; vocab&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We got the DTM. Let’s check its dimensions:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 25000 85752&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, it has 25000 rows (equal to the number of documents) and 85752 columns (equal to the number of unique terms).
Now we are ready to fit our first model. Here we will use the &lt;code&gt;glmnet&lt;/code&gt; package to fit a &lt;em&gt;logistic regression&lt;/em&gt; with an &lt;em&gt;L1&lt;/em&gt; penalty.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;glmnet&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                 family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;binomial&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 &lt;span class=&quot;c1&quot;&gt;# lasso penalty&lt;/span&gt;
                 alpha &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# we are interested in area under ROC curve&lt;/span&gt;
                 type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# 5-fold cross-validation&lt;/span&gt;
                 nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# high value, less accurate, but faster training&lt;/span&gt;
                 thresh &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# again lower number iterations for faster training&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# in this vignette&lt;/span&gt;
                 maxit &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 42.67177 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-11-09-text2vec/fit_1-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.9457&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that the training time is quite high. We can reduce it and also significantly improve accuracy.&lt;/p&gt;

&lt;h3 id=&quot;pruning-vocabulary&quot;&gt;Pruning vocabulary&lt;/h3&gt;

&lt;p&gt;We will prune our vocabulary. For example, we can find words like &lt;em&gt;“a”&lt;/em&gt;, &lt;em&gt;“the”&lt;/em&gt;, &lt;em&gt;“in”&lt;/em&gt; in almost all documents, but they don’t actually carry any useful information. They are usually called &lt;a href=&quot;https://en.wikipedia.org/wiki/Stop_words&quot;&gt;stop words&lt;/a&gt;. At the other extreme, the corpus also contains very &lt;em&gt;uncommon terms&lt;/em&gt;, which appear in only a few documents. These terms are also useless, because we don’t have sufficient statistics for them. Here we will filter both kinds out:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# remove very common and uncommon words&lt;/span&gt;
pruned_vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; prune_vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;vocab&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; term_count_min &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
 doc_proportion_max &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; doc_proportion_min &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
corpus &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_vocab_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; vocabulary &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; pruned_vocab&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id=&quot;tf-idf&quot;&gt;TF-IDF&lt;/h3&gt;

&lt;p&gt;Also we can (and usually should!) apply the &lt;strong&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Tf%E2%80%93idf&quot;&gt;TF-IDF&lt;/a&gt; transformation&lt;/strong&gt;, which will increase the weight of document-specific terms and decrease the weight of widely used terms:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dtm &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; tfidf_transformer&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## idf scaling matrix not provided, calculating it form input matrix&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 25000 10535&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, let’s fit our model again:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                 family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;binomial&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 &lt;span class=&quot;c1&quot;&gt;# lasso penalty&lt;/span&gt;
                 alpha &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# we are interested in area under ROC curve&lt;/span&gt;
                 type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# 5-fold cross-validation&lt;/span&gt;
                 nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# high value, less accurate, but faster training&lt;/span&gt;
                 thresh &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# again lower number iterations for faster training&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# in this vignette&lt;/span&gt;
                 maxit &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 19.19166 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-11-09-text2vec/fit_2-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.9497&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, we obtain faster training and a larger AUC.&lt;/p&gt;

&lt;h3 id=&quot;can-we-do-better&quot;&gt;Can we do better?&lt;/h3&gt;

&lt;p&gt;We can also try to use &lt;a href=&quot;https://en.wikipedia.org/wiki/N-gram&quot;&gt;ngram&lt;/a&gt;s instead of single words.
We will use ngrams up to length 3:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ngram &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 21.42234 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;vocab &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; vocab &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  prune_vocabulary&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;term_count_min &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; doc_proportion_max &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; doc_proportion_min &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

corpus &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_vocab_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; vocabulary &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; vocab&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 32.06087 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; corpus &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  get_dtm &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  tfidf_transformer&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## idf scaling matrix not provided, calculating it form input matrix&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 25000 48462&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
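&lt;p&gt;The &lt;code&gt;tfidf_transformer&lt;/code&gt; step above down-weights terms that occur in many documents. As a language-neutral illustration, here is a minimal Python sketch of the standard tf-idf formula (toy dict-based data structures; this is not text2vec’s exact implementation):&lt;/p&gt;

```python
import math

def tfidf(dtm):
    """Toy TF-IDF: dtm is a list of per-document {term: count} dicts."""
    n_docs = len(dtm)
    df = {}  # document frequency of each term
    for doc in dtm:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    out = []
    for doc in dtm:
        total = sum(doc.values())
        # tf = count / doc length, idf = log(N / df)
        out.append({t: (c / total) * math.log(n_docs / df[t])
                    for t, c in doc.items()})
    return out

docs = [{"good": 2, "movie": 1}, {"bad": 1, "movie": 1}]
weighted = tfidf(docs)
# "movie" occurs in every document, so its weight drops to 0
```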

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                 family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;binomial&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 &lt;span class=&quot;c1&quot;&gt;# lasso penalty&lt;/span&gt;
                 alpha &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# interested in area under the ROC curve&lt;/span&gt;
                 type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# 5-fold cross-validation&lt;/span&gt;
                 nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# higher threshold value: less accurate, but faster training&lt;/span&gt;
                 thresh &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# again, a lower number of iterations for faster training&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# in this vignette&lt;/span&gt;
                 maxit &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 23.21233 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-11-09-text2vec/ngram_dtm_1-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.9566&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So we improved our model a little bit more. I leave further tuning to the reader.&lt;/p&gt;

&lt;h3 id=&quot;feature-hashing&quot;&gt;Feature hashing&lt;/h3&gt;

&lt;p&gt;If you haven’t heard about &lt;strong&gt;feature hashing&lt;/strong&gt; (also known as the &lt;strong&gt;hashing trick&lt;/strong&gt;), I recommend starting with the &lt;a href=&quot;https://en.wikipedia.org/wiki/Feature_hashing&quot;&gt;Wikipedia article&lt;/a&gt; and then reviewing the &lt;a href=&quot;http://alex.smola.org/papers/2009/Weinbergeretal09.pdf&quot;&gt;original paper&lt;/a&gt; by the Yahoo! research team. This technique is very fast because we don’t perform lookups in an associative array. Another benefit is a very low memory footprint: we can map an arbitrary number of features into a much more compact space. The method was popularized by Yahoo! and is widely used in &lt;a href=&quot;https://github.com/JohnLangford/vowpal_wabbit/&quot;&gt;Vowpal Wabbit&lt;/a&gt;.&lt;/p&gt;
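&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of the hashing trick. The &lt;code&gt;crc32&lt;/code&gt; hash and the dict-based sparse row are illustrative choices, not what text2vec actually uses:&lt;/p&gt;

```python
import zlib

HASH_SIZE = 2 ** 18  # number of columns; bounds memory regardless of vocabulary size

def hash_vectorize(tokens, hash_size=HASH_SIZE):
    """Map tokens directly to column indices - no dictionary lookup or storage."""
    row = {}
    for tok in tokens:
        # any fast deterministic hash works; crc32 is just a convenient stdlib choice
        idx = zlib.crc32(tok.encode("utf-8")) % hash_size
        row[idx] = row.get(idx, 0) + 1
    return row

row = hash_vectorize(["good", "movie", "good"])
```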

&lt;p&gt;Here I will demonstrate how to use feature hashing in &lt;strong&gt;text2vec&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

it &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; itoken&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; preprocess_function &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
             tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; word_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; chunks_number &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; progessbar &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

fh &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; feature_hasher&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;hash_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;18&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ngram &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

corpus &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_hash_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;it&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; feature_hasher &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; fh&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 12.53 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; corpus &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  get_dtm &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  tfidf_transformer&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## idf scaling matrix not provided, calculating it form input matrix&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1]  25000 262144&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;t1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; movie_review&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                 family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;binomial&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 &lt;span class=&quot;c1&quot;&gt;# lasso penalty&lt;/span&gt;
                 alpha &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# interested in area under the ROC curve&lt;/span&gt;
                 type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# 5-fold cross-validation&lt;/span&gt;
                 nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# higher threshold value: less accurate, but faster training&lt;/span&gt;
                 thresh &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# again, a lower number of iterations for faster training&lt;/span&gt;
                 &lt;span class=&quot;c1&quot;&gt;# in this vignette&lt;/span&gt;
                 maxit &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;difftime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Sys.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; t1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sec&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Time difference of 54.91197 secs&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-11-09-text2vec/hash_dtm-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.947&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, the AUC is slightly worse, but DTM construction time was considerably lower. On large collections of documents this can be a decisive advantage.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Introducing tmlite - new framework for text mining in R</title>
   <link href="http://dsnotes.com/blog/text2vec/2015/09/16/tmlite-intro"/>
   <updated>2015-09-16T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/text2vec/2015/09/16/tmlite-intro</id>
   <content type="html">
&lt;h1 id=&quot;important-note&quot;&gt;IMPORTANT NOTE&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Code from this post is outdated (package APIs were changed).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See &lt;a href=&quot;http://dsnotes.com/blog/2015/11/09/text2vec/&quot;&gt;this post&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Today I am pleased to present &lt;a href=&quot;https://github.com/dselivanov/tmlite&quot;&gt;tmlite&lt;/a&gt; - a small but fast and robust package for text-mining tasks in R. It is not yet available on CRAN, but you can install it directly from GitHub:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;devtools&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;install_github&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;dselivanov/tmlite&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A reasonable question is: why a new package? R already has the great &lt;a href=&quot;https://cran.r-project.org/web/packages/tm/&quot;&gt;tm&lt;/a&gt; package and its companions &lt;a href=&quot;https://cran.r-project.org/web/packages/tau/&quot;&gt;tau&lt;/a&gt; and &lt;a href=&quot;https://cran.r-project.org/web/packages/NLP/&quot;&gt;NLP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I’ll try to answer these questions in the &lt;a href=&quot;#reasons-why-i-started-develop-tmlite&quot;&gt;last part of the post&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;focus&quot;&gt;Focus&lt;/h2&gt;

&lt;p&gt;As the Unix philosophy says - &lt;a href=&quot;https://en.wikipedia.org/wiki/Unix_philosophy#Do_One_Thing_and_Do_It_Well&quot;&gt;Do One Thing and Do It Well&lt;/a&gt; - we will focus on one particular problem: infrastructure for text analysis. The R ecosystem contains lots of packages that are well suited for working with sparse high-dimensional data (and thus suitable for text modeling). Here are my favourites:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/lda/index.html&quot;&gt;lda&lt;/a&gt; - a blazing fast package for topic modeling.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/glmnet/index.html&quot;&gt;glmnet&lt;/a&gt; for L1- and L2-regularized linear models.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/xgboost/&quot;&gt;xgboost&lt;/a&gt; for gradient boosting.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/LiblineaR/index.html&quot;&gt;LiblineaR&lt;/a&gt; - a wrapper for the &lt;code&gt;liblinear&lt;/code&gt; SVM library.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/irlba/index.html&quot;&gt;irlba&lt;/a&gt; - a fast and memory-efficient method for computing a few approximate singular values and singular vectors of large matrices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all excellent and very efficient packages, so &lt;strong&gt;tmlite&lt;/strong&gt; will focus (at least in the near future) not on modeling but on infrastructure - Document-Term matrix construction and manipulation, the basis for any text-mining analysis. &lt;strong&gt;tmlite&lt;/strong&gt; is partially inspired by &lt;a href=&quot;https://radimrehurek.com/gensim/&quot;&gt;gensim&lt;/a&gt; - a robust and well-designed Python library for text mining. In the near future we will try to replicate some of its functionality.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/dselivanov/tmlite&quot;&gt;tmlite&lt;/a&gt; is &lt;strong&gt;designed for practitioners&lt;/strong&gt; (and kagglers!) who:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;understand what they want and how to do it. So we will not expose trivial high-level APIs like &lt;code&gt;findAssocs&lt;/code&gt;, &lt;code&gt;findFreqTerms&lt;/code&gt;, etc.&lt;/li&gt;
  &lt;li&gt;work with medium to large collections of documents&lt;/li&gt;
  &lt;li&gt;have at least a medium level of experience in R and know the basic concepts of functional programming&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;key-features&quot;&gt;Key features&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Note that the package is at a very early (alpha) stage. This doesn’t mean the package is not robust, but it does mean the API can change at any time.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Flexible and easy functional-style API. Easy chaining.&lt;/li&gt;
  &lt;li&gt;Efficient and &lt;strong&gt;memory-friendly streaming corpus construction&lt;/strong&gt;. &lt;strong&gt;tmlite&lt;/strong&gt; provides an API for constructing corpora from &lt;code&gt;character&lt;/code&gt; vectors and, more importantly, from &lt;code&gt;connection&lt;/code&gt;s.  &lt;a href=&quot;https://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html&quot;&gt;Read more about connections here&lt;/a&gt;. So it is possible (and easy!) to construct Document-Term matrices for collections of documents that don’t fit in memory.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fast&lt;/strong&gt; - core functions are written in C++, thanks to the &lt;a href=&quot;https://cran.r-project.org/web/packages/Rcpp/index.html&quot;&gt;Rcpp&lt;/a&gt; authors.&lt;/li&gt;
  &lt;li&gt;Has two main corpus classes -
    &lt;ul&gt;
      &lt;li&gt;&lt;code&gt;DictCorpus&lt;/code&gt; - traditional dictionary-based container used for Document-Term matrix construction.&lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;&lt;code&gt;HashCorpus&lt;/code&gt; - container that implements &lt;a href=&quot;https://en.wikipedia.org/wiki/Feature_hashing&quot;&gt;feature hashing&lt;/a&gt; or &lt;strong&gt;“hashing trick”&lt;/strong&gt;. Similar to &lt;a href=&quot;http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing&quot;&gt;scikit-learn FeatureHasher&lt;/a&gt; and  &lt;a href=&quot;https://radimrehurek.com/gensim/corpora/hashdictionary.html&quot;&gt;gensim corpora.hashdictionary&lt;/a&gt;.&lt;/p&gt;

        &lt;blockquote&gt;
          &lt;p&gt;The class HashCorpus is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”. Instead of building a hash table of the features encountered in training, as the vectorizers do, instances of HashCorpus apply a hash function to the features to determine their column index in sample matrices directly.&lt;/p&gt;
        &lt;/blockquote&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The Document-Term matrix is the key object. At the moment it can be extracted from a corpus into &lt;code&gt;dgCMatrix&lt;/code&gt;, &lt;code&gt;dgTMatrix&lt;/code&gt;, or &lt;a href=&quot;https://www.cs.princeton.edu/~blei/lda-c/readme.txt&quot;&gt;LDA-C&lt;/a&gt; format, which is standard for the &lt;a href=&quot;https://cran.r-project.org/web/packages/lda/index.html&quot;&gt;lda&lt;/a&gt; package. &lt;code&gt;dgCMatrix&lt;/code&gt; is the default for sparse matrices in R, and most packages that work with sparse matrices accept &lt;code&gt;dgCMatrix&lt;/code&gt;, so it is easy to interoperate with them.&lt;/li&gt;
&lt;/ol&gt;
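&lt;p&gt;The difference between the two sparse formats above can be sketched in a few lines: &lt;code&gt;dgTMatrix&lt;/code&gt; stores (row, column, value) triplets, while &lt;code&gt;dgCMatrix&lt;/code&gt; compresses the column indices into a pointer array. A Python illustration of the conversion (the array names mirror the general CSC layout, not the Matrix package’s internals):&lt;/p&gt;

```python
def triplet_to_csc(rows, cols, vals, ncol):
    """Convert COO triplets (dgTMatrix-style) to CSC arrays (dgCMatrix-style)."""
    # sort entries column-major, as CSC requires
    order = sorted(range(len(vals)), key=lambda k: (cols[k], rows[k]))
    row_idx = [rows[k] for k in order]
    data = [vals[k] for k in order]
    # col_ptr[j] .. col_ptr[j+1] delimits the entries of column j
    col_ptr = [0] * (ncol + 1)
    for c in cols:
        col_ptr[c + 1] += 1
    for j in range(ncol):
        col_ptr[j + 1] += col_ptr[j]
    return row_idx, col_ptr, data

# a 2x3 matrix with nonzeros (0,0)=1, (1,2)=5, (0,2)=3
row_idx, col_ptr, data = triplet_to_csc([0, 1, 0], [0, 2, 2], [1, 5, 3], ncol=3)
```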

&lt;h2 id=&quot;quick-reference&quot;&gt;Quick reference&lt;/h2&gt;
&lt;p&gt;The first quick example is based on Kaggle’s &lt;a href=&quot;https://www.kaggle.com/c/word2vec-nlp-tutorial&quot;&gt;Bag of Words Meets Bags of Popcorn&lt;/a&gt; competition data - &lt;a href=&quot;https://www.kaggle.com/c/word2vec-nlp-tutorial/download/labeledTrainData.tsv.zip&quot;&gt;labeledTrainData.tsv.zip&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here I’ll demonstrate the flexibility of the corpus creation procedure and how to vectorize a large collection of documents.&lt;/p&gt;

&lt;p&gt;Suppose the text file is very large and contains 3 tab-separated columns, only one of which is relevant (the third column in the example below). We want to create a corpus but can’t read the whole file into memory. Here is how this is resolved. 
First, load the libraries:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;methods&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tmlite&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Loading required package: Matrix&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# for pipe syntax&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;magrittr&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The file contains 3 columns - &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;sentiment&lt;/code&gt;, and &lt;code&gt;review&lt;/code&gt;. Only &lt;code&gt;review&lt;/code&gt; is relevant.&lt;/p&gt;

&lt;p&gt;A simple preprocessing function will do the trick for us - we will read only the third column, the text of the review.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# function receives character vector - batch of rows.&lt;/span&gt;
preprocess_fun &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# file is tab-separated - split each row by \t&lt;/span&gt;
  rows &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;strsplit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;\t&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; fixed &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# text review is in the third column&lt;/span&gt;
  txt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;rows&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; x&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# tolower, keep only letters&lt;/span&gt;
  simple_preprocess&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;txt&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Read documents and create &lt;strong&gt;dictionary-based&lt;/strong&gt; corpus:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# we don&amp;#39;t want to read the whole file into RAM - we will read it iteratively, row by row&lt;/span&gt;
path &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/labeledTrainData.tsv&amp;#39;&lt;/span&gt;
con &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;path&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; open &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; blocking &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
corp &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_dict_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;src &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; con&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   preprocess_fun &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; preprocess_fun&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# simple_tokenizer - split string by whitespace&lt;/span&gt;
                   tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# read by batch of 1000 documents&lt;/span&gt;
                   batch_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;c1&quot;&gt;# skip first row - header&lt;/span&gt;
                   skip &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# do not show progress bar because of knitr&lt;/span&gt;
                   progress &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;
                  &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we want to try to predict sentiment based on the review text. For that we will use the &lt;strong&gt;glmnet&lt;/strong&gt; package, so we have to create a Document-Term matrix in &lt;code&gt;dgCMatrix&lt;/code&gt; format. This is easy with the &lt;code&gt;get_dtm&lt;/code&gt; function:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; corp&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;dgCMatrix&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  &lt;span class=&quot;c1&quot;&gt;# remove very common and very uncommon words&lt;/span&gt;
  dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;filter_commons_transformer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; term_freq &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;common &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; uncommon &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.975&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  &lt;span class=&quot;c1&quot;&gt;# make tf-idf transformation&lt;/span&gt;
  dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tfidf_transformer&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 25000 10067&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Cool. We have the feature matrix, but not the response variable, which still sits in the large file (one that possibly won’t fit into memory). Fortunately, reading particular columns is easy; for example, see this &lt;a href=&quot;http://stackoverflow.com/questions/2193742/ways-to-read-only-select-columns-from-a-file-into-r-a-happy-medium-between-re&quot;&gt;stackoverflow discussion&lt;/a&gt;. We will use the &lt;code&gt;fread()&lt;/code&gt; function from the &lt;strong&gt;data.table&lt;/strong&gt; package:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;data.table&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# read only second column - value of sentiment&lt;/span&gt;
dt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; fread&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;path&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; select &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now everything is ready for model fitting.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;glmnet&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Loading required package: foreach
## Loaded glmnet 2.0-2&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# I have 4 core machine, so will use parallel backend for n-fold crossvalidation&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;doParallel&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Loading required package: iterators
## Loading required package: parallel&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;registerDoParallel&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# train logistic regression with 4-fold cross-validation, maximizing AUC&lt;/span&gt;
fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dt&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                 family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;binomial&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                 nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; parallel &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-09-16-tmlite-intro/dict_dtm_fit-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.9483&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
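
&lt;p&gt;As a side note, the fitted &lt;code&gt;cv.glmnet&lt;/code&gt; object can be reused for prediction in the usual &lt;strong&gt;glmnet&lt;/strong&gt; way. A minimal sketch, assuming the &lt;code&gt;fit&lt;/code&gt; and &lt;code&gt;dtm&lt;/code&gt; objects from above:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# predicted probabilities at the lambda with the best cross-validated AUC
preds &amp;lt;- predict(fit, newx = dtm, s = &amp;quot;lambda.min&amp;quot;, type = &amp;quot;response&amp;quot;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;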

&lt;p&gt;Not bad!
Now let’s try to construct the &lt;strong&gt;dtm&lt;/strong&gt; using the &lt;code&gt;HashCorpus&lt;/code&gt; class. Our data is tiny, but for larger data or streaming environments, &lt;code&gt;HashCorpus&lt;/code&gt; is the natural choice. Read the documents and create a &lt;strong&gt;hash-based&lt;/strong&gt; corpus:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;con &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;path&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; open &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; blocking &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
hash_corp &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_hash_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;src &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; con&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                           preprocess_fun &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; preprocess_fun&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                           &lt;span class=&quot;c1&quot;&gt;# simple_tokenizer - split string by whitespace&lt;/span&gt;
                           tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                           &lt;span class=&quot;c1&quot;&gt;# read by batch of 1000 documents&lt;/span&gt;
                           batch_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                           &lt;span class=&quot;c1&quot;&gt;# skip first row - header&lt;/span&gt;
                           skip &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                           &lt;span class=&quot;c1&quot;&gt;# don&amp;#39;t show progress bar because of knitr&lt;/span&gt;
                           progress &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
hash_dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; hash_corp&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;dgCMatrix&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;filter_commons_transformer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; term_freq &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;common &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; uncommon &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.975&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tfidf_transformer&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# note that ncol(hash_dtm) &amp;gt; ncol(dtm) - an effect of collisions. We can fix this by increasing the `hash_size` parameter.&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;hash_dtm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 25000 10107&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;registerDoParallel&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
hash_fit &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; cv.glmnet&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; hash_dtm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; dt&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;sentiment&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; 
                      family &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;binomial&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type.measure &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;auc&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                      nfolds &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; parallel &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
plot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;hash_fit&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/../images/2015-09-16-tmlite-intro/hash_dtm_fit-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# near the same result&lt;/span&gt;
print &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;max AUC = &amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;hash_fit&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;cvm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;max AUC =  0.9481&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
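
&lt;p&gt;If collisions become a problem, the hash space can be enlarged. The following is a sketch, not tested code: it assumes &lt;code&gt;create_hash_corpus&lt;/code&gt; accepts the &lt;code&gt;hash_size&lt;/code&gt; parameter mentioned in the comment above, and the value &lt;code&gt;2 ^ 24&lt;/code&gt; is just an illustrative choice:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;con &amp;lt;- file(path, open = &amp;#39;r&amp;#39;, blocking = F)
# a larger hash space means fewer collisions, at the cost of a wider dtm
hash_corp_big &amp;lt;- create_hash_corpus(src = con,
                                    preprocess_fun = preprocess_fun,
                                    tokenizer = simple_tokenizer,
                                    batch_size = 1000, skip = 1, progress = F,
                                    hash_size = 2 ^ 24)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;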

&lt;h2 id=&quot;future-work&quot;&gt;Future work&lt;/h2&gt;
&lt;p&gt;The project has an &lt;a href=&quot;https://github.com/dselivanov/tmlite/issues&quot;&gt;issue tracker on github&lt;/a&gt; where I’m filing feature requests and notes for future work. Any ideas are greatly appreciated.&lt;/p&gt;

&lt;p&gt;If you like it, you can &lt;strong&gt;help&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Test and leave feedback on the &lt;a href=&quot;https://github.com/dselivanov/tmlite/issues&quot;&gt;github issue tracker&lt;/a&gt; (preferably) or directly by email.
    &lt;ul&gt;
      &lt;li&gt;the package is tested on Linux and OS X, so Windows users are especially welcome&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Fork and start contributing. Vignettes, docs, tests, use cases are very welcome.&lt;/li&gt;
  &lt;li&gt;Or just give me a star on &lt;a href=&quot;https://github.com/dselivanov/tmlite&quot;&gt;project page&lt;/a&gt; :-)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;short-term-plans&quot;&gt;Short-term plans&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;add tests&lt;/li&gt;
  &lt;li&gt;add n-gram tokenizers&lt;/li&gt;
  &lt;li&gt;add methods for tokenization in C++ (at the moment tokenization takes almost half of the runtime)&lt;/li&gt;
  &lt;li&gt;switch to the murmur3 hash and add a second hash function to reduce the probability of collisions&lt;/li&gt;
  &lt;li&gt;push dictionary and stopwords filtering into C++ code&lt;/li&gt;
&lt;/ul&gt;
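
&lt;p&gt;To illustrate why a second hash function helps: if one hash picks the bucket and another picks the sign, colliding terms tend to cancel rather than pile up. A toy illustration in plain R with made-up hash functions (not tmlite internals and not murmur3):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;# naive polynomial string hash - picks one of m buckets
h1 &amp;lt;- function(s, m) {
  codes &amp;lt;- utf8ToInt(s)
  sum(codes * 31 ^ (seq_along(codes) - 1)) %% m + 1
}
# second hash - picks the sign, so colliding terms partially cancel
h2 &amp;lt;- function(s) if (sum(utf8ToInt(s)) %% 2 == 0) 1 else -1
m &amp;lt;- 16
x &amp;lt;- numeric(m)
for (w in c(&amp;quot;good&amp;quot;, &amp;quot;movie&amp;quot;, &amp;quot;good&amp;quot;)) {
  i &amp;lt;- h1(w, m)
  x[i] &amp;lt;- x[i] + h2(w)
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;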

&lt;h3 id=&quot;middle-term-plans&quot;&gt;Middle-term plans&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;add a &lt;strong&gt;&lt;a href=&quot;https://code.google.com/p/word2vec/&quot;&gt;word2vec&lt;/a&gt; wrapper&lt;/strong&gt;. It is strange that the R community still doesn’t have one.&lt;/li&gt;
  &lt;li&gt;add corpus serialization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;long-term-plans&quot;&gt;Long-term plans&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;integrate models the way it is done in &lt;a href=&quot;https://radimrehurek.com/gensim/&quot;&gt;gensim&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;try to implement out-of-core transformations like &lt;a href=&quot;https://radimrehurek.com/gensim/&quot;&gt;gensim&lt;/a&gt; does&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;reasons-why-i-started-develop-tmlite&quot;&gt;Reasons why I started to develop tmlite&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;All conclusions below are based on personal experience, so they may be heavily biased.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first time I used &lt;strong&gt;tm&lt;/strong&gt; was at the end of 2014. I tried to process a collection of text documents that was less than 1 Gb - about 10000 texts. Surprisingly, I wasn’t able to process them on a machine with 16 Gb of RAM! But what is really cool is that R and all its packages are open source, so I started to examine the source code. Unfortunately, I ended up rewriting most of the package. That first version (anyone interested can browse the commit history on github) was quite robust and could handle such tiny-to-medium collections of documents. After that I tried it on some kaggle competitions, but didn’t do any new development, since my work wasn’t related to text analysis and I had no time for it. I also noted that almost all text-mining packages in R have a &lt;strong&gt;tm&lt;/strong&gt; dependency. We will try to develop an alternative.&lt;/p&gt;

&lt;p&gt;About a month ago I started a full redesign (based on the previous experience); by now I have rewritten the core functions in C++ and want to bring an alpha version to the community.&lt;/p&gt;

&lt;p&gt;So why should you not use &lt;strong&gt;tm&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;tm&lt;/strong&gt; has a lot of functions - in fact, the reference manual contains more than 50 pages. But its &lt;strong&gt;API is very messy&lt;/strong&gt;. A lot of packages depend on it, so it is hard to redesign.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;tm&lt;/strong&gt; is not very efficient (in my experience). I found it &lt;strong&gt;very slow&lt;/strong&gt; and, more importantly, &lt;strong&gt;very RAM-unfriendly and RAM-greedy&lt;/strong&gt; (I’ll provide a few examples below). As I understand it, it is designed more for academic researchers than for data science practitioners. It handles metadata perfectly and processes different encodings. The API is very high-level, but the price for that is performance.&lt;/li&gt;
  &lt;li&gt;It can only &lt;strong&gt;handle documents that fit in RAM&lt;/strong&gt;. (To be fair, there is a &lt;code&gt;PCorpus()&lt;/code&gt; function, but it seems it cannot help with Document-Term matrix construction when the size of the documents is larger than RAM - see the examples below. &lt;code&gt;DocumentTermMatrix()&lt;/code&gt; is very RAM-greedy.)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;comparison-with-tm&quot;&gt;Comparison with tm&lt;/h2&gt;

&lt;h3 id=&quot;some-naive-benchmarks-on-document-trem-matrix-construction&quot;&gt;Some naive benchmarks on Document-Term matrix construction&lt;/h3&gt;

&lt;p&gt;Here I’ll provide a simple benchmark which can give some impression of &lt;strong&gt;tmlite&lt;/strong&gt;’s speed compared to &lt;strong&gt;tm&lt;/strong&gt;. For now we assume that the documents are already in memory, so we only need to clean and tokenize the text:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Loading required package: NLP&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;data.table&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tmlite&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
dt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; fread&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/labeledTrainData.tsv&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
txt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dt&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;review&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;object.size&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;txt&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; quote &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Mb&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 32.8 Mb&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# 32.8 Mb&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corpus_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; VCorpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;VectorSource&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;txt&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   2.081   0.011   2.095&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;object.size&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; quote &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Mb&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 121.4 Mb&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# 121.4 Mb!!!&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corpus_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tm_map&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; content_transformer&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;simple_preprocess&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  10.761   0.281   6.591&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; DocumentTermMatrix&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; control &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tokenize &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; words&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  15.002   0.740  12.227&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now let’s check the timings for &lt;strong&gt;tmlite&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corp &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_dict_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;src &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; txt&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   preprocess_fun &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_preprocess&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# simple_tokenizer - split string by whitespace&lt;/span&gt;
                   tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# read by batch of 5000 documents&lt;/span&gt;
                   batch_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# do not show progress bar because of knitr&lt;/span&gt;
                   progress &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  10.127   0.079  10.224&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# get in dgTMatrix form, because tm stores dtm matrix in triplet form&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; corp&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;dgTMatrix&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.042   0.008   0.050&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Well, &lt;strong&gt;only two times faster&lt;/strong&gt;. Is it worth the effort? Let’s check another example. Here we will use &lt;a href=&quot;https://d396qusza40orc.cloudfront.net/mmds/datasets/sentences.txt.zip&quot;&gt;data&lt;/a&gt; from the excellent &lt;a href=&quot;https://www.coursera.org/course/mmds&quot;&gt;Mining massive datasets&lt;/a&gt; course. This is quite a large collection of short texts - more than 9 million rows, 500Mb zipped and about 1.4Gb unzipped.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# we will read only small fraction - 200000 rows (~ 42Mb)&lt;/span&gt;
txt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;readLines&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;~/Downloads/sentences.txt&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; n &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2e5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;object.size&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;txt&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; quote &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Mb&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 41.7 Mb&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# 41.7 Mb&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# VCorpus is very slow, about 20 sec on my computer&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corpus_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; VCorpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;VectorSource&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;txt&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  19.340   0.204  19.573&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;object.size&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; quote &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; units &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Mb&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 749.8 Mb&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# 749.8 Mb!!! wow!&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corpus_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; tm_map&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; content_transformer&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;simple_preprocess&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  20.629   1.487  29.161&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# 26 sec to process 42 Mb of text.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But the following is truly absurd. This call forks 2 processes (because it uses mclapply internally), and each process uses 1.3 Gb of RAM: &lt;strong&gt;2.6 Gb of RAM to process a 42 Mb text chunk&lt;/strong&gt;. And it takes more than 50 seconds on my MacBook Pro with the latest Core i7 Intel chip. In fact, it is not possible to process 1 million rows (200 Mb) on my MacBook Pro with 16 Gb of RAM.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; DocumentTermMatrix&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; control &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tokenize &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; words&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  99.256   3.884  53.380&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Compare with &lt;strong&gt;tmlite&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; corp &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; create_dict_corpus&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;src &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; txt&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   preprocess_fun &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_preprocess&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# simple_tokenizer - split string by whitespace&lt;/span&gt;
                   tokenizer &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; simple_tokenizer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# read by batch of 5000 documents&lt;/span&gt;
                   batch_size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                   &lt;span class=&quot;c1&quot;&gt;# do not show progress bar because of knitr&lt;/span&gt;
                   progress &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##  10.025   0.050  10.081&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# only around 9 sec and 120 Mb of RAM&lt;/span&gt;
system.time &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tmlite &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; get_dtm&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;corpus &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; corp&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;dgTMatrix&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.116   0.016   0.133&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# less than 1 second&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So here &lt;strong&gt;tmlite is 8 times faster&lt;/strong&gt; and, what is much more important, &lt;strong&gt;consumes 20 times less RAM&lt;/strong&gt;. On large collections of documents the speedup will be even more significant.&lt;/p&gt;

&lt;h2 id=&quot;document-term-matrix-manipulations&quot;&gt;Document-Term Matrix manipulations&lt;/h2&gt;
&lt;p&gt;In practice it can be useful to remove common and uncommon terms. Both packages provide functions for that: &lt;code&gt;removeSparseTerms()&lt;/code&gt; in &lt;strong&gt;tm&lt;/strong&gt; and &lt;code&gt;dtm_remove_common_terms&lt;/code&gt; in &lt;strong&gt;tmlite&lt;/strong&gt;. Also note that &lt;code&gt;removeSparseTerms()&lt;/code&gt; can only remove uncommon terms, so to be fair we will test only that functionality:&lt;/p&gt;
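&lt;p&gt;To make the comparison concrete, here is a hypothetical sketch of document-frequency based filtering on a toy sparse matrix. The thresholds and variable names are illustrative only and are not the actual internals of either package:&lt;/p&gt;

```r
# Hypothetical sketch: drop terms by document frequency.
# Thresholds and names are illustrative, not either package's internals.
library(Matrix)

# toy 3-document, 3-term document-term matrix
dtm = sparseMatrix(i = c(1, 1, 2, 2, 3),
                   j = c(1, 2, 1, 3, 1),
                   x = 1,
                   dims = c(3, 3),
                   dimnames = list(NULL, c("the", "cat", "sat")))

# share of documents in which each term occurs
doc_freq = colSums(dtm > 0) / nrow(dtm)

common   = 0.9  # drop terms present in more than 90% of documents
uncommon = 0.3  # drop terms present in fewer than 30% of documents

drop_term   = (doc_freq > common) | !(doc_freq >= uncommon)
dtm_reduced = dtm[, !drop_term, drop = FALSE]
colnames(dtm_reduced)  # "cat" "sat"
```

&lt;p&gt;The same idea scales to real corpora, because both &lt;code&gt;colSums&lt;/code&gt; and column subsetting stay within sparse-matrix representations.&lt;/p&gt;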

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tm_reduced &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; removeSparseTerms&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.99&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   1.422   0.104   1.535&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# common = 1 =&amp;gt; do not remove common terms&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tmlite_reduced &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dtm_tmlite &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
               dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;filter_commons_transformer&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; term_freq &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;common &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; uncommon &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.975&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.350   0.081   0.431&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;3-5 times faster - not bad. 
Now compare tf-idf transformation:&lt;/p&gt;
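&lt;p&gt;As a reminder of what both functions compute: tf-idf scales each term count by the inverse of how many documents contain that term. Here is a hand-rolled sketch on a toy matrix; the exact weighting and normalization used by &lt;code&gt;weightTfIdf&lt;/code&gt; and &lt;code&gt;tfidf_transformer&lt;/code&gt; may differ in details:&lt;/p&gt;

```r
# Hand-rolled tf-idf on a toy sparse matrix; the exact normalization
# in the benchmarked packages may differ.
library(Matrix)

dtm = sparseMatrix(i = c(1, 1, 2, 2, 3),
                   j = c(1, 2, 1, 3, 2),
                   x = c(2, 1, 1, 3, 1),
                   dims = c(3, 3))

tf    = dtm / rowSums(dtm)         # term frequency, normalized per document
df    = colSums(dtm > 0)           # number of documents containing each term
idf   = log(nrow(dtm) / df)        # inverse document frequency
tfidf = tf %*% Diagonal(x = idf)   # scale column j by idf[j]
```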

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tm_tfidf &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; weightTfIdf&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm_tm&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; normalize &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Warning in weightTfIdf(dtm_tm, normalize = T): empty document(s): 6782
## 26135 26136 26137 26138 26139 26140 26141 26142 26143 26144 26145 27664
## 60895 60896 60897 60898 60899 60900 88953 106921 122685 141442 141443
## 141449 141454 152656&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.246   0.028   0.274&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# timings slightly greater than weightTfIdf, because all transformations are optimized for &lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# the dgCMatrix format, which is the standard for sparse matrices in R&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tmlite_tfidf &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dtm_tmlite &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
               dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tfidf_transformer&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.390   0.091   0.481&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# for dtm in dgCMatrix timings should be equal&lt;/span&gt;
dtm_tmlite_dgc&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  as&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dtm_tmlite&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;dgCMatrix&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; dtm_tmlite_tfidf &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dtm_tmlite_dgc &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
               dtm_transform&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;tfidf_transformer&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.252   0.049   0.302&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Equal timings. Great (and a surprise for me): within the last year the &lt;strong&gt;tm&lt;/strong&gt; authors have significantly improved its performance!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Working with MS SQL server on non-windows systems</title>
   <link href="http://dsnotes.com/blog/2015/07/16/r-and-mssql"/>
   <updated>2015-07-16T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/2015/07/16/r-and-mssql</id>
   <content type="html">
&lt;p&gt;As far as I know, there are a few options for connecting from R to MS SQL Server:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/RODBC/index.html&quot;&gt;RODBC&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cran.r-project.org/web/packages/RJDBC/index.html&quot;&gt;RJDBC&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/agstudy/rsqlserver&quot;&gt;rsqlserver&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But only the second option can be used on &lt;strong&gt;mac&lt;/strong&gt; and &lt;strong&gt;linux&lt;/strong&gt; machines. Here is a nice &lt;a href=&quot;http://stackoverflow.com/questions/14513224/connecting-to-ms-sql-server-from-r-on-mac-linux&quot;&gt;stackoverflow thread&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Most people suggest using the &lt;a href=&quot;https://www.microsoft.com/en-us/download/confirmation.aspx?id=11774&quot;&gt;microsoft sql java driver&lt;/a&gt;. But there is one case where this will not help: &lt;strong&gt;windows domain authentication&lt;/strong&gt;. In this situation the only working solution I found is the excellent &lt;a href=&quot;http://jtds.sourceforge.net/&quot;&gt;jTDS&lt;/a&gt; driver. It not only solves this problem, but also &lt;a href=&quot;http://jtds.sourceforge.net/benchTest.html&quot;&gt;outperforms&lt;/a&gt; the Microsoft JDBC Driver.&lt;/p&gt;

&lt;p&gt;So to use it you have to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Install &lt;a href=&quot;https://cran.r-project.org/web/packages/rJava/&quot;&gt;rJava&lt;/a&gt;. There are a lot of manuals for different OSes on the internet.&lt;/li&gt;
  &lt;li&gt;Install &lt;a href=&quot;https://cran.r-project.org/web/packages/RJDBC/&quot;&gt;RJDBC&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Download jTDS from the &lt;a href=&quot;http://sourceforge.net/projects/jtds/files/&quot;&gt;official site&lt;/a&gt; and unpack it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now you can easily connect to your source:&lt;br /&gt;
&lt;em&gt;(assuming jtds-1.3.1, unpacked into ~/bin)&lt;/em&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;drv &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; JDBC&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;net.sourceforge.jtds.jdbc.Driver&amp;quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
            &lt;span class=&quot;s&quot;&gt;&amp;quot;~/bin/jtds-1.3.1-dist/jtds-1.3.1.jar&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
mssql_addr &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;10.0.0.1&amp;quot;&lt;/span&gt;
mssql_port &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;1433&amp;quot;&lt;/span&gt;
domain &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;YOUR_DOMAIN&amp;quot;&lt;/span&gt;
connection_string &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;paste0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;jdbc:jtds:sqlserver://&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; mssql_addr&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;:&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; mssql_port&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                            &lt;span class=&quot;s&quot;&gt;&amp;quot;;domain=&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; domain&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
conn &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dbConnect&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;drv&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                  connection_string&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                  user &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;user_name&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                  password &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;********&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
query &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;select count(*) from your_db.dbo.your_table&amp;quot;&lt;/span&gt;
cnt &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; dbGetQuery&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;conn &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; conn&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; statement &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; query&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

</content>
 </entry>
 
 <entry>
   <title>Installing cuda toolkit and related R packages</title>
   <link href="http://dsnotes.com/blog/2015/06/04/installing-cuda-toolkit-and-gputools"/>
   <updated>2015-06-04T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/2015/06/04/installing-cuda-toolkit-and-gputools</id>
   <content type="html">
&lt;p&gt;The main purpose of this post is to keep all the steps of installing the cuda toolkit (and related R packages) in one place. I also hope this may be useful for someone.&lt;/p&gt;

&lt;h2 id=&quot;installing-cuda-toolkit--ubuntu-&quot;&gt;Installing cuda toolkit ( Ubuntu )&lt;/h2&gt;
&lt;p&gt;First of all we need to install the &lt;strong&gt;nvidia cuda toolkit&lt;/strong&gt;. I am on the latest ubuntu 15.04, and found &lt;a href=&quot;http://www.r-tutor.com/gpu-computing/cuda-installation/cuda7.0-ubuntu&quot;&gt;this article&lt;/a&gt; well suited for me. But there are a few additions:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;It is very important to have no nvidia drivers installed beforehand (at first I corrupted my system and had to reinstall it :-( ). So I recommend switching to a real terminal (&lt;code&gt;ctrl + alt + f1&lt;/code&gt;), removing all nvidia stuff with &lt;code&gt;sudo apt-get purge nvidia-*&lt;/code&gt; and then following the steps from the article above.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This will install cuda toolkit and corresponding nvidia drivers.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1410/x86_64/cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo apt-get update
sudo apt-get install cuda&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;After installation we need to modify our &lt;code&gt;.bashrc&lt;/code&gt; file. Add the following lines:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CUDA_HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/usr/local/cuda-7.0
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CUDA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/lib64

&lt;span class=&quot;nv&quot;&gt;PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CUDA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/bin:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PATH&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CUDA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/bin/nvcc:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PATH&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;PATH&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that I added the path to the &lt;code&gt;nvcc&lt;/code&gt; compiler.&lt;/p&gt;

&lt;h2 id=&quot;installing-gputools&quot;&gt;Installing gputools&lt;/h2&gt;
&lt;p&gt;First simply try:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;install.packages&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;gputools&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; repos &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;http://cran.rstudio.com/&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After that I received:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Unsupported gpu architecture ‘compute_10’&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While solving this issue I found this &lt;a href=&quot;https://devtalk.nvidia.com/default/topic/606195/-solved-nvcc-fatal-unsupported-gpu-architecture-compute_21-/&quot;&gt;link&lt;/a&gt; useful. 
I have a gt525m card, which has compute capability 2.1. You can check your GPU’s capabilities &lt;a href=&quot;https://developer.nvidia.com/cuda-gpus&quot;&gt;here&lt;/a&gt;. 
So I downloaded the gputools source package:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; ~
wget http://cran.r-project.org/src/contrib/gputools_0.28.tar.gz
tar -zxvf gputools_0.28.tar.gz&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and replaced the following string&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;NVCC :&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;$(&lt;/span&gt;CUDA_HOME&lt;span class=&quot;k&quot;&gt;)&lt;/span&gt;/bin/nvcc -gencode &lt;span class=&quot;nv&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;compute_10,code&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sm_10 -gencode &lt;span class=&quot;nv&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;compute_13,code&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sm_13 -gencode &lt;span class=&quot;nv&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;compute_20,code&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sm_20 -gencode &lt;span class=&quot;nv&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;compute_30,code&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sm_30&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;in &lt;code&gt;gputools/src/Makefile&lt;/code&gt; with&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;NVCC :&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;$(&lt;/span&gt;CUDA_HOME&lt;span class=&quot;k&quot;&gt;)&lt;/span&gt;/bin/nvcc -gencode &lt;span class=&quot;nv&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;compute_20,code&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sm_21&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Next, gzip it back and install from source:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;install.packages&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;~/gputools.tar.gz&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; repos &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; type &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;source&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then I received:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;rinterface.cu:1:14: fatal error: R.h: No such file or directory #include &amp;lt;R.h&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We have to adjust the R header directory location. First of all, look for &lt;code&gt;R.h&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;locate &lt;span class=&quot;se&quot;&gt;\/&lt;/span&gt;R.h&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;replace the string &lt;code&gt;R_INC := $(R_HOME)/include&lt;/code&gt; in &lt;code&gt;gputools/src/config.mk&lt;/code&gt; with the path found:
&lt;code&gt;
R_INC := /usr/share/R/include
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In case we receive an error regarding the shared &lt;code&gt;libcublas.so&lt;/code&gt; library, we also need to create a symlink for &lt;code&gt;libcublas&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;sudo ln -s /usr/local/cuda/lib64/libcublas.so.7.0 /usr/lib/libcublas.so.7.0&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;thanks to this &lt;a href=&quot;http://stackoverflow.com/questions/10808958/why-cant-libcudart-so-4-be-found-when-compiling-the-cuda-samples-under-ubuntu&quot;&gt;thread&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;testing-performance&quot;&gt;Testing performance&lt;/h2&gt;
&lt;p&gt;Here is a simple benchmark:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;gputools&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
N &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e3&lt;/span&gt;
m &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; N&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;N&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; replace &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; nrow &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; N&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dist&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   4.864   0.008   4.874&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;system.time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;gpuDist&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    user  system elapsed 
##   0.640   0.168   0.809&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

</content>
 </entry>
 
 <entry>
   <title>Locality Sensitive Hashing In R Part 1</title>
   <link href="http://dsnotes.com/blog/2015/01/02/locality-sensitive-hashing-in-r-part-1"/>
   <updated>2015-01-02T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/2015/01/02/locality-sensitive-hashing-in-r-part-1</id>
   <content type="html">
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In the next series of posts I will try to explain the basic concepts of the &lt;strong&gt;Locality Sensitive Hashing&lt;/strong&gt; technique.&lt;/p&gt;

&lt;p&gt;Note that I will try to follow a generally functional programming style, so I will use R’s &lt;a href=&quot;https://stat.ethz.ch/R-manual/R-devel/library/base/html/funprog.html&quot;&gt;Higher-Order Functions&lt;/a&gt; instead of the traditional &lt;strong&gt;R’s &lt;em&gt;*apply&lt;/em&gt;&lt;/strong&gt; functions family (I suppose this makes the post more readable for non-R users). I will also use the &lt;strong&gt;brilliant pipe operator&lt;/strong&gt; &lt;code&gt;%&amp;gt;%&lt;/code&gt; from the &lt;a href=&quot;http://cran.r-project.org/web/packages/magrittr/&quot;&gt;magrittr&lt;/a&gt; package. We will start with basic concepts, but end with a very efficient implementation in R (about 100 times faster than the python implementations I found).&lt;/p&gt;
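&lt;p&gt;A tiny illustration of this style, before any LSH-specific code (the toy data is purely illustrative):&lt;/p&gt;

```r
# Higher-order functions plus magrittr's pipe, instead of the *apply family
library(magrittr)

squares = Map(function(x) x * x, 1:5)            # a list: 1, 4, 9, 16, 25
evens   = Filter(function(x) x %% 2 == 0, 1:10)  # 2 4 6 8 10
total   = 1:10 %>% Filter(function(x) x > 5, .) %>% Reduce(`+`, .)
total                                            # 40
```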

&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;
&lt;p&gt;Imagine the following interesting problem. We have two &lt;strong&gt;very large&lt;/strong&gt; social networks (for example &lt;strong&gt;facebook and google+&lt;/strong&gt;), each with hundreds of millions of profiles, and we want to determine which profiles are owned by the same person. One reasonable approach is to assume that such a person has nearly the same, or at least highly overlapping, sets of friends in both networks. One well-known measure of the similarity of two sets is the &lt;a href=&quot;http://en.wikipedia.org/wiki/Jaccard_index&quot;&gt;Jaccard Index&lt;/a&gt;:&lt;br /&gt;
&lt;script type=&quot;math/tex&quot;&gt;J(SET_1, SET_2) = {|SET_1 \cap SET_2|\over |SET_1 \cup SET_2| }&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Set operations are computationally cheap, so this straightforward solution seems quite good. But let’s try to estimate the computation time for duplicate detection among only the people named “John Smith”. Imagine that on average each person has 100 friends:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# for reproducible results&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;set.seed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;seed &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;17&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;microbenchmark&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# we will use brilliant pipe operator %&amp;gt;%&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;magrittr&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
jaccard &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  set_intersection &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;intersect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
  set_union &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;union&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; y&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
  &lt;span class=&quot;kr&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set_intersection &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; set_union&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# generate &amp;quot;lastnames&amp;quot;&lt;/span&gt;
lastnames &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;letters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; collapse &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1e5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;unique&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;head&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;lastnames&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [[1]]
## [1] &amp;quot;eyl&amp;quot;
## 
## [[2]]
## [1] &amp;quot;ukm&amp;quot;
## 
## [[3]]
## [1] &amp;quot;fes&amp;quot;
## 
## [[4]]
## [1] &amp;quot;fka&amp;quot;
## 
## [[5]]
## [1] &amp;quot;vuw&amp;quot;
## 
## [[6]]
## [1] &amp;quot;ypg&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;friends_set_1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;lastnames&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; replace &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
friends_set_2 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;lastnames&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; replace &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
microbenchmark&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;jaccard&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;friends_set_1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; friends_set_2&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## Unit: microseconds
##                                   expr    min     lq     mean  median
##  jaccard(friends_set_1, friends_set_2) 45.646 47.417 50.72362 48.4045
##       uq     max neval
##  49.9435 150.343   100&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;One operation takes about 50 microseconds on average (on my machine). If we have 100,000 people named &lt;em&gt;John Smith&lt;/em&gt; and we have to compare all pairs, the total computation &lt;strong&gt;will take more than 100 hours&lt;/strong&gt;!&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;hours &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;50&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e-6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1e5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;60&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;60&lt;/span&gt;
hours&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 138.8889&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Of course this is unacceptable, because of the &lt;script type=&quot;math/tex&quot;&gt;O(n^2)&lt;/script&gt; complexity of our brute-force algorithm.&lt;/p&gt;

&lt;h2 id=&quot;minhashing&quot;&gt;Minhashing&lt;/h2&gt;
&lt;p&gt;To solve this kind of problem we will use &lt;a href=&quot;http://en.wikipedia.org/wiki/Locality-sensitive_hashing&quot;&gt;Locality-sensitive hashing&lt;/a&gt; - a method of probabilistic dimensionality reduction for high-dimensional data. It provides a good tradeoff between accuracy and computation time and, roughly speaking, has &lt;script type=&quot;math/tex&quot;&gt;O(n)&lt;/script&gt; complexity.&lt;br /&gt;
I will explain one scheme of &lt;strong&gt;LSH&lt;/strong&gt;, called &lt;a href=&quot;http://en.wikipedia.org/wiki/MinHash&quot;&gt;MinHash&lt;/a&gt;.&lt;br /&gt;
The intuition of the method is the following: we will try to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).&lt;br /&gt;
Let’s construct a simple example:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;set1 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;SMITH&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;JOHNSON&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;WILLIAMS&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;BROWN&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
set2 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;SMITH&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;JOHNSON&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;BROWN&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
set3 &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;THOMAS&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;MARTINEZ&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;DAVIS&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
set_list &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; set2&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; set3&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we have 3 sets to compare and need to identify the profiles related to the same “John Smith”. From these sets we will construct a matrix which encodes the relations between them:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;sets_dict &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;unlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set_list&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;unique&lt;/span&gt;

m &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;f &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; dict&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;as.integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dict &lt;span class=&quot;o&quot;&gt;%in%&lt;/span&gt; set&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
         set_list&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
         MoreArgs &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;dict &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; sets_dict&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt; 
  &lt;span class=&quot;kp&quot;&gt;do.call&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;what &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# This is equal to more traditional R&amp;#39;s sapply call:&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# m &amp;lt;- sapply(set_list, FUN = function(set, dict) as.integer(dict %in% set), dict = sets_dict, simplify = T)&lt;/span&gt;

&lt;span class=&quot;kp&quot;&gt;dimnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;sets_dict&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;set&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set_list&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; sep &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;_&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##          set_1 set_2 set_3
## SMITH        1     1     0
## JOHNSON      1     1     0
## WILLIAMS     1     0     0
## BROWN        1     1     0
## THOMAS       0     0     1
## MARTINEZ     0     0     1
## DAVIS        0     0     1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let’s call this matrix the &lt;strong&gt;input-matrix&lt;/strong&gt;.
In this representation, the similarity of two sets from the source list is equal to the similarity of the two corresponding columns, computed over their non-zero rows:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;name&lt;/th&gt;
      &lt;th&gt;set_1&lt;/th&gt;
      &lt;th&gt;set_2&lt;/th&gt;
&lt;th&gt;intersection&lt;/th&gt;
      &lt;th&gt;union&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;SMITH&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;JOHNSON&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;WILLIAMS&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;BROWN&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
      &lt;td&gt;+&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;THOMAS&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MARTINEZ&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DAVIS&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;From the table above we can conclude that the &lt;strong&gt;Jaccard index between set_1 and set_2 is 0.75&lt;/strong&gt;.&lt;br /&gt;
Let’s check:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;jaccard&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; set2&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 0.75&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;column_jaccard &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; c2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  non_zero &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;which&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1 &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; c2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  column_intersect &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;non_zero&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; c2&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;non_zero&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
  column_union &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;non_zero&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;kr&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;column_intersect &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; column_union&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;isTRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;jaccard&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;set1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; set2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; column_jaccard&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; m&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;All the magic starts here. Take a random permutation of the rows of the &lt;strong&gt;input-matrix&lt;/strong&gt; &lt;code&gt;m&lt;/code&gt;, and define the &lt;strong&gt;minhash function&lt;/strong&gt; &lt;script type=&quot;math/tex&quot;&gt;h(c)&lt;/script&gt; = the number of the first row in which column &lt;script type=&quot;math/tex&quot;&gt;c == 1&lt;/script&gt;. If we use &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt; &lt;strong&gt;independent&lt;/strong&gt; permutations, we end up with &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt; minhash functions, and we can use them to construct a &lt;strong&gt;signature-matrix&lt;/strong&gt; from the &lt;strong&gt;input-matrix&lt;/strong&gt;. Below we do it not very efficiently, with 2 nested &lt;code&gt;for&lt;/code&gt; loops, but the logic should be very clear.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# for our toy example we will pick N = 4&lt;/span&gt;
N &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;
sm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;NA_integer_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; nrow &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; N&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ncol &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;ncol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
perms &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;NA_integer_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; nrow &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; ncol &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; N&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# calculate indexes for non-zero entries for each column&lt;/span&gt;
non_zero_row_indexes &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; MARGIN &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; FUN &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; which &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kr&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;kr&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; N&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# calculate permutations&lt;/span&gt;
  perm &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
  perms&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; i&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; perm
  &lt;span class=&quot;c1&quot;&gt;# fill row of signature matrix&lt;/span&gt;
  &lt;span class=&quot;kr&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;j &lt;span class=&quot;kr&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;ncol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;m&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    sm&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;i&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; j&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  &lt;span class=&quot;kp&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;perm&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;non_zero_row_indexes&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;j&lt;span class=&quot;p&quot;&gt;]]])&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;sm&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##      [,1] [,2] [,3]
## [1,]    3    3    1
## [2,]    1    1    3
## [3,]    1    1    2
## [4,]    1    1    4&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
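The two nested `for` loops above can also be collapsed into a vectorized sketch. The `minhash_signatures` helper below is hypothetical (it is not part of the post's code), but it computes the same kind of signature matrix from the same toy input-matrix:

```r
# toy input-matrix from above: rows = last names, columns = sets
m <- cbind(set_1 = c(1, 1, 1, 1, 0, 0, 0),
           set_2 = c(1, 1, 0, 1, 0, 0, 0),
           set_3 = c(0, 0, 0, 0, 1, 1, 1))

# hypothetical helper: build an N-row signature matrix from N random permutations
minhash_signatures <- function(m, N) {
  # non-zero row indexes for each column, computed once
  non_zero <- apply(m, MARGIN = 2, FUN = function(x) which(x != 0))
  # each iteration yields one row of the signature matrix
  t(sapply(1:N, function(i) {
    perm <- sample(nrow(m))
    vapply(non_zero, function(idx) min(perm[idx]), integer(1))
  }))
}

set.seed(17)
sm2 <- minhash_signatures(m, N = 4)
```

The result has one row per permutation and one column per set, just like `sm` above.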

&lt;p&gt;You can see the &lt;strong&gt;signature-matrix&lt;/strong&gt; we obtain after the “minhash transformation”. Permutations and corresponding signatures are marked with the same colors:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;perm_1&lt;/th&gt;
      &lt;th&gt;perm_2&lt;/th&gt;
      &lt;th&gt;perm_3&lt;/th&gt;
      &lt;th&gt;perm_4&lt;/th&gt;
      &lt;th&gt;set_1&lt;/th&gt;
      &lt;th&gt;set_2&lt;/th&gt;
      &lt;th&gt;set_3&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;4 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;1 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;4 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;6 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;3 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;4 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;1 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;1 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;7 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;6 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;6 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;2 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;6 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;2 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;7 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;3 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;5 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;3 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;2 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;5 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;2 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;5 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;3 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;7 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;1 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;7 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;5 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;4 &lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;set_1&lt;/th&gt;
      &lt;th&gt;set_2&lt;/th&gt;
      &lt;th&gt;set_3&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;3&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;3&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightgreen&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:orange&quot;&gt;3&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:lightblue&quot;&gt;2&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;background-color:yellow&quot;&gt;4&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;You can notice that the signatures of set_1 and set_2 are very similar, while the signature of set_3 is dissimilar from both.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;jaccard_signatures &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;  &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; c2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  column_intersect &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1 &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; c2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  column_union &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;c1&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;kr&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;column_intersect &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; column_union&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;jaccard_signatures&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;sm&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; sm&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;jaccard_signatures&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;sm&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; sm&lt;span class=&quot;p&quot;&gt;[,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] 0&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The intuition is very straightforward. Let’s look down the permuted columns &lt;script type=&quot;math/tex&quot;&gt;c_1&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;c_2&lt;/script&gt; until we detect the first &lt;strong&gt;1&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If we find ones in both columns, i.e. (1, 1), then &lt;script type=&quot;math/tex&quot;&gt;h(c_1) = h(c_2)&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;In the case of (0, 1) or (1, 0), &lt;script type=&quot;math/tex&quot;&gt;h(c_1) \neq h(c_2)&lt;/script&gt;. So the probability over all permutations of rows that &lt;script type=&quot;math/tex&quot;&gt;h(c_1) = h(c_2)&lt;/script&gt; is the same as &lt;script type=&quot;math/tex&quot;&gt;J(c_1, c_2)&lt;/script&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moreover, there exist theoretical guarantees for the estimation of Jaccard similarity: for any constant &lt;script type=&quot;math/tex&quot;&gt;\varepsilon &gt; 0&lt;/script&gt; there is a constant &lt;script type=&quot;math/tex&quot;&gt;k = O(1/\varepsilon^2)&lt;/script&gt;
such that the expected error of the estimate is at most &lt;script type=&quot;math/tex&quot;&gt;\varepsilon&lt;/script&gt;.&lt;/p&gt;
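We can check this guarantee empirically: the fraction of permutations on which two sets receive the same minhash converges to their true Jaccard index as the number of permutations grows. A rough simulation (the `estimate` helper is illustrative, not from the post):

```r
set.seed(17)
# two sets with true Jaccard similarity 3/4 (3 shared elements out of 4 total)
x <- c(1, 2, 3, 4)
y <- c(1, 2, 3)
universe <- union(x, y)

# illustrative helper: share of N random permutations with equal minhashes
estimate <- function(N) {
  mean(replicate(N, {
    perm <- sample(universe)                      # a random row ordering
    # minhash = position of the first element of the set in this ordering
    min(match(x, perm)) == min(match(y, perm))
  }))
}

estimate(10000)  # should be close to the true value 0.75
```

More permutations shrink the expected error, at the cost of a larger signature.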

&lt;h3 id=&quot;implementation-and-bottlenecks&quot;&gt;Implementation and bottlenecks&lt;/h3&gt;
&lt;p&gt;Suppose the &lt;strong&gt;input-matrix&lt;/strong&gt; is very big, say &lt;code&gt;1e9&lt;/code&gt; rows. Permuting 1 billion rows is computationally quite hard. Plus, you need to store and access all these entries. It is common to use the following scheme instead:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Pick &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt; independent hash functions &lt;script type=&quot;math/tex&quot;&gt;h_i(c)&lt;/script&gt; instead of &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt; permutations, &lt;script type=&quot;math/tex&quot;&gt;i = 1..N&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;For each column &lt;script type=&quot;math/tex&quot;&gt;c&lt;/script&gt; and each hash function &lt;script type=&quot;math/tex&quot;&gt;h_i&lt;/script&gt;, keep a “slot” &lt;script type=&quot;math/tex&quot;&gt;M(i, c)&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;script type=&quot;math/tex&quot;&gt;M(i, c)&lt;/script&gt; will become the smallest value of &lt;script type=&quot;math/tex&quot;&gt;h_i(r)&lt;/script&gt; for which column &lt;script type=&quot;math/tex&quot;&gt;c&lt;/script&gt; has 1 in row &lt;script type=&quot;math/tex&quot;&gt;r&lt;/script&gt;. I.e., &lt;script type=&quot;math/tex&quot;&gt;h_i(r)&lt;/script&gt; gives the order of rows for the &lt;script type=&quot;math/tex&quot;&gt;i^{th}&lt;/script&gt; permutation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we end up with the following &lt;strong&gt;ALGORITHM(1)&lt;/strong&gt; from the excellent &lt;a href=&quot;http://www.mmds.org&quot;&gt;Mining of Massive Datasets&lt;/a&gt; book:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;for each row r do begin
  for each hash function hi do
    compute hi(r);
  for each column c
    if c has 1 in row r
      for each hash function hi do
        if hi(r) is smaller than M(i, c) then
          M(i, c) := hi(r);
end;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
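&lt;p&gt;As a sanity check, &lt;strong&gt;ALGORITHM(1)&lt;/strong&gt; can be sketched in a few lines of plain R. This is a toy dense implementation for illustration only; the hash functions of the form h_i(r) = (a_i * r + b_i) mod p and all the sizes below are my own choices, not the book’s:&lt;/p&gt;

```r
set.seed(1)
n_rows = 7L; n_cols = 4L; n_hash = 10L
# toy 0/1 input-matrix: columns are sets, rows are universe elements
m = matrix(rbinom(n_rows * n_cols, 1L, 0.5), nrow = n_rows)

p = 13L                              # a prime not smaller than n_rows
a = sample.int(p - 1L, n_hash)       # random coefficients for h_i(r) = (a*r + b) mod p
b = sample.int(p, n_hash) - 1L

# signature matrix M(i, c); every slot starts at "infinity"
M = matrix(Inf, nrow = n_hash, ncol = n_cols)

for (r in seq_len(n_rows)) {
  h = (a * r + b) %% p               # compute h_i(r) for all i at once
  for (cc in seq_len(n_cols)) {
    if (m[r, cc] == 1L) {
      M[, cc] = pmin(M[, cc], h)     # keep the smallest hash value seen so far
    }
  }
}
M                                    # dense signature-matrix
```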

&lt;p&gt;I &lt;strong&gt;highly recommend&lt;/strong&gt; watching the video about minhashing from the Stanford &lt;a href=&quot;https://class.coursera.org/mmds-001&quot;&gt;Mining Massive Datasets&lt;/a&gt; course.&lt;/p&gt;

&lt;div align=&quot;center&quot;&gt;&lt;iframe width=&quot;854&quot; height=&quot;510&quot; src=&quot;http://www.youtube.com/embed/pqZh-Uu9VSk&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;&lt;/div&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;Let’s summarize what we have learned in the first part of the tutorial:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We can construct the &lt;strong&gt;input-matrix&lt;/strong&gt; from a given list of sets. But we didn’t actually exploit the fact that the &lt;strong&gt;input-matrix&lt;/strong&gt; is &lt;strong&gt;very sparse&lt;/strong&gt;, and constructed it as a regular dense R matrix. This is very inefficient in both computation and RAM.&lt;/li&gt;
  &lt;li&gt;We can construct a &lt;strong&gt;dense&lt;/strong&gt; signature-matrix from the &lt;strong&gt;input-matrix&lt;/strong&gt;. But we only implemented an algorithm based on permutations, which is also not very efficient.&lt;/li&gt;
  &lt;li&gt;We understand the &lt;strong&gt;theoretical guarantees&lt;/strong&gt; of our algorithm. They are proportional to the number of &lt;strong&gt;independent&lt;/strong&gt; hash functions we pick. But how will we actually construct this family of functions? How can we efficiently increase the number of functions in our family when needed?&lt;/li&gt;
  &lt;li&gt;Our &lt;strong&gt;signature-matrix&lt;/strong&gt; has a small &lt;strong&gt;fixed&lt;/strong&gt; number of rows. Each column represents an input set and &lt;script type=&quot;math/tex&quot;&gt;J(c_1, c_2)&lt;/script&gt; ~ &lt;script type=&quot;math/tex&quot;&gt;J(set_1, set_2)&lt;/script&gt;. But we &lt;strong&gt;still have &lt;script type=&quot;math/tex&quot;&gt;O(n^2)&lt;/script&gt; complexity&lt;/strong&gt;, because we need to compare each pair to find duplicate candidates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next posts I will describe how to efficiently construct and store the &lt;strong&gt;input-matrix&lt;/strong&gt; in &lt;strong&gt;sparse&lt;/strong&gt; format.
Then we will discuss how to &lt;strong&gt;construct a family of hash functions&lt;/strong&gt;. After that we will implement a &lt;strong&gt;fast vectorized&lt;/strong&gt; version of &lt;strong&gt;ALGORITHM(1)&lt;/strong&gt;. And finally we will see how to use &lt;strong&gt;Locality Sensitive Hashing&lt;/strong&gt; to determine candidate pairs for similar sets in &lt;script type=&quot;math/tex&quot;&gt;O(n)&lt;/script&gt; time. Stay tuned!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Rmongodb 1.8.0</title>
   <link href="http://dsnotes.com/blog/2014/11/02/rmongodb-1.8.0"/>
   <updated>2014-11-02T00:00:00+00:00</updated>
   <id>http://dsnotes.com/blog/2014/11/02/rmongodb-1.8.0</id>
   <content type="html">
&lt;p&gt;Today I’m introducing a new version of rmongodb (which I have started to maintain) – v1.8.0. Install it from github:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;devtools&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
install_github&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;mongosoup/rmongodb@v1.8.0&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The release version will be uploaded to CRAN shortly.
This release brings a lot of improvements to rmongodb:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Now rmongodb correctly handles arrays.
    &lt;ul&gt;
      &lt;li&gt;&lt;code&gt;mongo.bson.to.list()&lt;/code&gt; rewritten from scratch. R’s &lt;em&gt;unnamed lists&lt;/em&gt; are treated as arrays, &lt;em&gt;named lists&lt;/em&gt; as objects. It also has an option controlling whether to try to simplify vanilla lists to arrays or not.&lt;/li&gt;
      &lt;li&gt;&lt;code&gt;mongo.bson.from.list()&lt;/code&gt;  updated.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;code&gt;mongo.cursor.to.list()&lt;/code&gt; rewritten, with slightly &lt;strong&gt;changed behavior&lt;/strong&gt; – it doesn’t produce any type coercions while fetching data from the cursor.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;mongo.aggregation()&lt;/code&gt; has new options to match MongoDB 2.6+ features. Also, the second argument is now called &lt;em&gt;pipeline&lt;/em&gt; (as it is called in the MongoDB command).&lt;/li&gt;
  &lt;li&gt;new function &lt;code&gt;mongo.index.TTLcreate()&lt;/code&gt; – creates indexes with a “time to live” property.&lt;/li&gt;
  &lt;li&gt;R’s &lt;code&gt;NA&lt;/code&gt; values are now converted into MongoDB &lt;code&gt;null&lt;/code&gt; values.&lt;/li&gt;
  &lt;li&gt;many bug fixes (including trouble with installation on Windows) – see the &lt;a href=&quot;https://github.com/mongosoup/rmongodb/issues?q=milestone%3A1.8.0+is%3Aclosed&quot;&gt;full list&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
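&lt;p&gt;To make the array-vs-object distinction from item 1 concrete, here is a tiny plain-R illustration (it needs no MongoDB connection; the variable names are mine). As described above, rmongodb decides between a BSON array and a BSON object by whether the list has names:&lt;/p&gt;

```r
arr_like = list("string", 3.14, 42L)   # unnamed list: treated as a BSON array
obj_like = list(mol = 42, pi = 3.14)   # named list: treated as a BSON object

is.null(names(arr_like))    # TRUE  - would map to an array
is.null(names(obj_like))    # FALSE - would map to an object
```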

&lt;p&gt;I want to highlight some of the changes.&lt;br /&gt;
The &lt;strong&gt;first and most important&lt;/strong&gt; is that rmongodb now correctly handles arrays. This issue was very annoying for many users (including me :-). Moreover, about half of the rmongodb-related questions on &lt;a href=&quot;http://stackoverflow.com/questions/tagged/rmongodb&quot;&gt;stackoverflow&lt;/a&gt; were caused by it. In the new version of the package, &lt;code&gt;mongo.bson.to.list()&lt;/code&gt; is rewritten from scratch and &lt;code&gt;mongo.bson.from.list()&lt;/code&gt; is fixed. I have tested the new behaviour heavily and everything works smoothly. Still, it’s quite a big internal change, because these functions are workhorses for many other high-level rmongodb functions. Please test it; your &lt;em&gt;feedback is very welcome&lt;/em&gt;. For example, here is the conversion of a complex JSON document into BSON using &lt;code&gt;mongo.bson.from.JSON()&lt;/code&gt; (which internally calls &lt;code&gt;mongo.bson.from.list()&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;rmongodb&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
json_string &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;{&amp;quot;_id&amp;quot;: &amp;quot;dummyID&amp;quot;, &amp;quot;arr&amp;quot;:[&amp;quot;string&amp;quot;,3.14,[1,&amp;quot;2&amp;quot;,[3],{&amp;quot;four&amp;quot;:4}],{&amp;quot;mol&amp;quot;:42}]}&amp;#39;&lt;/span&gt;
bson &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; mongo.bson.from.JSON &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;json_string&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will produce the following MongoDB document:
&lt;code&gt;
{&quot;_id&quot;: &quot;dummyID&quot;, &quot;arr&quot;:[&quot;string&quot;,3.14,[1,&quot;2&quot;,[3],{&quot;four&quot;:4}],{&quot;mol&quot;:42}]}  
&lt;/code&gt;&lt;br /&gt;
The &lt;strong&gt;second one&lt;/strong&gt; is that &lt;code&gt;mongo.cursor.to.list()&lt;/code&gt; has new behaviour: it returns a plain list of objects without any coercion. Each element of the list corresponds to a document of the underlying query result. An additional improvement is that &lt;code&gt;mongo.cursor.to.list()&lt;/code&gt; uses R’s &lt;em&gt;environments&lt;/em&gt; to avoid extra copying, so it is now much more efficient than the previous version (especially when fetching a lot of records from MongoDB).&lt;/p&gt;

&lt;p&gt;In the next few releases I plan to upgrade the underlying &lt;a href=&quot;https://github.com/mongodb/mongo-c-driver-legacy&quot;&gt;mongo-c-driver-legacy&lt;/a&gt; to the latest version &lt;strong&gt;0.8.1&lt;/strong&gt;.&lt;/p&gt;
</content>
 </entry>
 
 
</feed>