Lessons learned from “Outbrain Click Prediction” Kaggle competition (part 2)

Read more →

Lessons learned from “Outbrain Click Prediction” Kaggle competition (part 1)

Read more →

text2vec 0.4

text2vec

Introducing text2vec 0.4

Today I’m pleased to announce a new major release of text2vec, text2vec 0.4, which is already on CRAN. For readers who are not familiar with text2vec: it is an R package that provides an efficient framework with a concise API for text analysis and natural language processing. With this release I also launched the project homepage, http://text2vec.org, where you can find up-to-date documentation and tutorials.

Read more →

text2vec 0.3

text2vec

Updated 2016-03-31: a few functions renamed.
Updated 2016-10-07: see the updated tutorial for text2vec 0.4.

Today I’m pleased to announce a preview of the new version of text2vec. It is located in the 0.3 development branch, but very soon (probably in about a week) it will be merged into master. To reproduce the examples below, please install text2vec@0.3 from GitHub: devtools::install_github('dselivanov/text2vec@0.3')

I’m also waiting for feedback from text2vec users; please spend a few minutes:

Read more →

Disclaimer: originally I planned to write a post about R functions/packages that can read data from HDFS (with benchmarks), but in the end it became more of an overview of SparkR capabilities. Nowadays working with “big data” almost always means working with the Hadoop ecosystem. A few years ago this also meant that you had to be a good Java programmer to work in such an environment: even a simple word count program took several dozen lines of code.

Read more →

Before reading this post, I strongly recommend reading:
The original GloVe paper.
Jon Gauthier’s post, which provides a detailed explanation of a Python implementation. That post helped me a lot with the C++ implementation.

Word embeddings

After Tomas Mikolov et al. released the word2vec tool, there was a boom of articles about word vector representations. One of the best is GloVe, which did a big thing by explaining how such algorithms work.

Read more →

Today I will start publishing a series of posts about experiments on the English Wikipedia. As I said before, text2vec is inspired by gensim, a well-designed and quite efficient Python library for topic modeling and related NLP tasks. I also found Radim’s posts very useful, where he evaluated some algorithms on an English Wikipedia dump. This dataset is rather big. For example, the dump for 2015-10 (which will be used below) is a 12 GB bzip2-compressed file.

Read more →

Updated 2016-10-07: see the post with the updated tutorial for text2vec 0.4.

In the last weeks I have actively worked on text2vec (formerly tmlite), an R package that provides tools for fast text vectorization and state-of-the-art word embeddings. This project is an experiment for me: what can a single person do in a particular area? After these hard weeks, I believe he can do a lot. There are a lot of changes since my previous introduction post, and I want to highlight a few of them:

Read more →

As far as I know, there are a few choices for connecting from R to MS SQL Server: RODBC, RJDBC, and rsqlserver. But only the second option can be used on Mac and Linux machines. Here is a nice Stack Overflow thread. Most people suggest using the Microsoft SQL Java driver. But there is a case where this will not help: Windows domain authentication. In this situation the only working solution I found is to use the nice jTDS.

Read more →

The main purpose of this post is to keep all the steps of installing the CUDA toolkit (and related R packages) in one place. I also hope it may be useful for someone.

Installing CUDA toolkit (Ubuntu)

First of all we need to install the NVIDIA CUDA toolkit. I’m on the latest Ubuntu 15.04, but found this article well suited for me. There are a few additions, though: it is very important to have no NVIDIA drivers installed before installation (the first time I corrupted my system and had to reinstall it :-( ).

Read more →

Introduction

In the next series of posts I will try to explain the base concepts of the Locality Sensitive Hashing technique. Note that I will try to follow a general functional programming style, so I will use R’s higher-order functions instead of the traditional R *apply family of functions (I suppose this will make the post more readable for non-R users). I will also use the brilliant pipe operator %>% from the magrittr package. We will start with basic concepts, but end with a very efficient implementation in R (it is about 100 times faster than the Python implementations I found).
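
For readers who have not met R’s higher-order functions, here is a minimal sketch of the style using base R’s Map, Filter and Reduce. The toy documents, the jaccard() helper and the choice of Jaccard similarity as the measure are assumptions made for illustration only, not code from the series; the magrittr pipe is left out so the snippet runs without extra packages.

```r
# Toy data (hypothetical): each "document" is a set of words.
docs <- list(
  a = c("the", "cat", "sat"),
  b = c("the", "cat", "ran"),
  c = c("a", "dog", "barked")
)

# Illustrative helper: Jaccard similarity between two sets.
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# Map: similarity of every document to document "a".
sims <- Map(function(d) jaccard(docs$a, d), docs)

# Filter: keep only documents that share at least one word with "a".
similar <- Filter(function(s) s > 0, sims)

# Reduce: fold all documents into one vocabulary with union().
vocab <- Reduce(union, docs)
```

Each of the three functions takes another function as its first argument, which is the property that makes this style map closely onto map/filter/reduce in other languages.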

Read more →

rmongodb 1.8.0

mongodb

Today I’m introducing a new version of rmongodb (which I have started to maintain): v1.8.0. Install it from GitHub: library(devtools); install_github("mongosoup/rmongodb@v1.8.0") The release version will be uploaded to CRAN shortly. This release brings a lot of improvements to rmongodb: rmongodb now handles arrays correctly. mongo.bson.to.list() has been rewritten from scratch. R’s unnamed lists are treated as arrays, named lists as objects. It also has an option controlling whether to try to simplify vanilla lists to arrays.

Read more →