Data Science Notes

Introducing text2vec 0.4

Today I’m pleased to announce a new major release of text2vec, text2vec 0.4, which is already on CRAN. For readers who are not familiar with text2vec: it is an R package that provides an efficient framework with a concise API for text analysis and natural language processing. With this release I also launched the project homepage, http://text2vec.org, where you can find up-to-date documentation and tutorials.

text2vec 0.3

17 Mar, 2016 text2vec

Updated 2016-03-31: a few functions renamed. Updated 2016-10-07: see the updated tutorial for text2vec 0.4. Today I’m pleased to announce a preview of the new version of text2vec. It is located in the 0.3 development branch, but very soon (probably in about a week) it will be merged into master. To reproduce the examples below, please install [email protected] from GitHub: devtools::install_github('dselivanov/[email protected]'). I’m also waiting for feedback from text2vec users, so please spend a few minutes:

Read from hdfs with R. Brief overview of SparkR.

20 Feb, 2016 Spark SparkR data_table

Disclaimer: originally I planned to write a post about R functions and packages that can read data from HDFS (with benchmarks), but in the end it became more of an overview of SparkR capabilities. Nowadays working with “big data” almost always means working with the Hadoop ecosystem. A few years ago this also meant that you had to be a good Java programmer to work in such an environment: even a simple word count program took several dozen lines of code.

text2vec GloVe implementation details

9 Jan, 2016 text2vec Rcpp RcppParallel GloVe

Before reading this post, I strongly recommend reading the original GloVe paper and Jon Gauthier’s post, which provides a detailed explanation of a Python implementation. That post helped me a lot with the C++ implementation. Word embeddings. After Tomas Mikolov et al. released the word2vec tool, there was a boom of articles about word vector representations. One of the best is GloVe, which did a big thing by explaining how such algorithms work.

GloVe vs word2vec revisited.

1 Dec, 2015 text2vec GloVe word2vec

Today I will start publishing a series of posts about experiments on the English Wikipedia. As I said before, text2vec is inspired by gensim, a well-designed and quite efficient Python library for topic modeling and related NLP tasks. I also found Radim’s posts very useful; in them he evaluated some algorithms on the English Wikipedia dump. This dataset is rather big: for example, the dump for 2015-10 (which will be used below) is a 12 GB bzip2-compressed file.

Analyzing texts with text2vec package

9 Nov, 2015 text2vec

Updated 2016-10-07: see the post with the updated tutorial for text2vec 0.4. In the last weeks I have actively worked on text2vec (formerly tmlite), an R package that provides tools for fast text vectorization and state-of-the-art word embeddings. This project is an experiment for me: what can a single person do in a particular area? After these hard weeks, I believe he can do a lot. There are a lot of changes since my previous introduction post, and I want to highlight a few of them:

Working with MS SQL server on non-windows systems

16 Jul, 2015 setup

As far as I know, there are a few choices for connecting from R to MS SQL Server: RODBC, RJDBC, and rsqlserver. But only the second option can be used on Mac and Linux machines. Here is a nice Stack Overflow thread. Most people suggest using the Microsoft SQL Java driver. But there is a case where this will not help: Windows domain authentication. In this situation the only working solution I found is to use the nice jTDS driver.

Installing cuda toolkit and related R packages

4 Jun, 2015 GPGPU setup

The main purpose of this post is to keep all the steps of installing the CUDA toolkit (and related R packages) in one place. I also hope this may be useful for someone. Installing the CUDA toolkit (Ubuntu). First of all we need to install the NVIDIA CUDA toolkit. I’m on the latest Ubuntu 15.04, but found this article well suited for me. There are a few additions, though: it is very important to have no NVIDIA drivers installed before installation (the first time I corrupted my system and had to reinstall it 🙁).

Locality Sensitive Hashing in R

2 Jan, 2015 LSH

Introduction. In the next series of posts I will try to explain the base concepts of the Locality Sensitive Hashing technique. Note that I will try to follow a general functional programming style, so I will use R’s higher-order functions instead of the traditional R *apply family of functions (I suppose this will make the post more readable for non-R users). I will also use the brilliant pipe operator %>% from the magrittr package. We will start with basic concepts, but end with a very efficient implementation in R (it is about 100 times faster than the Python implementations I found).

rmongodb 1.8.0

2 Nov, 2014 mongodb

Today I’m introducing a new version of rmongodb (which I have started to maintain): v1.8.0. Install it from GitHub: library(devtools); install_github("mongosoup/[email protected]"). The release version will be uploaded to CRAN shortly. This release brings a lot of improvements to rmongodb: rmongodb now correctly handles arrays. mongo.bson.to.list() has been rewritten from scratch. R’s unnamed lists are treated as arrays, named lists as objects. It also has an option that controls whether to try to simplify vanilla lists to arrays or not.

Matrix factorization for recommender systems (part 2)