Benchmarking different implementations of weighted-ALS matrix factorization

updated 01/08/2017 – added CG solver in reco, adjusted results

As I promised in last post, I’m going to share benchmark of different implementation of matrix factorization with Weighted Alternating Least Squares. User-Item interaction matrix is made from lastfm-360K dataset. Implementations incude:

My reco R package
Ben Frederickson implicit python module
Apache Spark implementation
Quora’s qmf solver

For the transparency I’ve created repository with all the code. If you will find any flaws plese write me a message.

Benchmark set-up

Hardware – Intel Xeon X3470, 4 physical cores, 8 threads (Nehalem architecture). All implementations used 8 threads since I found this adds some performance compared to 4 cores.
OpenBLAS. All implementations use high-level parallelism, so in order to avoid thread contention I’ve limited BLAS threads to 1: export OPENBLAS_NUM_THREADS=1
In order to compare apples-to-apples all C-family imlpementations (qmf, rect, implicit) were compiled with following flags: -O3 -ffast-math -march=native -msse4.2
Spark was compiled with native BLAS support as described in instruction
I didn’t compare accuracy of implementations. I’m pretty sure about reco, implicit and qmf, but I’m quite sceptical about Spark. From my experience almost all algorithms from Spark’s MlLib are far from perfect (mildly speaking)
Each implementation run for 3 iterations

Results

library(ggplot2)
library(plotly)
df = read.csv("https://raw.githubusercontent.com/dselivanov/bench-wals/master/results.csv")
g = ggplot(df) + 
  geom_line(aes(x = rank, y = time, col = solver)) + 
  geom_point(aes(x = rank, y = time, col = solver))
ggplotly(g, width = 9)

Surprise

Biggest surprise is that Spark’s implementation was comparable to others! On rank = 128 it even outperforms qmf. I don’t know exact reason, but may be qmf doesn’t use native BLAS. Also it would be worth to check ranking accuracy of Spark’s results.

Conclusions

Ben’s solver based on Conjugate Gradient is the fastest. reco’s Conjugate Gradient only a little bit slower (20% on rank 128).
reco is fastest among Cholesky solvers. ~~Would be interesting to implement CG solver as well.~~ – done.
At small ranks Spark is several times slower than other packages. But with larger values of rank it cathes. Seems for small ranks overhead of calling native routines is substantial.