updated 01/08/2017 – added CG solver in reco, adjusted results
As I promised in last post, I’m going to share benchmark of different implementation of matrix factorization with Weighted Alternating Least Squares. User-Item interaction matrix is made from lastfm-360K dataset. Implementations incude:
- My reco R package
- Ben Frederickson implicit python module
- Apache Spark implementation
- Quora’s qmf solver
For the transparency I’ve created repository with all the code. If you will find any flaws plese write me a message.
Benchmark set-up
- Hardware – Intel Xeon X3470, 4 physical cores, 8 threads (Nehalem architecture). All implementations used 8 threads since I found this adds some performance compared to 4 cores.
- OpenBLAS. All implementations use high-level parallelism, so in order to avoid thread contention I’ve limited BLAS threads to 1:
export OPENBLAS_NUM_THREADS=1
- In order to compare apples-to-apples all C-family imlpementations (
qmf
,rect
,implicit
) were compiled with following flags:-O3 -ffast-math -march=native -msse4.2
- Spark was compiled with native BLAS support as described in instruction
- I didn’t compare accuracy of implementations. I’m pretty sure about
reco
,implicit
andqmf
, but I’m quite sceptical about Spark. From my experience almost all algorithms from Spark’s MlLib are far from perfect (mildly speaking) - Each implementation run for 3 iterations
Results
library(ggplot2)
library(plotly)
df = read.csv("https://raw.githubusercontent.com/dselivanov/bench-wals/master/results.csv")
g = ggplot(df) +
geom_line(aes(x = rank, y = time, col = solver)) +
geom_point(aes(x = rank, y = time, col = solver))
ggplotly(g, width = 9)
Surprise
Biggest surprise is that Spark’s implementation was comparable to others! On rank = 128
it even outperforms qmf
. I don’t know exact reason, but may be qmf
doesn’t use native BLAS. Also it would be worth to check ranking accuracy of Spark’s results.
Conclusions
- Ben’s solver based on Conjugate Gradient is the fastest. reco’s Conjugate Gradient only a little bit slower (20% on rank 128).
- reco is fastest among Cholesky solvers.
Would be interesting to implement CG solver as well.– done. - At small ranks Spark is several times slower than other packages. But with larger values of rank it cathes. Seems for small ranks overhead of calling native routines is substantial.