As I promised in my last post, I'm going to share benchmarks of different implementations of matrix factorization with Weighted Alternating Least Squares. The user-item interaction matrix is built from the lastfm-360K dataset. Implementations include:
- My reco R package
- Ben Frederickson's implicit Python module
- Apache Spark implementation
- Quora’s qmf solver
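For reference, all of these solvers target the standard implicit-feedback WALS formulation (in the Hu, Koren and Volinsky form; this is the textbook objective, not something taken from the benchmark repository):

$$
\min_{X, Y} \sum_{u,i} c_{ui}\left(p_{ui} - x_u^\top y_i\right)^2 + \lambda\left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right)
$$

Alternating between fixed item factors $Y$ and fixed user factors $X$, each exact update solves a regularized least-squares problem, e.g. $x_u = \left(Y^\top C_u Y + \lambda I\right)^{-1} Y^\top C_u p_u$ for user $u$; the CG-based solver approximates this linear solve iteratively instead of inverting.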
For transparency, I've created a repository with all the code. If you find any flaws, please write me a message.
- Hardware: Intel Xeon X3470, 4 physical cores / 8 threads (Nehalem architecture). All implementations used 8 threads, since I found this gives some extra performance compared to 4 threads.
- BLAS: OpenBLAS. All implementations parallelize at a high level, so to avoid thread contention I limited BLAS to a single thread:
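A minimal way to do this, assuming OpenBLAS is picked up as the BLAS backend and reads its standard environment variable:

```shell
# Pin OpenBLAS to a single thread so the solvers' own high-level
# parallelism (8 worker threads) does not contend with BLAS threads.
export OPENBLAS_NUM_THREADS=1
```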
- To compare apples-to-apples, all C-family implementations (implicit) were compiled with the following flags:
```
-O3 -ffast-math -march=native -msse4.2
```
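As a sketch, for a from-source build these flags can be passed through the usual compiler environment variables (the `pip` invocation mentioned in the comment is illustrative, not necessarily the exact command used for the benchmark):

```shell
# Export the optimization flags so a subsequent source build
# (e.g. `pip install --no-binary :all: implicit`) picks them up.
export CFLAGS="-O3 -ffast-math -march=native -msse4.2"
export CXXFLAGS="$CFLAGS"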
- Spark was compiled with native BLAS support as described in the instructions
- I didn't compare the accuracy of the implementations. I'm pretty sure about qmf, but quite sceptical about Spark: in my experience, almost all algorithms in Spark's MLlib are far from perfect (to put it mildly)
- Each implementation ran for 3 iterations
```r
library(ggplot2)
library(plotly)
df = read.csv("https://raw.githubusercontent.com/dselivanov/bench-wals/master/results.csv")
g = ggplot(df) +
  geom_line(aes(x = rank, y = time, col = solver)) +
  geom_point(aes(x = rank, y = time, col = solver))
ggplotly(g, width = 9)
```
The biggest surprise is that Spark's implementation was comparable to the others! At rank = 128 it even outperforms qmf. I don't know the exact reason; maybe qmf doesn't use native BLAS. It would also be worth checking the ranking accuracy of Spark's results.
- Ben's solver, based on Conjugate Gradient, is the fastest
- My reco is the fastest among the exact solvers. It would be interesting to implement a CG solver as well
- At small ranks Spark is several times slower than the other packages, but with larger values of rank it catches up. It seems that for small ranks the overhead of calling native routines is substantial