Accelerate Recommender Systems with GPUs

Originally published at: https://developer.nvidia.com/blog/accelerate-recommender-systems-with-gpus/

Wei Tan, Research Staff Member at the IBM T. J. Watson Research Center, shares how IBM is using NVIDIA GPUs to accelerate recommender systems, which use ratings or user behavior to recommend new products, items, or content to users. Recommender systems are important in applications such as recommending products on retail sites, recommending movies or music on…

I am just slightly disappointed that our work did not get this kind of attention: http://dx.doi.org/10.1109/B...

I looked at the code and am wondering how to choose the optimal number of conjugate gradient (CG) iterations. I see that the default is 6. Is there any experimentation or guidance on how many is best?

It would seem to me that a CG iteration whose relative reduction in the residual is much less than the first reduction on the next 'side' of the alternation is likely to get wiped out. It might therefore make sense to stop CG iterating once the fractional reduction in residual norm drops below that of the last iteration done for the other side.
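For concreteness, here is a minimal NumPy sketch of the kind of stopping rule I have in mind (the function and the `min_reduction` threshold are illustrative, not taken from the cuMF code):

```python
import numpy as np

def cg_solve(A, b, x0, max_iters=6, min_reduction=None):
    """Conjugate gradient for A x = b, with A symmetric positive definite.

    Optionally stops early once the fractional reduction of the residual
    norm in an iteration drops below min_reduction, e.g. the reduction
    achieved by the first iteration on the other 'side' of the ALS sweep.
    """
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iters):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if min_reduction is not None:
            # fractional reduction of the residual norm in this iteration
            reduction = 1.0 - np.sqrt(rs_new / rs_old)
            if reduction < min_reduction:
                break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```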

On the other hand, I'm guessing that it is more computationally efficient to run multiple iterations on the same matrix than to keep switching between them.

Also, minor edit, I think: "where n_{x_u} and n_{\theta_v} denote the number of total ratings on user u and item v, respectively."
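For context, the sentence I am referring to presumably comes from the weighted-λ-regularization form of ALS (Zhou et al. 2008); in that notation (which I am assuming here, since the post does not spell it out), the two alternating updates read:

```latex
% ALS updates with weighted-\lambda regularization (Zhou et al. 2008).
% \Theta_{I_u} stacks the factors of items rated by user u, and
% X_{I_v} the factors of users who rated item v.
x_u      = \left(\Theta_{I_u}^\top \Theta_{I_u} + \lambda\, n_{x_u} I\right)^{-1} \Theta_{I_u}^\top r_u
\theta_v = \left(X_{I_v}^\top X_{I_v} + \lambda\, n_{\theta_v} I\right)^{-1} X_{I_v}^\top r_v
```

where n_{x_u} and n_{\theta_v} are the rating counts for user u and item v, which is why the sentence should read as above.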

Thanks!

Thanks for the comment, John, and apologies for the late reply.
As for the CG iterations, we tested on three data sets (Netflix, YahooMusic, and HugeWiki) and found that <=10 iterations is good enough.
I am not quite sure what you mean by "a CG iteration whose relative reduction in the residual is much less than the first reduction on the next 'side' of the alternation is likely to get wiped out."
Could you explain more?

"I'm guessing that it is more computationally-efficient to run multiple
iterations on the same matrix than switching between them." -- solving one matrix at a time is not able to saturate GPU so we run batch solvers. At the end of the day, CG kernel is memory bandwidth bounded and computation is very light (vector inner product and vector times scalar).
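To illustrate the batching idea, here is a toy NumPy sketch (illustrative only, not the actual cuMF kernels). Each CG step reduces to a batched matrix-vector product plus per-system inner products and scalar-times-vector updates, so many small f-by-f systems are solved in lockstep:

```python
import numpy as np

def batched_cg(A, B, iters=6):
    """Solve A[i] @ X[i] = B[i] for a batch of small SPD systems.

    A has shape (batch, f, f) and B has shape (batch, f). Every step is
    a batched matvec plus inner products and axpy-style updates --
    light arithmetic dominated by memory traffic, which is why batching
    many systems is what keeps a GPU busy.
    """
    X = np.zeros_like(B)
    R = B.copy()
    P = R.copy()
    rs_old = np.einsum('bi,bi->b', R, R)
    for _ in range(iters):
        AP = np.einsum('bij,bj->bi', A, P)            # batched matvec
        alpha = rs_old / np.einsum('bi,bi->b', P, AP)
        X += alpha[:, None] * P
        R -= alpha[:, None] * AP
        rs_new = np.einsum('bi,bi->b', R, R)
        P = R + (rs_new / rs_old)[:, None] * P
        rs_old = rs_new
    return X
```

On the GPU, each small system would be mapped to something like one thread block; the NumPy version just shows the arithmetic pattern.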

Thanks for identifying the typo!
Wei