Hi, I don’t have much experience with CUDA programming, but I’m familiar with the CUDA Programming Guide.
I’ve found an algebra problem that interests me, and I suspect it is not easy to speed up with CUDA, but I’m not sure.
The problem is:
How can I speed up repeated matrix-vector multiplications? The matrix is always the same; the vectors are different in each computation, and each one depends on the result of the previous multiplication.
M - matrix
v1, v2, …, vn - vectors (only v1 is known at the start; vn can only be computed after the multiplication with vn-1 has finished)
The call pattern is:
CPU: Transfer M (matrix) to the GPU
CPU: Compute v1
CPU: Call the M * v1 multiplication
GPU: Multiply M * v1 into r1
CPU: Based on r1 and some algorithm (not important here), compute v2
CPU: Call the M * v2 multiplication
GPU: Multiply M * v2 into r2
CPU: Based on r2 and some algorithm (not important here), compute v3
CPU: Call the M * v3 multiplication
GPU: Multiply M * v3 into r3
(etc.)
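To make the dependency chain concrete, here is a minimal CPU-only sketch of the same loop in plain C++ (not CUDA). It only illustrates the data flow I would like to speed up. The names `matvec`, `next_vector` and `run_chain` are made up for this example, and `next_vector` is just a placeholder (it halves r) standing in for the real "some algorithm" step. I also assume M is stored row-major.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Computes r = M * v, where M is an n x n matrix stored row-major
// and n is taken from the length of v.
Vec matvec(const Vec& M, const Vec& v) {
    const std::size_t n = v.size();
    Vec r(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            acc += M[i * n + j] * v[j];
        }
        r[i] = acc;
    }
    return r;
}

// Hypothetical placeholder for the "some algorithm" step that turns
// r_k into v_{k+1}. Here it simply halves every element.
Vec next_vector(const Vec& r) {
    Vec v(r.size());
    for (std::size_t i = 0; i < r.size(); ++i) {
        v[i] = 0.5f * r[i];
    }
    return v;
}

// Runs `steps` dependent multiplications starting from v (= v1) and
// returns the last result. Each multiplication needs the previous one
// to finish first, so the steps cannot run in parallel with each other.
Vec run_chain(const Vec& M, Vec v, int steps) {
    Vec r;
    for (int k = 0; k < steps; ++k) {
        r = matvec(M, v);          // the part I would like to run on the GPU
        if (k + 1 < steps) {
            v = next_vector(r);    // CPU-side step between multiplications
        }
    }
    return r;
}
```

In the GPU version, `matvec` would become the kernel (or library) call, and every loop iteration would involve sending v to the device and reading r back.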
The matrix is 256 x 256 and each vector is 256 x 1.
If anybody is curious what this is for: I want to speed up neural network computation (the main part of which can be viewed as a matrix-vector multiplication). I have to call the neural network several times, but each input depends on the previous network output.