Hi, I don’t have much experience with CUDA programming, but I’m familiar with the CUDA Programming Guide.
I’ve come across a linear algebra problem that I find interesting, which I suspect is not easy to speed up with CUDA, but I’m not sure.
The problem is:
How can I speed up a sequence of matrix-vector multiplications? The matrix is always the same, but the vectors differ in each computation and depend on the result of the previous multiplication.
So:
M - matrix
v1, v2, …, vn - vectors (only v1 is known at the beginning; to use vn, you first have to compute vn-1)
The call pattern is:
CPU: Transfer M (matrix) to GPU
CPU: Compute v1
CPU: Call M * v1 multiplication
GPU: Multiply M * v1 into r1
CPU: Based on r1 and some algorithm (not important here), compute v2
CPU: Call M * v2 multiplication
GPU: Multiply M * v2 into r2
CPU: Based on r2 and some algorithm (not important here), compute v3
CPU: Call M * v3 multiplication
GPU: Multiply M * v3 into r3
.
. (etc)
.
The matrix size is 256 x 256 and the vector is 256 x 1.
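To make the pattern concrete, here is a minimal host-side sketch of the loop above using cuBLAS (cublasSgemv). It assumes single-precision data, column-major storage of M, and a hypothetical compute_next_vector() standing in for the CPU-side algorithm (which is not important here):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

const int N = 256;  // matrix is N x N, vectors are N x 1

// Placeholder for the CPU-side step that derives v(k+1) from r(k);
// the real algorithm is application-specific (here just an identity copy).
static void compute_next_vector(const float *r, float *v)
{
    for (int i = 0; i < N; ++i)
        v[i] = r[i];
}

void run_iterations(const float *h_M, float *h_v, int iterations)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *d_M, *d_v, *d_r;
    cudaMalloc(&d_M, N * N * sizeof(float));
    cudaMalloc(&d_v, N * sizeof(float));
    cudaMalloc(&d_r, N * sizeof(float));

    // Transfer M once -- it never changes.
    cudaMemcpy(d_M, h_M, N * N * sizeof(float), cudaMemcpyHostToDevice);

    float h_r[N];
    const float alpha = 1.0f, beta = 0.0f;

    for (int k = 0; k < iterations; ++k) {
        // Send the current vector, compute r = M * v on the GPU,
        // then fetch r back so the CPU can compute the next vector.
        cudaMemcpy(d_v, h_v, N * sizeof(float), cudaMemcpyHostToDevice);
        cublasSgemv(handle, CUBLAS_OP_N, N, N,
                    &alpha, d_M, N, d_v, 1, &beta, d_r, 1);
        cudaMemcpy(h_r, d_r, N * sizeof(float), cudaMemcpyDeviceToHost);

        compute_next_vector(h_r, h_v);  // CPU step: v(k+1) from r(k)
    }

    cudaFree(d_M);
    cudaFree(d_v);
    cudaFree(d_r);
    cublasDestroy(handle);
}
```

Since M is transferred only once and stays resident on the GPU, each iteration only moves two 256-float vectors across the bus, plus the launch overhead of the cublasSgemv call.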
If somebody is curious what this is for: I want to speed up neural network computation (whose main part can be viewed as a matrix-vector multiplication). I have to call the neural network a few times, but each input depends on the previous network output.