Is it possible to speed up multiple matrix-vector multiplications using CUDA?

Hi, I don’t have much experience with CUDA programming, but I’m familiar with the CUDA Programming Guide.

I’ve run into a linear algebra problem that I find interesting, which I suspect is not easy to speed up with CUDA, but I’m not sure.

The problem is:

How can a sequence of matrix-vector multiplications be sped up when the matrix is always the same, but the vectors are different in each computation and depend on the result of the previous multiplication?

So:

M - matrix

v1, v2, …, vn - vectors (only v1 is known at the beginning; to use vn, you first have to compute vn-1)

The call pattern is:

CPU: Transfer M (matrix) to GPU

CPU: Compute v1

CPU: Call M * v1 multiplication

   GPU: Multiply M * v1, producing r1

CPU: Based on r1 and some algorithm (not important here), compute v2

CPU: Call M * v2 multiplication

   GPU: Multiply M * v2, producing r2

CPU: Based on r2 and some algorithm (not important here), compute v3

CPU: Call M * v3 multiplication

   GPU: Multiply M * v3, producing r3

… (and so on)

The matrix size is 256 x 256 and the vectors are 256 x 1.
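
For reference, here is a minimal sketch of that pattern using cuBLAS (the handle-based v2 API). `compute_next_vector` and `STEPS` are placeholders I made up for the unspecified CPU-side algorithm and iteration count:

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

#define N     256   /* matrix is N x N, vectors are N x 1 */
#define STEPS 100   /* hypothetical number of iterations  */

/* Placeholder for the CPU-side algorithm that derives v_{k+1} from r_k. */
static void compute_next_vector(const float *r, float *v)
{
    for (int i = 0; i < N; ++i)
        v[i] = r[i];   /* stand-in: the real update rule goes here */
}

int main(void)
{
    static float h_M[N * N], h_v[N], h_r[N];
    float *d_M, *d_v, *d_r;
    const float one = 1.0f, zero = 0.0f;

    for (int i = 0; i < N * N; ++i) h_M[i] = 1.0f / N;  /* dummy matrix */
    for (int i = 0; i < N; ++i)     h_v[i] = 1.0f;      /* initial v1   */

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaMalloc((void **)&d_M, N * N * sizeof(float));
    cudaMalloc((void **)&d_v, N * sizeof(float));
    cudaMalloc((void **)&d_r, N * sizeof(float));

    /* Transfer M once; it never changes. */
    cudaMemcpy(d_M, h_M, N * N * sizeof(float), cudaMemcpyHostToDevice);

    for (int k = 0; k < STEPS; ++k) {
        cudaMemcpy(d_v, h_v, N * sizeof(float), cudaMemcpyHostToDevice);

        /* r = 1.0 * M * v + 0.0 * r  (M stored column-major) */
        cublasSgemv(handle, CUBLAS_OP_N, N, N,
                    &one, d_M, N, d_v, 1, &zero, d_r, 1);

        cudaMemcpy(h_r, d_r, N * sizeof(float), cudaMemcpyDeviceToHost);
        compute_next_vector(h_r, h_v);   /* CPU: derive v_{k+1} from r_k */
    }

    cudaFree(d_M);
    cudaFree(d_v);
    cudaFree(d_r);
    cublasDestroy(handle);
    return 0;
}
```

After the one-time matrix upload, only 2 x 256 floats cross the PCIe bus per iteration, so the per-call overhead (launch latency plus two small copies) is what you’d be trying to amortize.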

If anybody is curious what this is for: I want to speed up neural network computation (whose main part can be viewed as a matrix-vector multiplication). I have to call the neural network a few times, but each input depends on the previous network output.

A 256 x 256 matrix-vector product is rather small for a GT200 or GF100 based GPU: in single precision the matrix is only 256 KB, and each product costs roughly 2 * 256 * 256 ≈ 131K flops, so kernel-launch and transfer overheads dominate the arithmetic. It might be a little faster than the host CPU, but not spectacularly so.

The description of the solution sequence sounds very similar to the stages of a diagonally implicit Runge-Kutta method. If it is, it might be possible to re-formulate it (depending on how many stages there are) as a single large block sparse linear system and solve it iteratively using a Newton method. That sort of problem might be much better suited to solving on the GPU.
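
Roughly, if the update from rk to vk+1 can be written as a differentiable map f (an assumption on my part; the question leaves the algorithm unspecified), the whole sequence can be stacked into one nonlinear system:

$$
F(v_2,\dots,v_n) \;=\; \begin{pmatrix} v_2 - f(M v_1) \\ v_3 - f(M v_2) \\ \vdots \\ v_n - f(M v_{n-1}) \end{pmatrix} \;=\; 0,
$$

whose Jacobian is block lower bidiagonal with 256 x 256 blocks. Each Newton iteration then solves one large block sparse linear system, which exposes far more parallel work per GPU call than a single 256-element matrix-vector product.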