Hi, I don’t have much experience with CUDA programming, but I’m familiar with the CUDA Programming Guide.

I’ve found an algebra problem that is interesting (to me) and that I suspect is not easy to speed up with CUDA, but I’m not sure.

The problem is:

How can I speed up repeated matrix-vector multiplications? The matrix is always the same, but the vector is different in each computation and depends on the result of the previous multiplication.

So:

M - matrix

v1, v2, …, vn - vectors (only v1 is known at the start; to use vn, you first have to compute vn-1)

The call pattern is:

```
CPU: Transfer M (matrix) to GPU
CPU: Compute v1
CPU: Call M * v1 multiplication
GPU: Multiply M * v1, store result in r1
CPU: Based on r1 and some algorithm (not important here), compute v2
CPU: Call M * v2 multiplication
GPU: Multiply M * v2, store result in r2
CPU: Based on r2 and some algorithm (not important here), compute v3
CPU: Call M * v3 multiplication
GPU: Multiply M * v3, store result in r3
.
. (etc)
.
```
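To make the dependency explicit, here is a minimal CPU-only reference sketch of the same loop in plain Python. It is not CUDA code and not meant to be fast. It only shows that step k cannot start before step k-1 has finished. `next_vector` is a hypothetical placeholder for the "some algorithm" step, which I have not described, so its body here is an arbitrary assumption.

```python
def matvec(M, v):
    """Return M * v, where M is a list of rows and v is a list of numbers."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def next_vector(r):
    # Hypothetical stand-in for the CPU-side algorithm that turns r_k
    # into v_(k+1). The real rule is not given; halving is an assumption.
    return [0.5 * x for x in r]


def run_chain(M, v1, steps):
    """Compute r1..r_steps, where each v_(k+1) depends on r_k."""
    v = v1
    results = []
    for _ in range(steps):
        r = matvec(M, v)      # the part that would run on the GPU
        results.append(r)
        v = next_vector(r)    # the part that stays on the CPU
    return results
```

For example, `run_chain([[1, 2], [3, 4]], [1, 1], 2)` computes r1 from v1, derives v2 from r1, and then computes r2. Only the sum inside each `matvec` call is independent work; the outer loop is strictly sequential.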

The matrix is 256 x 256 and each vector is 256 x 1.
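To give a feel for the scale, here is a small back-of-the-envelope calculation. It assumes single-precision (4-byte) elements, which is my assumption and may not match the real data type. The function name `problem_size` is just for illustration.

```python
def problem_size(n=256, bytes_per_element=4):
    """Memory footprint and operation count for one n x n matrix-vector product.

    bytes_per_element=4 assumes float32; this is an assumption.
    """
    matrix_bytes = n * n * bytes_per_element   # transferred to the GPU once
    vector_bytes = n * bytes_per_element       # transferred in each step
    multiplies = n * n                         # one per matrix element
    additions = n * (n - 1)                    # n - 1 additions per output row
    return {
        "matrix_bytes": matrix_bytes,
        "vector_bytes": vector_bytes,
        "flops_per_step": multiplies + additions,
    }
```

Under that assumption the matrix occupies 256 KiB, each vector 1 KiB, and one multiplication takes about 131 thousand floating-point operations, so each individual step is a fairly small amount of work.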

If anybody is curious what this is for: I want to speed up neural network computation, the main part of which can be viewed as a matrix-vector multiplication. I have to call the neural network several times, but each input depends on the previous network output.