I am currently working on a QR algorithm for very small matrices, of dimension at most 128. For a matrix-vector multiplication A*x = b, I load the whole vector x into shared memory (A resides in global memory) and compute the product as a set of scalar (dot) products. Each block computes a single element of b. My problem is that I need b to end up in shared memory again or, if that is not possible, in global memory, so that every block has access to b.
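A minimal sketch of the setup described above, assuming N = 128, a row-major A, and one thread per column. The names matvec_kernel and matvec are made up for illustration, and error checking is left out:

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 128;  // assumed matrix dimension, also used as the block size

// One block per row: block i computes b[i] = dot(A[i, :], x).
__global__ void matvec_kernel(const float *A, const float *x, float *b)
{
    __shared__ float xs[N];       // copy of x in shared memory
    __shared__ float partial[N];  // one product per thread

    const int row = blockIdx.x;
    const int t   = threadIdx.x;

    xs[t] = x[t];
    partial[t] = A[row * N + t] * xs[t];
    __syncthreads();

    // Tree reduction of the N products (N is a power of two here).
    for (int s = N / 2; s > 0; s >>= 1) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();
    }
    if (t == 0) b[row] = partial[0];  // result goes to global memory
}

// Hypothetical host wrapper; A holds N*N floats, x holds N floats.
std::vector<float> matvec(const std::vector<float> &A, const std::vector<float> &x)
{
    float *dA, *dx, *db;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dx, N * sizeof(float));
    cudaMalloc(&db, N * sizeof(float));
    cudaMemcpy(dA, A.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    matvec_kernel<<<N, N>>>(dA, dx, db);  // N blocks, N threads each

    std::vector<float> b(N);
    // This blocking copy waits for the kernel to finish first.
    cudaMemcpy(b.data(), db, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(db);
    return b;
}
```

Each block writes only its own b[row] to global memory, so inside this kernel no block can rely on the other entries of b being there yet.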

I am not sure whether I am allowed to read from a global memory location and then write to that same location, or vice versa.

I have already programmed a version in which a single block computes the whole matrix-vector multiplication, but this causes a great performance loss.

Do I have to split my kernel into several kernels?

Right now the only way to be sure that the computation has finished, without stopping and relaunching the kernel, is to run it in a single block and call __syncthreads() when you are done.
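A rough sketch of that single-block approach, assuming N = 128, a row-major A, and a hypothetical follow-up step c = A*b that needs all of b. The names two_products_one_block and run_single_block are made up for illustration, and error checking is left out:

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 128;  // assumed matrix dimension, also used as the block size

// Single block: thread t computes b[t], then c[t] = dot(A[t, :], b).
__global__ void two_products_one_block(const float *A, const float *x, float *c)
{
    __shared__ float xs[N];  // x in shared memory
    __shared__ float bs[N];  // b stays in shared memory

    const int t = threadIdx.x;
    xs[t] = x[t];
    __syncthreads();  // all of x is loaded

    float sum = 0.0f;
    for (int j = 0; j < N; ++j) sum += A[t * N + j] * xs[j];
    bs[t] = sum;
    __syncthreads();  // all of b is complete and visible to the whole block

    sum = 0.0f;
    for (int j = 0; j < N; ++j) sum += A[t * N + j] * bs[j];
    c[t] = sum;
}

// Hypothetical host wrapper; A holds N*N floats, x holds N floats.
std::vector<float> run_single_block(const std::vector<float> &A,
                                    const std::vector<float> &x)
{
    float *dA, *dx, *dc;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dx, N * sizeof(float));
    cudaMalloc(&dc, N * sizeof(float));
    cudaMemcpy(dA, A.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    two_products_one_block<<<1, N>>>(dA, dx, dc);  // one block only

    std::vector<float> c(N);
    cudaMemcpy(c.data(), dc, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dc);
    return c;
}
```

__syncthreads() is a barrier only for the threads of one block, which is why this works with a single block and not across several.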

This is not an optimal solution as you can get much more performance out of the G80 when you are running many blocks as opposed to only one.

Maybe you can break your program down into parts, where each part only has dependencies on things done within its own block?

Thanks for the reply, but I think the only way to handle this kind of setup is to resolve the dependencies by separating the code into 2 kernel invocations. I don't think I have any other choice. However, with bigger matrices it is no longer possible to load the whole vector into shared memory, so other issues become much more important.
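A minimal sketch of the two-invocation split, assuming N = 128, a row-major A, and a hypothetical second step c = A*b that needs all of b. Kernels launched into the same stream run in order, so the second launch starts only after the first has finished writing b to global memory. The names matvec_kernel and matvec_twice are made up for illustration, and error checking is left out:

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 128;  // assumed matrix dimension, also used as the block size

// One block per row: block i computes out[i] = dot(A[i, :], in).
__global__ void matvec_kernel(const float *A, const float *in, float *out)
{
    __shared__ float partial[N];
    const int row = blockIdx.x;
    const int t   = threadIdx.x;

    partial[t] = A[row * N + t] * in[t];
    __syncthreads();

    // Tree reduction of the N products (N is a power of two here).
    for (int s = N / 2; s > 0; s >>= 1) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();
    }
    if (t == 0) out[row] = partial[0];
}

// Hypothetical host wrapper: computes b = A*x, then c = A*b.
std::vector<float> matvec_twice(const std::vector<float> &A,
                                const std::vector<float> &x)
{
    float *dA, *dx, *db, *dc;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dx, N * sizeof(float));
    cudaMalloc(&db, N * sizeof(float));
    cudaMalloc(&dc, N * sizeof(float));
    cudaMemcpy(dA, A.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    // First invocation: every block writes its element of b to global memory.
    matvec_kernel<<<N, N>>>(dA, dx, db);
    // Second invocation: runs after the first, so all of b is available.
    matvec_kernel<<<N, N>>>(dA, db, dc);

    std::vector<float> c(N);
    cudaMemcpy(c.data(), dc, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(db); cudaFree(dc);
    return c;
}
```

The kernel boundary takes the place of a barrier across blocks; the price is one extra launch and a round trip of b through global memory.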