I am currently working on a QR algorithm for very small matrices, i.e. of dimension at most 128. For a matrix-vector multiplication A*x = b, for instance, I load the whole vector x into shared memory (A resides in global memory) and compute the product as a set of scalar products: each block computes a single element of b. My problem is that I then need b to be available in shared memory again, or, if that is not possible, in global memory, so that every block has access to the complete vector b.
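A minimal sketch of the scheme described above might look like the following. The names matvec_row_per_block and matvec_on_device, the row-major layout of A, and the fixed block size of 128 threads are assumptions made for illustration, not the actual code.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kMaxN = 128;  // assumed upper bound on the matrix dimension

// Hypothetical kernel: one block per row of A, launched with kMaxN threads.
// Assumes A is n x n, row-major, and n <= kMaxN.
__global__ void matvec_row_per_block(const float* A, const float* x,
                                     float* b, int n)
{
    __shared__ float xs[kMaxN];       // this block's private copy of x
    __shared__ float partial[kMaxN];  // per-thread products for the reduction

    int row = blockIdx.x;
    int tid = threadIdx.x;

    xs[tid] = (tid < n) ? x[tid] : 0.0f;
    __syncthreads();

    partial[tid] = (tid < n) ? A[row * n + tid] * xs[tid] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // Shared memory is per block, so the result has to go to global memory
    // before any other block can read it.
    if (tid == 0) b[row] = partial[0];
}

// Hypothetical host helper; CUDA error checking is omitted for brevity.
std::vector<float> matvec_on_device(const std::vector<float>& A,
                                    const std::vector<float>& x, int n)
{
    assert(n > 0 && n <= kMaxN);
    float *d_A, *d_x, *d_b;
    cudaMalloc(&d_A, sizeof(float) * n * n);
    cudaMalloc(&d_x, sizeof(float) * n);
    cudaMalloc(&d_b, sizeof(float) * n);
    cudaMemcpy(d_A, A.data(), sizeof(float) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x.data(), sizeof(float) * n, cudaMemcpyHostToDevice);

    matvec_row_per_block<<<n, kMaxN>>>(d_A, d_x, d_b, n);

    std::vector<float> b(n);
    // A blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(b.data(), d_b, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_x); cudaFree(d_b);
    return b;
}
```

The sketch shows where the difficulty comes from: each block ends up holding only its own element of b, and shared memory cannot be read across blocks.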
I am not sure whether a kernel is allowed to read from a global memory location and then write to that same location (or the other way round) while other blocks may still be accessing it.
I have already programmed one version in which a single block computes the whole matrix-vector multiplication, but this causes a great performance loss.
Do I have to split my kernel into several kernels?
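For what it is worth, here is a hedged sketch of what the split into several kernels could look like. It relies on kernel launches issued to the same stream running in order, so that everything the first launch wrote to global memory is visible to the second. The names matvec_simple and apply_twice are hypothetical, the kernel deliberately uses a plain one-thread-per-row loop instead of shared memory, and input and output live in separate buffers so that no block reads a location another block is writing.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per row of a row-major n x n matrix.
__global__ void matvec_simple(const float* A, const float* x, float* y, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += A[row * n + col] * x[col];
    y[row] = sum;
}

// Hypothetical host helper computing c = A * (A * x) with two launches.
// CUDA error checking is omitted for brevity.
std::vector<float> apply_twice(const std::vector<float>& A,
                               const std::vector<float>& x, int n)
{
    assert(n > 0);
    float *d_A, *d_x, *d_b, *d_c;
    cudaMalloc(&d_A, sizeof(float) * n * n);
    cudaMalloc(&d_x, sizeof(float) * n);
    cudaMalloc(&d_b, sizeof(float) * n);
    cudaMalloc(&d_c, sizeof(float) * n);
    cudaMemcpy(d_A, A.data(), sizeof(float) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x.data(), sizeof(float) * n, cudaMemcpyHostToDevice);

    int threads = 32;
    int blocks = (n + threads - 1) / threads;

    // First launch: b = A * x, written to global memory.
    matvec_simple<<<blocks, threads>>>(d_A, d_x, d_b, n);
    // Second launch in the same (default) stream starts only after the first
    // has finished, so every block sees the complete vector b.
    matvec_simple<<<blocks, threads>>>(d_A, d_b, d_c, n);

    std::vector<float> c(n);
    // A blocking copy on the default stream waits for both kernels.
    cudaMemcpy(c.data(), d_c, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_x); cudaFree(d_b); cudaFree(d_c);
    return c;
}
```

The kernel boundary acts as a synchronization point across all blocks, something __syncthreads() cannot provide because it only synchronizes the threads of one block. Whether the extra launch overhead is acceptable for matrices this small would have to be measured.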
Any help or advice is appreciated.