Memory Instructions (kernel) device <-> shared, device <-> device

sicb0161 · October 10, 2007, 3:17pm

Hi all,

I am currently working on a QR algorithm for very small matrices, i.e. max 128. For instance, for a matrix-vector multiplication A*x = b, I load the whole vector x into shared memory (A resides in global memory) and compute the product as a series of scalar products. Each block computes a single element of b. My problem is that I need b to end up in shared memory again or, if that is not possible, in global memory, so that every block has access to b.

Well, I am not sure whether I am allowed to read from global memory and then write to the same global memory location, or vice versa.

I have already programmed one version in which a single block computes the whole matrix-vector multiplication, but this causes a great performance loss.

Do I have to split my kernel into several kernels?

Any help or advice is appreciated.

cem

MisterAnderson42 · October 10, 2007, 3:59pm

You can write back to the same global memory location, but you can only be sure that the write has completed when the kernel finishes executing.

sicb0161 · October 10, 2007, 4:15pm

Thanks for the answer, but I need the results before the kernel ends, as they are just intermediate results for the computation. :-(

What should I do?

Jeff_hagen · October 10, 2007, 5:01pm

Right now the only way to be sure the write has finished without stopping and relaunching the kernel is to run your computation in a single block and call __syncthreads() when you are done.

This is not an optimal solution as you can get much more performance out of the G80 when you are running many blocks as opposed to only one.

Maybe you can break your program down into parts, where each part only has dependencies on things done within its own block?
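For reference, the single-block approach described above might look roughly like the sketch below. It chains two products, b = A*x and c = A*b, inside one block, with __syncthreads() between the stages so that every thread sees all of b in shared memory. The names two_matvecs_single_block, run_single_block and MAX_N are made up for illustration, A is assumed to be a row-major n x n matrix with n <= 128, and the second product is only a stand-in for whatever step consumes b.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define MAX_N 128  // assumed upper bound on n (the "max 128" from the question)

// Launched with ONE block of MAX_N threads; thread i handles row i.
__global__ void two_matvecs_single_block(const float *A, const float *x,
                                         float *c, int n)
{
    __shared__ float xs[MAX_N];  // copy of x
    __shared__ float bs[MAX_N];  // intermediate result b = A*x
    int i = threadIdx.x;

    if (i < n) xs[i] = x[i];
    __syncthreads();             // all of x is now in shared memory

    if (i < n) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j) sum += A[i * n + j] * xs[j];
        bs[i] = sum;
    }
    __syncthreads();             // all of b is now visible to the whole block

    if (i < n) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j) sum += A[i * n + j] * bs[j];
        c[i] = sum;              // c = A*b
    }
}

// Hypothetical host helper: copies the inputs over, runs the kernel, copies c back.
cudaError_t run_single_block(const float *hA, const float *hx, float *hc, int n)
{
    assert(n > 0 && n <= MAX_N);
    float *dA = 0, *dx = 0, *dc = 0;
    size_t vec = n * sizeof(float);
    cudaMalloc((void **)&dA, n * vec);
    cudaMalloc((void **)&dx, vec);
    cudaMalloc((void **)&dc, vec);
    cudaMemcpy(dA, hA, n * vec, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, vec, cudaMemcpyHostToDevice);

    two_matvecs_single_block<<<1, MAX_N>>>(dA, dx, dc, n);

    // Blocking copy: it does not start until the kernel has finished.
    cudaError_t err = cudaMemcpy(hc, dc, vec, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dx);
    cudaFree(dc);
    return err;
}
```

Because only one block is launched, only one multiprocessor does any work, which is the performance loss mentioned earlier in the thread.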

sicb0161 · October 10, 2007, 5:14pm

Thanks for the reply, but I think the only way to handle this kind of situation is to resolve the dependencies by separating the code into two kernel invocations. I don't think I have any other choice. However, if one deals with bigger matrices, loading the whole vector into shared memory is no longer possible, so other issues become much more important.

Thanks again for the reply.
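A minimal sketch of the two-invocation approach, under some assumptions: each block computes one element of the result as a scalar product (as in the first post), A is a row-major n x n matrix with n <= 128, and the second launch simply reuses the same kernel with b as its input vector. The names matvec_row_per_block, run_two_kernels and THREADS are invented for this example. Launches issued to the same stream run in order, so the second kernel only starts once the first has finished and all of b is in global memory.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define THREADS 128  // threads per block; a power of two, assumed >= n

// One block per row: block `row` computes out[row] = dot(A[row, :], in).
// Each thread reads only its own element of `in`, so this sketch does not
// stage the vector in shared memory; only the partial products live there.
__global__ void matvec_row_per_block(const float *A, const float *in,
                                     float *out, int n)
{
    __shared__ float partial[THREADS];
    int row = blockIdx.x;
    int t = threadIdx.x;

    partial[t] = (t < n) ? A[row * n + t] * in[t] : 0.0f;
    __syncthreads();

    // Tree reduction of the partial products in shared memory.
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();
    }
    if (t == 0) out[row] = partial[0];
}

// Hypothetical host helper: computes c = A*(A*x) with two kernel launches.
cudaError_t run_two_kernels(const float *hA, const float *hx, float *hc, int n)
{
    assert(n > 0 && n <= THREADS);
    float *dA = 0, *dx = 0, *db = 0, *dc = 0;
    size_t vec = n * sizeof(float);
    cudaMalloc((void **)&dA, n * vec);
    cudaMalloc((void **)&dx, vec);
    cudaMalloc((void **)&db, vec);
    cudaMalloc((void **)&dc, vec);
    cudaMemcpy(dA, hA, n * vec, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, vec, cudaMemcpyHostToDevice);

    // Launch 1: b = A*x, n blocks, one element of b per block.
    matvec_row_per_block<<<n, THREADS>>>(dA, dx, db, n);
    // Launch 2: c = A*b. It starts after launch 1 has completed, so every
    // block can safely read any element of b from global memory.
    matvec_row_per_block<<<n, THREADS>>>(dA, db, dc, n);

    // Blocking copy: it does not start until both kernels have finished.
    cudaError_t err = cudaMemcpy(hc, dc, vec, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dx);
    cudaFree(db);
    cudaFree(dc);
    return err;
}
```

The intermediate vector b never leaves the device between the two launches, so the extra cost is the second launch itself rather than any host transfer.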
