Copying data into shared memory

Hi all,

This might be a terribly newbie question, but I can’t seem to be able to copy the contents of an array in global memory to an array in shared memory in a device function.

I have an array of spheres in global memory and I want to copy them into shared memory.

Within the device function, the code goes like this:

__shared__ Sphere shSpheres[51];

if (indx < 51)
    shSpheres[indx] = spheres.elements[indx];

In case you are wondering, I’m running this code on a 2D grid and indx is just the “element” inside the grid. It’s being calculated like this: indx = (blockIdx.x * blockDim.x + threadIdx.x) * W + (blockIdx.y * blockDim.y + threadIdx.y), with W = 512.
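Spelled out, the index computation looks roughly like this (W = 512, as above; note that indx is a grid-wide linear index, so at most 51 threads in the entire grid pass the indx < 51 test):

```cuda
// Grid-wide linear index for a 2D grid of 2D blocks.
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int indx = x * W + y;   // W = 512
```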

After running this code, the contents of the shSpheres array seem garbled.

Is this the correct way to copy contents of global memory into shared memory? Any ideas on what the problem could be?

Thanks in advance,


In spheres.elements, have you checked that spheres actually has the data you think it has?

Run under the debugger (or use printf, available in CUDA 2.1) and check that you assign the data you think you assign.

Hi Letharion, thank you for your quick reply.

I am sure spheres.elements has the data, because I’m doing some computations with that array a few lines below the snippet I pasted. It works great if I use spheres.elements, but it utterly fails when I replace spheres.elements with shSpheres.

if (intersect(base, dir, &(spheres.elements[i]), t0, t1) != 0) // works as expected
{ ... }

if (intersect(base, dir, &(shSpheres[i]), t0, t1) != 0) // errors in computations
{ ... }

I’m using CUDA 2.2, and I don’t have the debugger installed (I could not find a version for Ubuntu). I tried using -deviceemu, but it seems broken.


Here is a link about the debugger and ubuntu that might help you:

Otherwise you could use 2.1 and emulation. -deviceemu is indeed “broken”, but deliberately so by NVIDIA. You will need to go back to 2.1 and install the debugger, wait for 2.3 and hope for the best, or figure it out some other way :) I strongly recommend the debugger; it’s quite buggy :( but still highly useful.

Hi Letharion, thanks again for your very quick answer!

I’ve managed to make the 64-bit RHEL5 cuda-gdb debugger run on my Ubuntu 8.10 by following the above instructions, and now I’ve been able to verify that the data does not get copied at all:

(cuda-gdb) print shSpheres[i]
$5 = {x = 0, y = 0, z = 0, radius = 0, r = 4.09072034e-36, g = 0, b = 4.04370057e-36}

(cuda-gdb) print spheres.elements[i]
$6 = {x = 0, y = 0, z = 1.75, radius = 0.5, r = 1, g = 0, b = 0}

I’ve been thinking about it, and I believe the problem could be my array copying, which executes only for the first 51 threads. Even though I am calling __syncthreads(), it is not guaranteed that all thread blocks get their shared memory set. Does this make sense? Could it be a possible explanation?


Are you sure you’re indexing your shared memory correctly? It’s usually not enough to simply replace “spheres.elements” with “shSpheres”, the indexing needs to be adjusted too.


Okay, I think I’ve found the problem. I was indeed assigning the shared memory for just the first 51 threads, which was not enough to get the data copied into all the shared memory chips on the GPU. Iterating over the spheres.elements array and copying into the shared shSpheres array in every thread gets all the data loaded into all chips. The question now is whether there is a way to do this efficiently (since now the data is obviously being copied from global memory more than once) without having to know the number of processors the kernel is running on.

Best regards,


I very much doubt that explanation can be correct. You are aware that the contents of shared memory are transient and only have the lifetime and scope of the block in which they are assigned? So if you were expecting to use some threads from a single block to populate shared memory for your whole execution grid, be prepared to be disappointed.

Hi Avidday,

Yes, that was indeed the problem. My logic was failing to ensure that at least one thread in each block got the contents of spheres.elements copied into the shared array. It is interesting to note (from your post) that shared memory has the life cycle of a block, not a kernel launch.

What I am doing now is to have thread (0,0) of each block copy the array; that way I know each block will have its shared data populated.
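In other words, something like this (a sketch only; Sphere and spheres are as in my earlier posts):

```cuda
__shared__ Sphere shSpheres[51];

// One designated thread per block copies the data, so every block
// gets its own shared-memory copy populated.
if (threadIdx.x == 0 && threadIdx.y == 0) {
    for (int i = 0; i < 51; ++i)
        shSpheres[i] = spheres.elements[i];
}
__syncthreads();  // the rest of the block waits for the copy to finish
```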

Best regards,


You might want to investigate having the first 16 threads (or some multiple of 16) in a block do the copying to shared memory collaboratively. If you structure the code correctly, the global memory reads will be coalesced and the overall performance of your kernel will increase considerably (in my codes, coalescing the reads gives anything from 2 to 10 times speedup).
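A rough sketch of what I mean (assuming the same 51-element Sphere array from the thread; the exact thread count per block is up to you):

```cuda
__shared__ Sphere shSpheres[51];

// Threads in the block cooperate on the copy: consecutive threads read
// consecutive elements, so the global memory loads can coalesce.
int tid = threadIdx.y * blockDim.x + threadIdx.x;   // linear thread id within the block
int nthreads = blockDim.x * blockDim.y;

for (int i = tid; i < 51; i += nthreads)
    shSpheres[i] = spheres.elements[i];

__syncthreads();  // make the copy visible to all threads in the block
```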