shared memory and threads question

I tried to do the following on my 8600GT:

If (!(threadIdx.x%8)) is true, I fill a shared memory array with specific data (depending on threadIdx.x) from global memory. I use 32 threads (8 per multiprocessor, grid (1,1,1) and block (32,1,1)), so I should get 4 different arrays in the shared memory of each multiprocessor, shouldn’t I? (According to the programming guide, page 15: “The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.”) But I get the same shared array for each thread. When I try to assign threadIdx.x to the shared array values, I get a strange result which could be interpreted as threads 0, 8, 16 and 24 running on one multiprocessor and competing to write to the same shared memory (for example, with an array of dimension 8, I get 24 in the first 4 elements and 8 in the next 4; I repeat the loop with 4 threads writing in parallel). Am I doing something wrong :ermm: ?

Sample of my kernel code (tid=threadIdx.x):

for (j = 0; j < ceilf((N - i) / 4.0); j++)

    if (colJ < N)

        if (!(tid % 8))

            for (k = 0; k < i; k++) { ds_U[0] = tid; }

    if (!(tid % 8))

        ...

After executing this code I get the same quartets of values in one d_U row.

Where might be the problem?

grid (1,1,1) and block (32,1,1) means that you are only running 1 block with 32 threads in it. All threads within a block share the same shared memory space, hence your results. If you really want 8 threads on each multiprocessor, you need grid(4,1,1) and block(8,1,1).
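To illustrate the difference (a minimal sketch, not the original poster's code — the kernel name, d_out, and the launch shown in the comments are assumed): with grid(4,1,1) and block(8,1,1), each of the four blocks gets its own copy of any __shared__ array, so writes from one block can never show up in another block's array.

```cuda
// Minimal sketch (assumed names): each block fills its OWN shared array,
// then copies it out to a distinct row of global memory.
__global__ void fillShared(int *d_out)
{
    __shared__ int s_data[8];           // one copy per block, not per grid

    s_data[threadIdx.x] = threadIdx.x;  // no tid%8 test needed: 8 threads/block
    __syncthreads();                    // make the writes visible block-wide

    // row = block index, column = thread index
    d_out[blockIdx.x * blockDim.x + threadIdx.x] = s_data[threadIdx.x];
}

// Host-side launch:
//   dim3 grid(4, 1, 1);   // 4 blocks
//   dim3 block(8, 1, 1);  // 8 threads each
//   fillShared<<<grid, block>>>(d_out);
```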

Thank you very much. You can’t imagine how useful this is, and not only for this one project.

But I’ve got one more question: could I somehow obtain the grid index?

threadIdx.(x,y,z) gives index of thread within a block

blockIdx.(x,y,z) gives index of block within a grid

Thanks, but I need the GRID index within the kernel.

a kernel has only 1 grid…


I use grid (4,1,1) as posted earlier by mrAnderson. And each grid is executed on a single multiprocessor (am I right?), so I need the grid index.x during execution.

Not really, the naming is a little bit confusing:

dim3 grid(4,1,1); // means the dimensions of the grid are 4x1x1 so only 1 grid is used, containing 4 blocks.

dim3 block(8,1,1); // means the dimensions of the blocks are 8x1x1 so all 4 blocks contain 8 threads.

Therefore you only need threadIdx and blockIdx.

Thank you for your explanation, now everything is fine.

And one more question:

If my number of blocks equals the number of multiprocessors, will all threads of one block be executed on the same multiprocessor, with all blocks running in parallel, each on its own multiprocessor? Or will the blocks be executed sequentially, each on the whole device?

You should read the CUDA Programming Guide carefully. A block should contain a multiple of 32 threads. One block can never be split among multiprocessors, but one multiprocessor can run several blocks simultaneously. The thing is that multiprocessors are extremely hyperthreaded, and the more threads they’re working on, the more efficient they are. They’ll try to load up as many threads as possible until they run out of registers (a 32 KB register file on G80) or hit the limit of 768 threads per multiprocessor.

In your example of 4 blocks of 8 threads each, it’s possible the gpu loads all of them into a single multiprocessor.

To find out how many registers your kernel is using, add the option -keep to nvcc and dig through the resulting files for the .cubin file.

Also, rtfm