If (!(threadIdx.x % 8)), I fill a shared memory array with data (depending on threadIdx.x) from global memory. I use 32 threads (8 per multiprocessor, grid (1,1,1) and block (32,1,1)), so I should get 4 different arrays in shared memory, one per multiprocessor, shouldn't I? (According to the Programming Guide, page 15: "The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.") But I get the same shared array for every thread. When I try to assign threadIdx.x to the shared array elements, I get a strange result that could be interpreted as threads 0, 8, 16, 24 running on one multiprocessor and competing to write the same shared memory. For example, with an array of dimension 8, I get 24 in the first 4 elements and 8 in the next 4 (I repeat the loop with 4 threads writing in parallel). Am I doing something wrong?
Sample of my kernel code (tid=threadIdx.x):
for (j = 0; j < ceilf((N - i) / 4.0f); j++)
{
    colJ = j * 4 + tid / 8 + i;
    if (colJ < N)
    {
        if (!(tid % 8))
        {
            for (k = 0; k < i; k++) { ds_U[0] = tid; }
        }
        __syncthreads();
        if (!(tid % 8))
        {
            d_U[i + colJ * N] = ds_U[0];
        }
        __syncthreads();
    }
}
After executing this code I get the same quartets of values in one row of d_U.
grid (1,1,1) and block (32,1,1) means that you are only running 1 block with 32 threads in it. All threads of a block share the same shared memory space, hence your results. If you really want 8 threads on each multiprocessor, you need grid (4,1,1) and block (8,1,1).
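A minimal sketch of that configuration, assuming the names d_U and ds_U from the original post; the kernel body and N are placeholders, not the poster's actual code:

```cuda
// Sketch only: 4 blocks of 8 threads, i.e. grid (4,1,1) and block (8,1,1).
// Each block gets its own private copy of ds_U in shared memory.
__global__ void fillKernel(float *d_U, int N)
{
    __shared__ float ds_U[8];                  // private to this block

    int tid = threadIdx.x;                     // 0..7 within the block
    int gid = blockIdx.x * blockDim.x + tid;   // 0..31 across the grid

    if (gid < N)
        ds_U[tid] = (float)gid;  // each block fills its own shared array
    __syncthreads();

    if (gid < N)
        d_U[gid] = ds_U[tid];    // copy back to global memory
}

// Launch with: fillKernel<<<dim3(4,1,1), dim3(8,1,1)>>>(d_U, N);
```

Note that blockIdx.x is what distinguishes the blocks at run time; threadIdx.x alone only distinguishes threads within one block.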
I use grid (4,1,1) as posted earlier by mrAnderson. And each block is executed on a single multiprocessor (am I right?), so I need the block index blockIdx.x at run time.
Thank you for your explanation, now everything is fine.
And one more question:
If my number of blocks equals the number of multiprocessors, will all threads of one block be executed on the same multiprocessor, with all blocks running in parallel, one per multiprocessor? Or will the blocks be executed sequentially, each one across the whole device?
You should read the CUDA Programming Guide carefully. A block should contain a multiple of 32 threads. One block can never be split among multiprocessors, but one multiprocessor can run several blocks simultaneously. The thing is that multiprocessors are extremely hyperthreaded, and the more threads they're working on, the more efficient they are. They'll try to load up as many threads as possible until they run out of registers (a 32 KB register file, i.e. 8192 32-bit registers, per multiprocessor on G80) or hit the limit of 768 threads per multiprocessor.
In your example of 4 blocks of 8 threads each, it's possible the GPU loads all of them onto a single multiprocessor.
To find out how many registers your kernel is using, add the option -keep to nvcc and dig through the resulting files for the .cubin file.