If (!(threadIdx.x % 8)), I fill a shared memory array with data (depending on threadIdx.x) from global memory. I use 32 threads (8 per multiprocessor, grid (1,1,1) and block (32,1,1)), so I should get 4 different arrays in shared memory, one per multiprocessor, shouldn't I? (According to the Programming Guide, page 15: "The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.") But I get the same shared array for every thread. When I try to assign threadIdx.x to the shared array elements, I get a strange result that could be interpreted as threads 0, 8, 16, 24 running on one multiprocessor and competing to write the same shared memory. For example, with an array of dimension 8, I get 24 in the first 4 elements and 8 in the next 4 (I repeat the loop with 4 threads writing in parallel). Am I doing something wrong?
Sample of my kernel code (tid=threadIdx.x):
for (j = 0; j < ceilf((N - i) / 4.0f); j++)
{
    colJ = j * 4 + tid / 8 + i;
    if (colJ < N)
    {
        if (!(tid % 8))
        {
            for (k = 0; k < i; k++) { ds_U[0] = tid; }
        }
        __syncthreads();
        if (!(tid % 8))
        {
            d_U[i + colJ * N] = ds_U[0];
        }
        __syncthreads();
    }
}
After executing this code I get the same quartets of values in one row of d_U.
grid (1,1,1) and block (32,1,1) means that you are only running 1 block with 32 threads in it. All threads of a block share the same shared memory space, hence your results. If you really want 8 threads on each multiprocessor, you need grid (4,1,1) and block (8,1,1).
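A minimal sketch of that configuration, assuming the names d_U and ds_U from the original post; the kernel body and N are placeholders, not the poster's actual code:

```cuda
// Sketch only: 4 blocks of 8 threads, i.e. grid (4,1,1) and block (8,1,1).
// Each block gets its own private copy of ds_U in shared memory.
__global__ void fillKernel(float *d_U, int N)
{
    __shared__ float ds_U[8];                  // private to this block

    int tid = threadIdx.x;                     // 0..7 within the block
    int gid = blockIdx.x * blockDim.x + tid;   // 0..31 across the grid

    if (gid < N)
        ds_U[tid] = (float)gid;  // each block fills its own shared array
    __syncthreads();

    if (gid < N)
        d_U[gid] = ds_U[tid];    // copy back to global memory
}

// Launch with: fillKernel<<<dim3(4,1,1), dim3(8,1,1)>>>(d_U, N);
```

Note that blockIdx.x is what distinguishes the blocks at run time; threadIdx.x alone only distinguishes threads within one block.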
I use grid (4,1,1) as posted earlier by mrAnderson. And each block is executed on a single multiprocessor (am I right?), so I need the block index blockIdx.x at run time.
Thank you for your explanation, now everything is fine.
And one more question:
If my number of blocks equals the number of multiprocessors, will all threads of one block be executed on the same multiprocessor, with all blocks running in parallel, one per multiprocessor? Or will the blocks be executed sequentially, each one across the whole device?
You should read the CUDA Programming Guide carefully. A block should contain a multiple of 32 threads. One block can never be split among multiprocessors, but one multiprocessor can run several blocks simultaneously. The thing is that multiprocessors are extremely hyperthreaded, and the more threads they're working on, the more efficient they are. They'll try to load up as many threads as possible until they run out of registers (a 32 KB register file, i.e. 8192 32-bit registers, per multiprocessor on G80) or hit the limit of 768 threads per multiprocessor.
In your example of 4 blocks of 8 threads each, it's possible the GPU loads all of them onto a single multiprocessor.
To find out how many registers your kernel is using, add the option -keep to nvcc and dig through the resulting files for the .cubin file.