Hi,
I have a simple kernel:
__global__ void fwd_conv_shared( int *d_output, int nData )
{
    // global index of this thread
    int x = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;
    bool valid = ( x < nData );     // guard against out-of-range writes
    if( valid )
    {
        //d_output[ x ] = threadIdx.x;   // used for the threadIdx.x listing below
        d_output[ x ] = blockIdx.x;
    }
}
and when I call it this way:
dim3 blockSize( 8, 1, 1 );
dim3 gridSize( 1, 1, 1 );
fwd_conv_shared<<< gridSize, blockSize >>>( (int *)dDstData, nData );
I obtain in dDstData (nData = 8):
0, 0, 0, 0, 0, 0, 0, 0
and when I store threadIdx.x instead (the commented-out line):
0, 1, 2, 3, 4, 5, 6, 7
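For completeness, the host side is essentially this (a simplified sketch; the allocation and printout in my real code are equivalent, and dDstData is typed int* here for brevity, hence no cast):
int nData = 8;
int *dDstData = 0;
int hDstData[8];
cudaMalloc( (void **)&dDstData, nData * sizeof(int) );
dim3 blockSize( 8, 1, 1 );
dim3 gridSize( 1, 1, 1 );
fwd_conv_shared<<< gridSize, blockSize >>>( dDstData, nData );
cudaMemcpy( hDstData, dDstData, nData * sizeof(int), cudaMemcpyDeviceToHost );
// print hDstData[0] .. hDstData[7]
cudaFree( dDstData );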
OK. But when I call it this way:
dim3 blockSize(8, 1, 1);
dim3 gridSize( 3, 1, 1 ); //!!
fwd_conv_shared<<< gridSize, blockSize>>>( (int *)dDstData, nData );
I’ve got:
threadIdx.x = 0, 1, 2, 3, 4, 0, 1, 2
and:
blockIdx.x = 0, 0, 0, 0, 0, 1, 1, 1
I expected that with a block size of 8, the kernel would always be launched with 8 threads per block.
I do not understand why the work was split across two blocks like this. How can I make CUDA run a fixed number of threads per block?
It is a nightmare when I try to use blockIdx.x to index shared memory.
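For illustration, this is the kind of pattern I mean (a simplified sketch, not my actual kernel; tile_copy, d_in, and d_out are placeholder names):
__global__ void tile_copy( const int *d_in, int *d_out, int nData )
{
    __shared__ int tile[8];       // one slot per thread; assumes blockDim.x == 8
    int x = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;
    if( x < nData )
        tile[ threadIdx.x ] = d_in[ x ];    // each block stages its own tile
    __syncthreads();
    if( x < nData )
        d_out[ x ] = tile[ threadIdx.x ];   // tile offset comes from blockIdx.x
}
If blockIdx.x and threadIdx.x do not have the values I expect, the tiles overlap or miss elements entirely.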
I’ll be very grateful for your answers,
Best regards,
Jakub
// CUDA 4.1, GTX 480