do not understand thread/block division


I have a simple kernel:

__global__ void fwd_conv_shared( int *d_output, int nData )


int x = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x  ;

bool invalid = ( x < nData );


	if( invalid )


		//d_output[ x ] = threadIdx.x;

		d_output[ x ] = blockIdx.x;	




and when I call it this way:

dim3 block(8, 1, 1);

		dim3 gridSize( 1, 1, 1 ); 


		fwd_conv_shared<<< gridSize, block>>>( (int *)dDstData, nData );

I obtain in dDstData (nData = 8):

0, 0, 0, 0, 0, 0, 0, 0

and when list threadIdx.x:

0, 1, 2, 3, 4, 5, 6, 7

OK, but, when I call it this way:

dim3 blockSize(8, 1, 1);

		dim3 gridSize( 3, 1, 1 );  //!!


		fwd_conv_shared<<< gridSize, blockSize>>>( (int *)dDstData, nData );

I’ve got:

threadIdx.x = 0, 1, 2, 3, 4, 0, 1, 2


blockIdx.x = 0, 0, 0, 0, 0, 1, 1, 1

I expected, that when the size of the block is 8, the kernel will be started always on 8 threads in block.

I do not understand, why it was divided into two blocks. How to make CUDA, to run fixed number of threads in a block?

It is nightmare, when I try to use blockIdx.x to index shared memory.

I’ll be very grateful for your answers,

Best regards,


//CUDA 4.1, GTX480

You should define dDstData if the size of the problem. When you submit with 3 blocks with 8 threads each you have 24 threads but ndata is only 8. Also I do not understant this line int x = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x ;

It is ok to use int x = blockIdx.x*blockDim.x + threadIdx.x ;

thank you pasoleatis,

dDstData is allocated as 8 element device linear memory.

__umul24( blockIdx.x, blockDim.x ) + threadIdx.x ;

is an equivalent of

int x = blockIdx.x*blockDim.x + threadIdx.x;

I now, that in the second case I run 24 threads on 8 element data table, however I thought, that all data will be proceeded in first block, and in blocks second and third “invalid” will be “false”.
I don’t know, why BlockDim.x is 5 instead of 8.



Yes you are right there should be no problem with your code. Check the last error from the kernel call.

Just a thought use this

fwd_conv_shared<<< 3,8>>>( (int *)dDstData, nData );

to call the kernel

void checkCUDAError(const char *msg)


    cudaError_t err = cudaGetLastError();

    if( cudaSuccess != err) 


        fprintf(stderr, "Cuda error: %s: %s.\n", msg, 

                             cudaGetErrorString( err) );




// check if kernel execution generated an error

    // Check for any CUDA errors

    checkCUDAError("kernel invocation");

Source: CUDA, Supercomputing for the Masses: Part 3 | Dr Dobb's

I’ checked cuda errors with the “no error” result.

However, I did not paste whole kernel code above.

In fact I have:

__global__ void fwd_conv_shared( int *d_output, int nData )


int shift = 3 * blockIdx.x;

        int x = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x  ;


bool invalid = ( x < nData );

if( invalid )


                //d_output[ x ] = threadIdx.x;

                d_output[ x ] = blockIdx.x;     



But for the first block, shift should be 0, so I did not paste this part.

However, when I remove the part with “shift”, or set shift manually to 0, everything seems to be ok.

The difference with shift is that “invalid” is set after the subtraction of shift. This means that the first three cases for blockIdx.x=1 also will be valid since they have x=5,6,7. In your case, It appears that the block with blockIdx.x=1 is executed after the block with blockIdx.x=0 and thereby overwrites the results of the block with blockIdx.x=0.

By the way the __mul24 function is slower on Fermi then just doing the normal integer multiplication

Thank you very, very much for the explanation. I’ve got one last question.

now I have nData = 8;

I run the kernel:

__global__ void fwd_conv_shared( int *d_output, int nData )


        int x = (int)( blockIdx.x * ( blockDim.x - 3 ) + threadIdx.x );

        d_output[ x ] = blockIdx.x;       


(Each next block overlaps last three elements of the previous block.)

fwd_conv_shared<<< 5, 4>>>( (float *)dDstData, nData );

so I expected to have the result like:

0, 1, 2, 3, 4, 4, 4, 4

but I’ve got:

0, 1, 2, 2, 2, 2, 4, 4

Why the block number 2 overlaps data of the block number 4 if he was executed before the execution of the last block?

I tried to use atomicExch(), __threadfence() and __syncthreads() with the same result.


You expect a specific order of execution for the blocks. In practice there is no order in which the blocks are executed. The atomic operation insures that if an adress is read no other thread can acess that address until it is written back. The thredfence function makes sure the result is visible to the global memory by the other threads, while the synthreads() only synchronises the threads in a given block. None of them imposes an order like block 1 should be done before block 2.

In the programming guide there is an example about how you can make the last block to be executed to perform a certain task on data provided by other blocks, but I do not think it is possible to make an order of eecution such as block 0 is ran, then block 1 and so on.

This code is for reduction like sum. Each block makes the partial sum writes the results to the global memory and the last block to be executed makes the final summation.

__device__ unsigned int count = 0;

__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,

float* result)


// Each block sums a subset of the input array

float partialSum = calculatePartialSum(array, N);

if (threadIdx.x == 0) {

// Thread 0 of each block stores the partial sum

// to global memory

result[blockIdx.x] = partialSum;

// Thread 0 makes sure its result is visible to

// all other threads


// Thread 0 of each block signals that it is done

unsigned int value = atomicInc(&count, gridDim.x);

// Thread 0 of each block determines if its block is

// the last block to be done

isLastBlockDone = (value == (gridDim.x - 1));


// Synchronize to make sure that each thread reads

// the correct value of isLastBlockDone


if (isLastBlockDone) {

// The last block sums the partial sums

// stored in result[0 .. gridDim.x-1]

float totalSum = calculateTotalSum(result);

if (threadIdx.x == 0) {

// Thread 0 of last block stores total sum

// to global memory and resets count so that

// next kernel call works properly

result[0] = totalSum;

count = 0;


Ok, thank you, I understand it now.

The thing I wanted to do to implement my algorithm was something like:

int x = (int)( blockIdx.x * ( blockDim.x - 3 ) + threadIdx.x );

d_output[ x ] += something;

Next block adds something to the last three elements of previous block. It’s something like gaussian forward operator, or any other forward pixel operator, which spreads the pixel value in new coordinate system.

Becouse I get an error I wanted to understand how the blocks are executing.

I tried to call synthreads() and memory synchronization functions becouse I thought, that there is something like read after write error between blocks what was not understandable for me.

It is interesting, that in code above atomicAdd instead += works correctly, however I do not want to use atomic functions because they can kill the performance.

In the case above you have no option because you have a race condition. You either use atomiAdd if there are not many additions or you need some other way to make the summation of the partial results, like defining an array of the size of number of blocks so each block will save the partial result to a global temp array.