Register bound kernel help with thread block size

Hi -

I have a kernel that uses 33 registers per thread, as reported in the cubin file. I’m trying to write code that computes a proper block size (to the nearest multiple of 16) and grid size for a given compute device architecture.



// block size (num threads per block): round to nearest multiple of 32

// grid size (num blocks) : ceil( nNumNarrowBand / nBlockSize )
unsigned int nGridSize = iDivUp( nNumNarrowBand, nBlockSize );

const dim3 blockSize( nBlockSize,1,1 );
const dim3 gridSize( nGridSize,1,1 );

for( unsigned int i=0;i<nIterations;i++ )
{
	computeStep<<< gridSize, blockSize >>>(....);
	CUT_CHECK_ERROR("computeStep kernel failed");
}

The above code generates nBlockSize = 224 (7 warps). The kernel executes with no reported errors, but the results are incorrect (it seems the output array is never written to).

If I manually set nBlockSize = 192 (6 warps), it produces the correct results. I assume I’m missing something about how the number of registers or the number of active warps limits the thread block size? Does a block size of 192 imply 42 registers per thread (8192/192 = 42.666)?


Usually having a blocksize that’s a multiple of 64 is better (reduces register bank conflicts).

Anyway, for the exact formula that you’re looking for, look inside the occupancy calculator.
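For what it's worth, here is a rough host-side sketch of the kind of calculation the occupancy calculator does for registers. Everything in `kAssumedG80` is an assumption on my part for compute capability 1.0/1.1 hardware (8192 registers per multiprocessor, warps per block rounded up to a multiple of 2, registers per block rounded up to a multiple of 256), and the names `RegLimits`, `regsPerBlock` and `maxBlockSize` are made up for illustration. Check the numbers against the spreadsheet for your device before relying on them.

```cpp
#include <cassert>

// ASSUMED limits for a compute capability 1.0/1.1 device. These values are
// assumptions for illustration; verify them in the occupancy calculator.
struct RegLimits {
    unsigned regsPerSM;      // register file size per multiprocessor
    unsigned warpSize;       // threads per warp
    unsigned warpAllocGran;  // warps per block are rounded up to this multiple
    unsigned regAllocUnit;   // registers per block are rounded up to this multiple
};

static const RegLimits kAssumedG80 = { 8192, 32, 2, 256 };

// Round value up to the next multiple of `multiple`.
static unsigned roundUp(unsigned value, unsigned multiple) {
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Registers that one block would reserve under the assumed allocation model.
unsigned regsPerBlock(unsigned blockSize, unsigned regsPerThread,
                      const RegLimits& lim) {
    unsigned warps = roundUp(blockSize, lim.warpSize) / lim.warpSize;
    unsigned allocWarps = roundUp(warps, lim.warpAllocGran);
    return roundUp(allocWarps * lim.warpSize * regsPerThread, lim.regAllocUnit);
}

// Largest block size that is a multiple of `step`, does not exceed
// `maxThreads`, and fits in the register file. Returns 0 if none fits.
unsigned maxBlockSize(unsigned regsPerThread, unsigned step,
                      unsigned maxThreads, const RegLimits& lim) {
    if (step == 0) return 0;
    for (unsigned bs = (maxThreads / step) * step; bs >= step; bs -= step) {
        if (regsPerBlock(bs, regsPerThread, lim) <= lim.regsPerSM) return bs;
    }
    return 0;
}
```

Under these assumed limits, a 33-register kernel with 224 threads per block would need 8448 registers (7 warps are counted as 8), which is more than 8192, while 192 threads would need 6400. So `maxBlockSize(33, 32, 512, kAssumedG80)` gives 192, which would match what you are seeing.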

P.S. To get a real error and not a “results are incorrect,” run your code in Debug mode. The CUT_CHECK_ERROR macro doesn’t work otherwise. (Won’t someone please fix this already??)

I think you don’t have enough registers for 7 warps.

The programming guide says that the maximum number of registers is 8192, but in my own experiments I have never been able to use more than 7296 registers. I don’t know why.

That means 7296/224 ≈ 32 registers per thread, but you need 33 registers per thread, right?
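To make that arithmetic easy to repeat for other block sizes, here is a small host-side sketch. The helper names `regsPerThreadBudget` and `blockFits` are made up for illustration, and the 7296 figure is only the limit observed in the experiments mentioned above, not a documented value.

```cpp
#include <cassert>

// Whole registers available to each thread if `usableRegs` registers are
// shared evenly by one block of `blockSize` threads (integer division).
unsigned regsPerThreadBudget(unsigned usableRegs, unsigned blockSize) {
    assert(blockSize > 0);
    return usableRegs / blockSize;
}

// True if a kernel needing `regsPerThread` registers fits within that budget.
bool blockFits(unsigned usableRegs, unsigned blockSize, unsigned regsPerThread) {
    return regsPerThreadBudget(usableRegs, blockSize) >= regsPerThread;
}
```

With 7296 usable registers, 224 threads get 32 registers each, so a 33-register kernel does not fit, while 192 threads get 38 each and it does. With the nominal 8192, the same division gives 36 registers per thread for 224 threads, so plain division by 8192 would not predict the failure.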

You can use this statement to check for errors:

printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));