Kernel function doesn't launch with block size >16 (block sizes of 4, 8, and 16 launch fine)

aaroneous88 · July 28, 2008, 3:12am

Hi, I’m relatively new to CUDA and am struggling with a rather puzzling problem. I am setting up a kernel function launch like the following:

#define BLOCK_SIZE 16

dim3 neighborlist_dimBlock( BLOCK_SIZE, BLOCK_SIZE );

int neighborlist_dimBlock_w = num_particles/BLOCK_SIZE + ( !(num_particles%BLOCK_SIZE)?0:1);
int neighborlist_dimBlock_h = num_particles/BLOCK_SIZE + ( !(num_particles%BLOCK_SIZE)?0:1);
dim3 neighborlist_dimGrid( neighborlist_dimBlock_w, neighborlist_dimBlock_h );

printf( "neighborlist_dimBlock: %i, %i, %i\n", neighborlist_dimBlock.x, neighborlist_dimBlock.y, neighborlist_dimBlock.z );
printf( "neighborlist dimGrid: %i, %i, %i\n", neighborlist_dimGrid.x, neighborlist_dimGrid.y, neighborlist_dimGrid.z );

update_neighborlist<<< neighborlist_dimGrid, neighborlist_dimBlock >>>( particlesD, neighborlistD );
CUT_CHECK_ERROR( "update_neighborlist" );

And my update_neighborlist kernel function begins like this:

__global__ void update_neighborlist( particle *particlesD, float *neighborlistD )
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;

#ifdef __DEVICE_EMULATION__
    printf( "update_neighborlist %i, %i\n", i, j );
#endif

If BLOCK_SIZE is set to 16, everything works great and the kernel function prints out its index (in emulation mode, of course). However, if BLOCK_SIZE is set to 32, 64, or even 512, nothing is printed and no error is given. What could be the issue here? Am I missing something really stupid? If it helps, the output from the initial printf statements with BLOCK_SIZE 32 is:

neighborlist_dimBlock: 64, 64, 1
neighborlist dimGrid: 8, 8, 1

Thank you,

Aaron Thompson

gatoatigrado · July 28, 2008, 3:36am

Yeah, there’s a max of 512 threads per block, and a 32x32 block has 1024 threads, so the launch fails. Try some of the CUDA error checking functions, like cudaGetLastError and cudaGetErrorString.

You’ll need to split the work across more, smaller blocks, or make each thread do more work.
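The arithmetic behind this reply can be sketched in plain C (which also compiles as CUDA host code). This is only an illustration: the helper names threads_per_block, block_fits, and ceil_div are hypothetical, and the limit of 512 is the figure quoted in this reply. The real limit depends on the GPU; it could be read at run time from the maxThreadsPerBlock field filled in by cudaGetDeviceProperties.

```c
/* Assumption: 512 is the per-block thread limit quoted in the reply above.
   The actual limit is device dependent. */
#define MAX_THREADS_PER_BLOCK 512u

/* Total number of threads in a block of shape x * y * z. */
unsigned int threads_per_block(unsigned int x, unsigned int y, unsigned int z)
{
    return x * y * z;
}

/* Returns 1 if the block shape stays within the assumed limit, 0 otherwise. */
int block_fits(unsigned int x, unsigned int y, unsigned int z)
{
    return threads_per_block(x, y, z) <= MAX_THREADS_PER_BLOCK;
}

/* Number of blocks of edge length block_size needed to cover n items,
   rounding up so a partial last block is still counted. */
unsigned int ceil_div(unsigned int n, unsigned int block_size)
{
    return (n + block_size - 1) / block_size;
}
```

With these helpers, block_fits(16, 16, 1) is true (256 threads) while block_fits(32, 32, 1) is false (1024 threads). Keeping the 16x16 block and letting ceil_div(num_particles, 16) set each grid dimension covers the same particles with more, smaller blocks.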

aaroneous88 · July 28, 2008, 1:40pm

Ok, that was it! I feel stupid now. It turns out that my CUT_CHECK_ERROR calls weren’t doing anything because _DEBUG wasn’t defined, so the macro in cutil.h was compiled out and the launch failure was never reported.
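The general pitfall, a check macro that silently expands to nothing in non-debug builds, can be shown with a small plain C sketch. This is not the code from cutil.h: DEBUG_ONLY_CHECK, ALWAYS_CHECK, report_if_failed, and reported_failures are all hypothetical names, and an integer status stands in for a real error code.

```c
#include <stdio.h>

/* Hypothetical counter so the effect of each macro can be observed. */
int reported_failures = 0;

/* Hypothetical reporter: a nonzero status means the operation failed. */
void report_if_failed(int status, const char *what)
{
    if (status != 0) {
        fprintf(stderr, "%s failed with status %d\n", what, status);
        reported_failures++;
    }
}

/* Debug-only check: expands to nothing unless _DEBUG is defined,
   so failures go unreported in a normal build. */
#ifdef _DEBUG
#define DEBUG_ONLY_CHECK(status, what) report_if_failed((status), (what))
#else
#define DEBUG_ONLY_CHECK(status, what) ((void)0)
#endif

/* Always-on check: reports failures in every build configuration. */
#define ALWAYS_CHECK(status, what) report_if_failed((status), (what))
```

Built without _DEBUG, DEBUG_ONLY_CHECK(1, "launch") does nothing, while ALWAYS_CHECK(1, "launch") prints a message and increments the counter. In CUDA code, the always-on equivalent would be to test the value returned by cudaGetLastError after the launch and print it with cudaGetErrorString.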

Thanks for helping a noob,

Aaron