Kernel function doesn't launch with block size >16
Block sizes of 4, 8, and 16 launch fine

Hi, I’m relatively new to CUDA and am struggling with a rather puzzling problem. I am setting up a kernel function launch like the following:

#define BLOCK_SIZE 16

// Threads per block: BLOCK_SIZE x BLOCK_SIZE
dim3 neighborlist_dimBlock( BLOCK_SIZE, BLOCK_SIZE );

// Blocks per grid dimension: num_particles / BLOCK_SIZE, rounded up
int neighborlist_dimBlock_w = num_particles/BLOCK_SIZE + ( (num_particles%BLOCK_SIZE) ? 1 : 0 );
int neighborlist_dimBlock_h = num_particles/BLOCK_SIZE + ( (num_particles%BLOCK_SIZE) ? 1 : 0 );
dim3 neighborlist_dimGrid( neighborlist_dimBlock_w, neighborlist_dimBlock_h );

printf( "neighborlist_dimBlock: %u, %u, %u\n", neighborlist_dimBlock.x, neighborlist_dimBlock.y, neighborlist_dimBlock.z );
printf( "neighborlist dimGrid: %u, %u, %u\n", neighborlist_dimGrid.x, neighborlist_dimGrid.y, neighborlist_dimGrid.z );

update_neighborlist<<< neighborlist_dimGrid, neighborlist_dimBlock >>>( particlesD, neighborlistD );
CUT_CHECK_ERROR( "update_neighborlist" );

And my update_neighborlist kernel function begins like this:

__global__ void update_neighborlist( particle *particlesD, float *neighborlistD )
{
    // Global 2D index of this thread across the whole grid
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;

    printf( "update_neighborlist %u, %u\n", i, j );
    // ...
}


If BLOCK_SIZE is set to 16, everything works great and the kernel function prints out its index (in emulation mode, of course). However, if BLOCK_SIZE is set to 32, 64, or even 512, nothing is printed and no error is given. What could be the issue here? Am I missing something really stupid? If it helps, the output from the initial printf statements with BLOCK_SIZE 32 is:

neighborlist_dimBlock: 64, 64, 1

neighborlist dimGrid: 8, 8, 1

Thank you,

Aaron Thompson

Yeah, there’s a max of 512 threads per block, and a 32x32 block is 1024 threads, so the launch fails. Try the CUDA error-checking functions, like cudaGetLastError and cudaGetErrorString; they will report the failed launch.
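For later readers, here is a minimal sketch of that check. It assumes device 0 and uses a hypothetical empty kernel (dummy_kernel) in place of update_neighborlist; block_fits and try_launch are names made up for this example. It queries the device's own limit instead of hard-coding 512, then asks the runtime whether the launch was accepted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only to exercise the launch configuration.
__global__ void dummy_kernel() {}

// True if a block of these dimensions stays within the per-block thread limit.
bool block_fits(dim3 block, int max_threads_per_block)
{
    return block.x * block.y * block.z <= (unsigned int)max_threads_per_block;
}

// Launches dummy_kernel and returns the error status of the launch itself.
cudaError_t try_launch(dim3 grid, dim3 block)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumes device 0
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return err;
    }
    printf("maxThreadsPerBlock: %d, block fits: %s\n", prop.maxThreadsPerBlock,
           block_fits(block, prop.maxThreadsPerBlock) ? "yes" : "no");

    dummy_kernel<<<grid, block>>>();

    // An invalid configuration does not abort the program; it only sets an
    // error code, which stays invisible unless it is queried here.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

Calling try_launch(dim3(8, 8), dim3(64, 64)) asks for 4096 threads per block, which is over the 512 limit mentioned above, so cudaGetLastError should return something other than cudaSuccess.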

You’ll need to keep each block at or under that limit and cover the problem with more blocks, or make each thread do more work.
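Both options can be sketched together, under some assumptions: the block stays at 16x16 (256 threads), the grid is sized by rounding up, and each thread loops with a grid-sized stride so that a capped grid still covers every (i, j) pair. The kernel visit_pairs, the visited array, and the helpers ceil_div and make_grid are hypothetical names for illustration, not part of the original code.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16   // 16 x 16 = 256 threads per block, under the 512 limit

// Hypothetical stand-in for the per-pair work: marks each (i, j) pair once.
// visited must hold num_particles * num_particles entries.
__global__ void visit_pairs(unsigned int num_particles, unsigned int *visited)
{
    // Each thread starts at its global index and advances by the total number
    // of threads in that grid dimension, so fewer blocks means more work each.
    for (unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
         j < num_particles; j += gridDim.y * blockDim.y) {
        for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < num_particles; i += gridDim.x * blockDim.x) {
            visited[j * num_particles + i] = 1;
        }
    }
}

// Integer division rounded up.
unsigned int ceil_div(unsigned int n, unsigned int d)
{
    return (n + d - 1) / d;
}

// Square grid covering num_particles per dimension, capped at max_blocks.
dim3 make_grid(unsigned int num_particles, unsigned int max_blocks)
{
    unsigned int blocks = ceil_div(num_particles, BLOCK_SIZE);
    if (blocks > max_blocks) blocks = max_blocks;
    return dim3(blocks, blocks);
}
```

A launch would then look like visit_pairs<<< make_grid(n, 64), dim3(BLOCK_SIZE, BLOCK_SIZE) >>>(n, visitedD), where visitedD is an assumed device buffer. The bounds checks in the loops also handle the case where n is not a multiple of BLOCK_SIZE.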

Ok, that was it! I feel stupid now. It turns out that the CUT_CHECK_ERROR calls I was using weren’t being used because _DEBUG wasn’t defined in cutil.h.
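One way to avoid that trap is a check that does not depend on _DEBUG at all. The sketch below is an assumption about how such a check could look, not the cutil implementation; cuda_ok and CHECK_LAUNCH are hypothetical names.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: prints the call site and error text on failure.
// Returns true when err is cudaSuccess.
inline bool cuda_ok(cudaError_t err, const char *what, const char *file, int line)
{
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s(%d): %s failed: %s\n", file, line, what,
            cudaGetErrorString(err));
    return false;
}

// Hypothetical macro: always compiled in, regardless of _DEBUG.
// Place it directly after a kernel launch.
#define CHECK_LAUNCH(what)                                              \
    do {                                                                \
        if (!cuda_ok(cudaGetLastError(), (what), __FILE__, __LINE__))   \
            exit(EXIT_FAILURE);                                         \
    } while (0)
```

Used as CHECK_LAUNCH( "update_neighborlist" ); on the line after the launch, it reports a rejected launch in release builds as well as debug builds.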

Thanks for helping a noob,