Hi, I’m relatively new to CUDA and am struggling with a rather puzzling problem. I am setting up a kernel function launch like the following:
#define BLOCK_SIZE 16
dim3 neighborlist_dimBlock( BLOCK_SIZE, BLOCK_SIZE );
// Number of blocks per grid dimension: num_particles / BLOCK_SIZE, rounded up
int neighborlist_dimGrid_w = ( num_particles + BLOCK_SIZE - 1 ) / BLOCK_SIZE;
int neighborlist_dimGrid_h = ( num_particles + BLOCK_SIZE - 1 ) / BLOCK_SIZE;
dim3 neighborlist_dimGrid( neighborlist_dimGrid_w, neighborlist_dimGrid_h );
printf( "neighborlist_dimBlock: %u, %u, %u\n", neighborlist_dimBlock.x, neighborlist_dimBlock.y, neighborlist_dimBlock.z );
printf( "neighborlist dimGrid: %u, %u, %u\n", neighborlist_dimGrid.x, neighborlist_dimGrid.y, neighborlist_dimGrid.z );
update_neighborlist<<< neighborlist_dimGrid, neighborlist_dimBlock >>>( particlesD, neighborlistD );
CUT_CHECK_ERROR( "update_neighborlist" );
And my update_neighborlist kernel function begins like this:
__global__ void update_neighborlist( particle *particlesD, float *neighborlistD )
{
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
#ifdef __DEVICE_EMULATION__
printf( "update_neighborlist %u, %u\n", i, j );
#endif
If BLOCK_SIZE is set to 16, everything works great and the kernel function prints out its index (in emulation mode, of course). However, if BLOCK_SIZE is set to 32, 64, or even 512, nothing is printed and no error is given. What could be the issue here? Am I missing something really stupid? If it helps, the output from the initial printf statements with BLOCK_SIZE 32 is:
neighborlist_dimBlock: 64, 64, 1
neighborlist dimGrid: 8, 8, 1
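For reference, here is a small host-side sanity check of the launch arithmetic above. It is only a sketch: the helper names and the particle count of 500 are hypothetical, and the limit of 512 threads per block is an assumption that holds for compute capability 1.x hardware (newer devices allow 1024; the real value can be read from cudaDeviceProp::maxThreadsPerBlock). Under that assumption, a BLOCK_SIZE x BLOCK_SIZE block only fits when BLOCK_SIZE is 16, which may be related to what I am seeing.

```c
#include <stdio.h>

/* ASSUMPTION: 512 threads per block, as on compute capability 1.x devices.
   On real hardware, read cudaDeviceProp::maxThreadsPerBlock instead. */
#define ASSUMED_MAX_THREADS_PER_BLOCK 512u

/* Total threads in a square block of block_size x block_size. */
static unsigned int threads_per_block( unsigned int block_size )
{
    return block_size * block_size;
}

/* Blocks needed to cover n items in one dimension (ceiling division). */
static unsigned int blocks_needed( unsigned int n, unsigned int block_size )
{
    return ( n + block_size - 1 ) / block_size;
}

/* Returns 1 if the square block fits within the assumed limit, 0 otherwise. */
static int block_fits( unsigned int block_size )
{
    return threads_per_block( block_size ) <= ASSUMED_MAX_THREADS_PER_BLOCK;
}

/* Prints the launch configuration for each BLOCK_SIZE mentioned in the post.
   num_particles is a hypothetical value supplied by the caller. */
static void report( unsigned int num_particles )
{
    const unsigned int sizes[] = { 16, 32, 64, 512 };
    unsigned int k;
    for( k = 0; k < sizeof( sizes ) / sizeof( sizes[0] ); ++k )
    {
        unsigned int bs = sizes[k];
        printf( "BLOCK_SIZE %u: %u threads per block, grid %u x %u, %s\n",
                bs, threads_per_block( bs ),
                blocks_needed( num_particles, bs ),
                blocks_needed( num_particles, bs ),
                block_fits( bs ) ? "fits" : "exceeds assumed limit" );
    }
}
```

Calling report( 500 ) from main lists each configuration. If the block size is indeed the cause, calling cudaGetLastError() right after the launch should report an invalid configuration; as far as I can tell, CUT_CHECK_ERROR only performs its check in debug builds, which might explain why no error is shown.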
Thank you,
Aaron Thompson