When I run my kernel,

integrate_kernel<<<dimGrid, dimBlock>>>(foo,bar);


int gridx = 16;

int gridy = 16;

int blockx = 16;

int blocky = 16;

dim3 dimGrid(gridx, gridy, 1);

dim3 dimBlock(blockx, blocky, 1);

I get no problems, but if I use 32 instead of 16 throughout, I get an error, which I found by calling ‘cudaGetLastError()’ immediately after the kernel call. The cudaError is ‘cudaErrorInvalidConfiguration’, and the string is “invalid configuration argument”.

I can see that I’m getting a much faster runtime using 32, but I’m concerned about this (non-halting) error. I didn’t think I was using that many threads here. Any thoughts?

32*32 = 1024 threads per block. If you look in the programming guide you will see that the maximum number of threads per block on your device is 512, so a 32x32 block is not a valid launch configuration. That is why you get the error.
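If you would rather not rely on the table in the programming guide, you can ask the runtime for the limit. This is only a rough sketch: it assumes device 0 is the card you are running on, and threadsPerBlock is a helper name I made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: total threads in one block is the product
// of the three block dimensions.
static unsigned int threadsPerBlock(dim3 block) {
    return block.x * block.y * block.z;
}

int main() {
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU the kernel runs on.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    dim3 dimBlock(32, 32, 1);
    unsigned int requested = threadsPerBlock(dimBlock);

    printf("requested %u threads per block, device limit is %d\n",
           requested, prop.maxThreadsPerBlock);

    if (requested > (unsigned int)prop.maxThreadsPerBlock) {
        printf("this block size exceeds the limit, so the launch is expected to be rejected\n");
    }
    return 0;
}
```

With 16x16 you request 256 threads per block, which fits under a 512 limit. With 32x32 you request 1024, which does not.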

The reason it is faster is that the kernel has not run at all: the launch is rejected, so no work is done.
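You can convince yourself of this with a small test. The sketch below is not your code: mark_kernel is a made-up stand-in for integrate_kernel that just sets a flag, and launchFailed is a hypothetical helper. If the launch is rejected, the flag should still be 0 when you copy it back.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for integrate_kernel: every thread writes the same flag.
__global__ void mark_kernel(int *flag) {
    *flag = 1;
}

// Hypothetical helper: true if the runtime reported any error.
static bool launchFailed(cudaError_t err) {
    return err != cudaSuccess;
}

int main() {
    int *d_flag = NULL;
    int h_flag = 0;
    if (launchFailed(cudaMalloc((void **)&d_flag, sizeof(int)))) return 1;
    if (launchFailed(cudaMemset(d_flag, 0, sizeof(int)))) return 1;

    dim3 dimGrid(32, 32, 1);
    dim3 dimBlock(32, 32, 1);  // 1024 threads per block

    mark_kernel<<<dimGrid, dimBlock>>>(d_flag);

    // Configuration errors are reported here, right after the launch.
    cudaError_t launchErr = cudaGetLastError();
    if (launchFailed(launchErr)) {
        printf("launch rejected: %s\n", cudaGetErrorString(launchErr));
    }

    // Wait for the kernel (if it was launched) before reading the flag.
    cudaDeviceSynchronize();
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("flag = %d (1 means the kernel actually ran)\n", h_flag);

    cudaFree(d_flag);
    return 0;
}
```

So only compare timings for runs where cudaGetLastError() returns cudaSuccess; otherwise you are timing a launch that did nothing.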