Invalid Configuration Argument

Hello all, I’m studying CUDA and trying to optimize some test code, and I’ve reached a point where I’m clearly missing something.
I have a GTX 1060, and according to deviceQuery these are the thread and block limits:

  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)

So I expected code like this to work:

#define POP_SIZE 160
#define LEN_SIZE 20

...
                
		dim3 threadsPerBlock(32,32);
		dim3 numBlocks(POP_SIZE/threadsPerBlock.x, LEN_SIZE/threadsPerBlock.y);
		fitness<<<numBlocks, threadsPerBlock>>>(d_dest, d_pop);
		CHECK(cudaPeekAtLastError());
		CHECK(cudaMemcpy(h_pop, d_pop, sizeof(individual)*POP_SIZE, cudaMemcpyDeviceToHost));
...

Even though 32x32 = 1024 threads per block, I’m getting the “invalid configuration argument” error. Everywhere I looked in the CUDA Programming Guide, the maximum number of threads per block is given as 1024; page 9 is pretty clear about it:

There is a limit to the number of threads per block, since all threads of a block are
expected to reside on the same processor core and must share the limited memory
resources of that core. On current GPUs, a thread block may contain up to 1024 threads.

If, however, I change threadsPerBlock to (32,20), it works flawlessly. So what gives? What am I misunderstanding about kernel launch sizes?

If you’re interested, here’s the kernel:

__global__ void fitness(char s_dest[LEN_SIZE], individual *pop)
{
	unsigned int pop_idx = threadIdx.x + blockDim.x * blockIdx.x;
	unsigned int str_idx = threadIdx.y + blockDim.y * blockIdx.y;
	individual *ind = NULL;
	if(pop_idx < POP_SIZE && str_idx < LEN_SIZE)
	{
		ind = &pop[pop_idx];
		unsigned int l_fit = abs( (int)ind->s[str_idx] - (int)s_dest[str_idx]);
		atomicAdd(&ind->fitness, l_fit);
	}
}

I don’t like the atomicAdd() there, but that’s a topic for another thread.

Thank you all.

In your case

LEN_SIZE/threadsPerBlock.y

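(= 20/32) truncates to zero under integer division, which is not a valid value for a grid dimension (every dimension needs to be > 0). With threadsPerBlock set to (32,20) the same expression is 20/20 = 1, which is why that launch works.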
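
A common way to avoid a zero grid dimension is to round the division up instead of truncating it. The snippet below is only a minimal sketch of that idea: ceil_div, BLOCK_X and BLOCK_Y are hypothetical names introduced here (they are not part of the CUDA runtime or of the original code), and the launch itself is shown as a comment so the arithmetic can be checked with any C++11 compiler.

```cpp
// Host-side arithmetic only; this compiles as plain C++ and inside a .cu file alike.
#define POP_SIZE 160
#define LEN_SIZE 20

// Assumed block shape, matching the 32x32 launch from the question.
constexpr unsigned int BLOCK_X = 32;
constexpr unsigned int BLOCK_Y = 32;

// Hypothetical helper (not a CUDA API): integer division rounded up,
// so a partially filled tile still gets a block of its own.
constexpr unsigned int ceil_div(unsigned int n, unsigned int block)
{
    return (n + block - 1) / block;
}

// Truncating division, as in the original launch: the y dimension becomes 0.
static_assert(POP_SIZE / BLOCK_X == 5, "x: 160 / 32 gives 5 blocks");
static_assert(LEN_SIZE / BLOCK_Y == 0, "y: 20 / 32 truncates to zero");

// Rounded-up division: every grid dimension is at least 1.
static_assert(ceil_div(POP_SIZE, BLOCK_X) == 5, "x: still 5 blocks");
static_assert(ceil_div(LEN_SIZE, BLOCK_Y) == 1, "y: 1 block instead of 0");

// In the .cu file the launch would then read:
//   dim3 threadsPerBlock(BLOCK_X, BLOCK_Y);
//   dim3 numBlocks(ceil_div(POP_SIZE, BLOCK_X), ceil_div(LEN_SIZE, BLOCK_Y));
//   fitness<<<numBlocks, threadsPerBlock>>>(d_dest, d_pop);
```

Rounding up launches more threads than there are elements (here 32 rows in y for a LEN_SIZE of 20), which is safe in this case because the kernel already guards with pop_idx < POP_SIZE && str_idx < LEN_SIZE.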

Lord, thank you! I was so fixated on the thread count that I forgot about this.