Hello all, I’m studying CUDA and trying to optimize some test code, and I’ve reached a point where I’m clearly missing something.

I have a GTX 1060, and according to deviceQuery these are the thread and block limits:

```
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
```

So I expected code like this to work:

```
#define POP_SIZE 160
#define LEN_SIZE 20
...
dim3 threadsPerBlock(32,32);
dim3 numBlocks(POP_SIZE/threadsPerBlock.x, LEN_SIZE/threadsPerBlock.y);
fitness<<<numBlocks, threadsPerBlock>>>(d_dest, d_pop);
CHECK(cudaPeekAtLastError());
CHECK(cudaMemcpy(h_pop, d_pop, sizeof(individual)*POP_SIZE, cudaMemcpyDeviceToHost));
...
```

Even though 32x32 = 1024 threads per block, which is within the limit, I’m getting an “invalid configuration” error. Everywhere I looked in the CUDA Programming Guide, the maximum number of threads per block is given as 1024. Page 9 states it pretty clearly:

```
There is a limit to the number of threads per block, since all threads of a block are
expected to reside on the same processor core and must share the limited memory
resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
```

If, however, I change the dimensions of threadsPerBlock to (32,20), it works flawlessly. So what gives? What am I misunderstanding about kernel launch sizes?
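One thing that can be checked on the host, independent of the block-size limit, is the grid arithmetic itself. The sketch below is only an illustration: `blocks_truncated` and `blocks_rounded_up` are hypothetical helper names (not CUDA API calls) that reproduce the integer division used to build `numBlocks` in the launch code above, next to the usual round-up alternative.

```cpp
#include <cassert>

// Hypothetical helpers, not part of the CUDA API: they only reproduce
// the integer arithmetic used to size the grid on the host.

// Block count from plain integer division, as in the launch code above.
unsigned int blocks_truncated(unsigned int n, unsigned int threads)
{
    assert(threads > 0);
    return n / threads;  // rounds down; yields 0 when n < threads
}

// Block count from ceiling division, so a partially filled block still counts.
unsigned int blocks_rounded_up(unsigned int n, unsigned int threads)
{
    assert(threads > 0);
    return (n + threads - 1) / threads;
}
```

With POP_SIZE = 160 and LEN_SIZE = 20, a (32,32) block gives `blocks_truncated(160, 32) = 5` and `blocks_truncated(20, 32) = 0`, so the grid comes out as (5, 0). A grid dimension of 0 is rejected as an invalid launch configuration, no matter how many threads the block has. With a (32,20) block the same arithmetic gives (5, 1), which would explain why that launch succeeds. Rounding up gives (5, 1) for the (32,32) block as well, and since the kernel already guards with `pop_idx < POP_SIZE && str_idx < LEN_SIZE`, the extra threads in the partial block should simply do nothing.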

If it’s of interest, here’s the kernel:

```
__global__ void fitness(char s_dest[LEN_SIZE], individual *pop)
{
    unsigned int pop_idx = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int str_idx = threadIdx.y + blockDim.y * blockIdx.y;
    individual *ind = NULL;
    if (pop_idx < POP_SIZE && str_idx < LEN_SIZE)
    {
        ind = &pop[pop_idx];
        unsigned int l_fit = abs((int)ind->s[str_idx] - (int)s_dest[str_idx]);
        atomicAdd(&ind->fitness, l_fit);
    }
}
```

I don’t like the atomicAdd() there, but that’s a matter for another topic.

Thank you all.