A very simple question, I believe:
The output of deviceQuery from the CUDA samples shows that I have these maximum sizes:

Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

I know that I can “comfortably” work with this configuration:

dim3 blocks(65535);
dim3 threads(1024);
kernel<<<blocks, threads>>>();

However, if I’m guessing correctly, I cannot work with something like this:

Because I have a maximum of 1024 threads per block, and I’m actually requesting 1024 threads in each dimension (1024 x 1024 threads per block in total), is this correct?

You are right: a 1024 x 1024 block is not valid. The total number of threads per block must not exceed 1024; the number of blocks you chose is fine. If you use dim3 threads(1024,1024), the launch fails and the kernel is not executed. You could use, for example, dim3 threads(32,32), which gives exactly 1024 threads per block.

If you have dim3 threads(tx,ty,tz), the following rules apply: tx*ty*tz <= 1024, tx <= 1024, ty <= 1024, tz <= 64.
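
The rules above can be sketched as a small host-side check. This is only an illustration: the struct BlockLimits and the function isValidBlock are hypothetical names, and the limits are hard-coded from the deviceQuery output in the question. In a real program they would more likely be read from cudaGetDeviceProperties (fields maxThreadsPerBlock and maxThreadsDim). The code is plain C++, so it should compile as is or inside a .cu file.

```cpp
#include <cstdint>

// Per-block limits copied from the deviceQuery output in the question.
// Assumption: hard-coded here for illustration; other GPUs may differ.
struct BlockLimits {
    std::uint64_t maxThreadsPerBlock = 1024;
    std::uint64_t maxDimX = 1024;
    std::uint64_t maxDimY = 1024;
    std::uint64_t maxDimZ = 64;
};

// Hypothetical helper: returns true if a block of tx x ty x tz threads
// respects every per-dimension limit and the total-threads limit.
bool isValidBlock(std::uint64_t tx, std::uint64_t ty, std::uint64_t tz,
                  const BlockLimits& lim = BlockLimits{}) {
    // Every dimension must be at least 1.
    if (tx == 0 || ty == 0 || tz == 0) return false;
    // Per-dimension limits: tx <= 1024, ty <= 1024, tz <= 64.
    if (tx > lim.maxDimX || ty > lim.maxDimY || tz > lim.maxDimZ) return false;
    // Total limit: tx * ty * tz <= 1024.
    return tx * ty * tz <= lim.maxThreadsPerBlock;
}
```

With these limits, isValidBlock(1024, 1, 1) and isValidBlock(32, 32, 1) return true, while isValidBlock(1024, 1024, 1) returns false: each dimension is within its own limit, but the product exceeds 1024.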