A very simple question, I believe:
The output of deviceQuery from the CUDA samples shows that I have these maximum sizes:
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
I know that I can “comfortably” work with these configs:
However, if I’m guessing correctly, I cannot work with something like this:
It is not correct to launch a block of 1024x1024 threads. The total number of threads per block must not exceed 1024; the number of blocks you chose is fine. If you use dim3 threads(1024,1024), the kernel will not be executed. You could use, for example, dim3 threads(32,32).
If you have dim3 threads(tx,ty,tz), the following rules apply: tx * ty * tz <= 1024, tx <= 1024, ty <= 1024, tz <= 64.
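To make the rules above concrete, here is a minimal host-side sketch that checks a block shape before you launch. The names `BlockLimits` and `is_valid_block` are hypothetical helpers, not part of the CUDA API, and the limits are simply the values from the deviceQuery output quoted in the question; on another device you would read them from `cudaGetDeviceProperties` instead.

```cpp
#include <cstdint>

// Limits copied from the deviceQuery output in the question
// (assumption: other devices may report different values).
struct BlockLimits {
    std::uint64_t max_threads_per_block = 1024;
    std::uint64_t max_x = 1024;
    std::uint64_t max_y = 1024;
    std::uint64_t max_z = 64;
};

// Hypothetical helper: returns true only if a block of (tx, ty, tz) threads
// respects every per-dimension limit AND the total-threads-per-block limit.
bool is_valid_block(std::uint64_t tx, std::uint64_t ty, std::uint64_t tz,
                    const BlockLimits& lim = BlockLimits{}) {
    // Every dimension must be at least 1.
    if (tx == 0 || ty == 0 || tz == 0) {
        return false;
    }
    // Per-dimension limits: 1024 x 1024 x 64 on this device.
    if (tx > lim.max_x || ty > lim.max_y || tz > lim.max_z) {
        return false;
    }
    // The product of the dimensions is the total thread count per block.
    return tx * ty * tz <= lim.max_threads_per_block;
}
```

With these limits, `is_valid_block(32, 32, 1)` is true (32 * 32 = 1024 threads), while `is_valid_block(1024, 1024, 1)` is false even though each dimension is individually within range. In an actual CUDA program, calling `cudaGetLastError()` right after the kernel launch should report an invalid configuration error for such a block shape, rather than failing silently.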