I’m working on a CUDA kernel with a 2D thread block (64 threads in the X direction and 8 threads in the Y direction). Initially, I launched the kernel with the following configuration:
The kernel works correctly. Both configurations use the same total number of threads (512), so I’m wondering why the second configuration works while the first one doesn’t.
CUDA doesn’t guarantee that a kernel will work correctly regardless of how the grid is specified. You can certainly write code that requires a specific grid configuration to work correctly.
The claim that both configurations use the same total number of threads doesn’t appear to be correct based on what you have shown. In the first case:
You have 8 blocks each of which has 8x8=64 threads, for a total of 512 threads.
In the second case:
You have 8x8=64 blocks each of which has 8x8=64 threads, for a total of 4096 threads.
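To make the arithmetic for the two cases concrete, here is a minimal host-side sketch. It assumes the first launch is a grid of 8 blocks with 8x8 threads each and the second is a grid of 8x8 blocks with 8x8 threads each, as described in the counts here. `Dim3`, `count`, `total_threads`, and `check_cases` are hypothetical names introduced for illustration; `Dim3` is only a stand-in for CUDA's `dim3` so the snippet compiles without the CUDA toolkit.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3: unspecified dimensions default to 1.
struct Dim3 {
    unsigned x, y, z;
    constexpr Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// Number of elements (blocks or threads) described by one Dim3.
constexpr unsigned long long count(Dim3 d) {
    return static_cast<unsigned long long>(d.x) * d.y * d.z;
}

// Total threads in a launch = blocks in the grid * threads per block.
constexpr unsigned long long total_threads(Dim3 grid, Dim3 block) {
    return count(grid) * count(block);
}

// Reproduces the two cases, under the assumed launch shapes.
inline void check_cases() {
    // First case: 8 blocks of 8x8 threads.
    assert(total_threads(Dim3(8), Dim3(8, 8)) == 512);
    // Second case: 8x8 blocks of 8x8 threads.
    assert(total_threads(Dim3(8, 8), Dim3(8, 8)) == 4096);
}
```

The total for a launch is always the product of all grid dimensions and all block dimensions, so changing only the grid from 8 to 8x8 multiplies the thread count by 8.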
Even if that is a typo and you meant to post something else, I would go back to what I have already said. I won’t be able to provide further suggestions even if what you meant was: