Why does changing grid and block configuration from dim3(8, 1, 1) to dim3(8, 8, 1) make the CUDA kernel work?

Hello CUDA community,

I’m working on a CUDA kernel with a 2D thread block (64 threads in the X direction and 8 threads in the Y direction). Initially, I launched the kernel with the following configuration:

```fortran
blkd = dim3(8, 8, 1)
grdd = dim3(8, 1, 1)
CALL XDXEE<<<grdd, blkd>>>
```

This doesn’t work as expected. However, when I change the grid size to:

```fortran
blkd = dim3(8, 8, 1)
grdd = dim3(8, 8, 1)
CALL XDXEE<<<grdd, blkd>>>
```

the kernel works correctly. Both configurations use the same total number of threads (512), so I’m wondering why the second configuration works while the first one doesn’t.

Thank you for your help!


Additional Info:

  • CUDA version: [specify version]
  • GPU model: [specify GPU model]
  • OS: [specify OS]

CUDA doesn’t provide any guarantee that a kernel will work correctly regardless of how the grid is specified. You can certainly write code that requires a specific grid configuration to work correctly.

That doesn’t appear to be a correct statement based on what you have shown. In the first case:

You have 8 blocks each of which has 8x8=64 threads, for a total of 512 threads.

In the second case:

You have 8x8=64 blocks, each of which has 8x8=64 threads, for a total of 4096 threads.

Even if that is a typo and you meant to post something else, I would refer you back to what I already said. I won’t be able to provide further suggestions even if what you meant was:

```fortran
blkd = dim3(8, 1, 1)
grdd = dim3(8, 8, 1)
```