dim3 blockDim(16, 16);
dim3 gridDim((imax + blockDim.x - 1) / blockDim.x, (jmax + blockDim.y - 1) / blockDim.y);
dim3 blockDim_x(256, 1);
dim3 gridDim_x(1, 256);
dim3 blockDim_y(1, 256);
dim3 gridDim_y(256, 1);
In this program I use three different block sizes. It works as written, but once I change the number of threads per block to more than 256, it stops working. For example: dim3 blockDim_x(257, 1); dim3 blockDim_y(1, 257); What is the problem?
What do you mean by “it won’t work”? What is the error message?
The program runs and does not report any errors, but the result is completely wrong and does not match what the program should produce.
Then you have a bug in your code, or your code is simply not programmed to support more than 256 threads per block.
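Since the kernels are not shown, the following is only a guess at one common cause: the grid stays fixed at 256 blocks while the block grows to 257 threads, so more threads are launched than there are array elements, and a kernel without a bounds check reads or writes out of range. A minimal sketch of the usual pattern is below. The kernel `scale`, the helpers `ceil_div` and `launch_scale`, and the row-major layout are all hypothetical and not taken from the original program; `imax` and `jmax` are assumed to be the array extents, as in the first snippet.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks of size `block` needed to cover n elements.
__host__ __device__ inline int ceil_div(int n, int block)
{
    return (n + block - 1) / block;
}

// Hypothetical kernel: scales an imax-by-jmax array stored row-major.
// The guard makes it correct for any block size, including sizes that
// do not divide imax or jmax evenly (e.g. 257).
__global__ void scale(float *a, int imax, int jmax, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < imax && j < jmax)   // surplus threads do nothing
        a[j * imax + i] *= factor;
}

// Hypothetical launcher: derives the grid from the block size instead of
// hard-coding 256, then checks both the launch and the execution for errors.
cudaError_t launch_scale(float *d_a, int imax, int jmax, float factor, dim3 block)
{
    dim3 grid(ceil_div(imax, block.x), ceil_div(jmax, block.y));
    scale<<<grid, block>>>(d_a, imax, jmax, factor);

    cudaError_t err = cudaGetLastError();   // launch-configuration errors
    if (err != cudaSuccess)
        return err;
    return cudaDeviceSynchronize();         // errors raised while the kernel ran
}
```

A call such as `launch_scale(d_a, imax, jmax, 2.0f, dim3(257, 1))` would then pick a matching grid on its own. Note that a kernel launch does not report errors unless the return codes are checked as above, and that shared-memory arrays or loop bounds sized for exactly 256 threads inside a kernel would need the same kind of review.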