Problem with performance when gridDim.x > 65536

Greetings all,

I am currently working on a blocked version of the Floyd-Warshall algorithm, using the implementation that CudaaduC posted here as a template, and have run into a rather serious performance issue. Namely, when I pass more than 65,536 blocks on the grid for a certain kernel, the runtime skyrockets (from under 2 ms to over 10 ms).

I’m running CUDA 10.1 on a GTX 1070 Ti. Each block/tile (dim3(32, 32, 1)) represents a 32x32 section of the graph matrix, and shared memory is used to speed up processing. I have no problems when processing the blocks along a given column, but when I pass more than 65,536 blocks on the grid, which is dim3(num_tiles, 1, 1), the runtime for the next kernel, which does the same thing but moves across a given row, skyrockets. For V=8100 the runtime is fine. For V=8192 (65,536 blocks of 32x32) the runtime is fine. For V=8200 the runtime skyrockets, but the program continues to run; it does not crash.
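To make the configuration concrete, here is a stripped-down sketch of the kind of mapping I am describing (this is not my actual kernel; fw_pass_kernel, d_graph, and tiles_per_row are placeholder names, and the relaxation step is omitted):

```
#include <limits.h>

#define TILE 32

// One 32x32 thread block per tile, launched on a 1D grid.  The 1D block index
// is mapped back to a 2D tile coordinate; the shared-memory tile itself stays
// 2D, which is independent of the grid dimensionality.
__global__ void fw_pass_kernel(int *d_graph, int V, int tiles_per_row)
{
    int tile_row = blockIdx.x / tiles_per_row;   // which tile this block owns
    int tile_col = blockIdx.x % tiles_per_row;

    __shared__ int tile[TILE][TILE];             // 2D shared tile in a 1D grid

    int row = tile_row * TILE + threadIdx.y;
    int col = tile_col * TILE + threadIdx.x;

    tile[threadIdx.y][threadIdx.x] =
        (row < V && col < V) ? d_graph[row * V + col] : INT_MAX;  // pad edges
    __syncthreads();

    // ... Floyd-Warshall relaxation against the current pivot tile omitted ...
}

// Launch: for V = 8200, tiles_per_row = 257 and num_tiles = 257 * 257 = 66,049,
// so gridDim.x exceeds 65,536 (still within the limit for compute capability 6.1).
//
//   dim3 block(TILE, TILE, 1);
//   dim3 grid(num_tiles, 1, 1);
//   fw_pass_kernel<<<grid, block>>>(d_graph, V, tiles_per_row);
```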

I can post the relevant code, but since it will take me some time to prepare (it isn’t fully commented yet), I wanted to go ahead and ask without it, in case the problem simply lies in my grid/block configuration (I’ve never used a 2D shared memory array with a 1D grid).

I appreciate any and all help!

Charles Johnson

PS: The graph matrix is stored in a 1D array, and I simply access the data using calculated offsets.
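For example, with a row-major layout the offset for element (i, j) would look like this (illustrative only; d_graph and V as in the sketch above):

```
// Row-major offset into the 1D graph array: element (i, j) of the V x V matrix.
__host__ __device__ inline int graph_at(const int *d_graph, int V, int i, int j)
{
    return d_graph[i * V + j];
}
```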

(1) Make sure you are compiling your code for the correct target architecture(s).
(2) Use cuda-memcheck to make sure there are no obvious issues with your code. Fix all issues cuda-memcheck reports. Example commands for both steps are below.
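For example (my_app and main.cu are placeholder names; the GTX 1070 Ti is compute capability 6.1, and cuda-memcheck ships with CUDA 10.1):

```
# compile with an explicit Pascal target
nvcc -O3 -gencode arch=compute_61,code=sm_61 -o my_app main.cu

# run the full application under cuda-memcheck
cuda-memcheck ./my_app
```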