Strange performance relationship to grid dimension?

I’ve been trying to study performance as a function of grid size. The attached log-log graph shows the time in seconds (y-axis) to run the same problem using a 16x16x1 block. The output is a square matrix, so I looked at square grids from 1x1 to 100x100; the graph's x-axis shows the total number of blocks. The kernel copies inputs into shared memory and then computes the matrix values. The kernel was written for performance testing, so each block determines how much of the output matrix it needs to compute and iterates over the copy/compute process as many times as necessary. This way each launch of the kernel performs the same amount of work, no matter the grid dimensions. (With a grid size of 2x2, for example, each block computes 1/4 of the total outputs.)

I get very similar results for any n x n block size. I realize that overlaying a 16x16 thread array onto output submatrices of varying sizes will certainly waste some processing, but I had not expected (and initial analyses do not show) an effect this large.
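For reference, the waste from overlaying 16x16 tiles can be estimated with a back-of-the-envelope sketch like the one below. It is only a model, not the actual kernel: it assumes an n x n output split evenly over a g x g grid, with every block covering its submatrix in whole 16x16 tiles. The function names and the matrix size of 1600 are hypothetical, chosen purely for illustration.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def useful_fraction(n, g, tile=16):
    """Fraction of launched thread-slots that map to a real output element.

    Assumptions (not taken from the real kernel):
      - the output is n x n and the grid is g x g,
      - each block owns a ceil(n/g) x ceil(n/g) submatrix,
      - each block sweeps its submatrix in whole tile x tile steps.
    """
    sub = ceil_div(n, g)          # side length of one block's submatrix
    tiles = ceil_div(sub, tile)   # tile steps needed along one side
    covered = g * tiles * tile    # side length actually swept by all blocks
    return (n * n) / (covered * covered)


if __name__ == "__main__":
    n = 1600  # hypothetical output size
    for g in (1, 2, 3, 10, 50, 99, 100):
        print(g * g, "blocks:", round(useful_fraction(n, g), 3))
```

Under these assumptions the fraction is 1.0 whenever the submatrix side is a multiple of 16 and drops when a block needs an extra, mostly empty tile step.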

Has anyone seen anything like this? Is there a simple explanation?


What GPU are you on, and how many blocks can you fit on each MP?

If, for example, I have 30 multiprocessors, each of which can run 3 blocks, a grid size of 90 will allow me to launch all my blocks at once. A grid size of 91 will result in the 91st block having to wait to be launched, which will obviously slow things down.
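The arithmetic behind that example can be sketched roughly as follows. This is a simplified model, not a measurement: it assumes blocks run in full "waves" of at most (multiprocessors x blocks per multiprocessor) blocks, that the total work is fixed and split evenly across blocks as in your kernel, and that each wave takes time proportional to one block's share. The function names and the defaults of 30 and 3 are just the illustrative numbers from above.

```python
def waves(num_blocks, num_mps=30, blocks_per_mp=3):
    """Number of rounds needed to run num_blocks blocks.

    Assumes at most num_mps * blocks_per_mp blocks are resident at once
    (illustrative numbers, not queried from a real device).
    """
    concurrent = num_mps * blocks_per_mp
    return -(-num_blocks // concurrent)  # ceiling division


def relative_time(num_blocks, num_mps=30, blocks_per_mp=3):
    """Modelled run time in arbitrary units.

    Total work is fixed at 1 and split evenly, so each block does
    1 / num_blocks of it and each wave costs one block's share.
    """
    return waves(num_blocks, num_mps, blocks_per_mp) / num_blocks


if __name__ == "__main__":
    for b in (1, 45, 90, 91, 180, 181):
        print(b, "blocks:", waves(b), "wave(s), time", round(relative_time(b), 4))
```

In this model the time falls as blocks are added until the first wave is full, then jumps each time one extra block spills into a new wave, which would give a downward trend with steps in it on a log-log plot.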

Looking at the graph I’d guess you have 25 (or just over 25) concurrent blocks.

The general trend of more blocks => faster execution may well have something to do with how your kernel scales the load per block.