I’ve been trying to study performance as a function of grid size. The attached log-log graph shows times in seconds (y-axis) to solve the same problem using 16x16x1 thread blocks. The output is a square matrix, so I looked at square grids from 1x1 to 100x100; the graph’s x-axis shows the total number of blocks. The kernel copies inputs into shared memory and then computes the matrix values. Because the kernel was written for performance testing, each block determines how much of the output matrix it needs to compute and iterates over the copy/compute process as many times as necessary, so that each kernel launch performs the same total amount of work regardless of the grid dimensions. (With a 2x2 grid, for example, each block computes 1/4 of the total outputs.)
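For reference, the constant-total-work scheme described above could look something like the sketch below. This is not the actual kernel, just a minimal illustration under assumed names (`computeTiled`, a placeholder compute step of doubling each element): each block strides over 16x16 output tiles until the whole n x n matrix is covered, so the total work per launch is independent of the grid size.

```cuda
#include <cuda_runtime.h>

// Sketch: each block iterates over its share of 16x16 output tiles,
// so total work per launch is constant regardless of grid dimensions.
__global__ void computeTiled(const float *in, float *out, int n)
{
    __shared__ float tile[16][16];

    int tilesPerDim = (n + 15) / 16;              // tiles needed per matrix dimension
    int totalTiles  = tilesPerDim * tilesPerDim;  // tiles covering the whole n x n output
    int numBlocks   = gridDim.x * gridDim.y;
    int blockId     = blockIdx.y * gridDim.x + blockIdx.x;

    // Grid-stride loop over tiles: with a 2x2 grid each block handles
    // 1/4 of the tiles; with a 1x1 grid a single block handles them all.
    for (int t = blockId; t < totalTiles; t += numBlocks) {
        int row = (t / tilesPerDim) * 16 + threadIdx.y;
        int col = (t % tilesPerDim) * 16 + threadIdx.x;

        // Stage inputs in shared memory (guard against edge tiles).
        if (row < n && col < n)
            tile[threadIdx.y][threadIdx.x] = in[row * n + col];
        __syncthreads();

        // Placeholder "compute the matrix values" step.
        if (row < n && col < n)
            out[row * n + col] = 2.0f * tile[threadIdx.y][threadIdx.x];
        __syncthreads();  // don't overwrite the tile until all threads are done
    }
}
```

Note the `__syncthreads()` calls sit outside the bounds checks so edge tiles don’t deadlock on divergent barriers.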
I get very similar results for any n x n block size. I realize that overlaying a 16x16 thread array onto output submatrices of varying sizes will certainly waste some processing, but I had not expected (and initial analysis does not show) an effect this large.
Has anyone seen anything like this? Is there a simple explanation?