odd block size performance results

I am working on characterizing the performance of the GPU, but I am seeing some weird upticks on the exponential curve. Below is a chart that shows the optimal cost per element in a stream. In that chart I am fixing the number of blocks at 400, which for this code gives stable, optimal performance.

External Image

This next chart is a zoom-in on the first part of the graph, in finer detail. Each column corresponds to a block size, going from 1 to 55. This time I varied the number of blocks from 1 to 512, which is what makes up the distribution of points in each block-size column. The top bound is when the number of blocks is 1, and the bottom bound, or optimal performance, is at 512.

External Image
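
In case it helps, the measurement loop looks roughly like this. This is a simplified sketch, not the exact benchmark: the kernel body, the array size, and the event-based timing shown here are just placeholders.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder streaming kernel: each thread strides over the array and does a
// simple read-modify-write, so the cost scales with the number of elements.
__global__ void streamKernel(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 22;  // placeholder stream length
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep threads per block and number of blocks, recording the cost per
    // element for each launch configuration.
    for (int blockSize = 1; blockSize <= 512; ++blockSize) {
        for (int numBlocks = 1; numBlocks <= 512; ++numBlocks) {
            cudaEventRecord(start);
            streamKernel<<<numBlocks, blockSize>>>(d_data, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%d blocks x %d threads: %g ns/element\n",
                   numBlocks, blockSize, ms * 1e6f / n);
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```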

So this is what is confusing me. On the graphs you can see, for the first half (block size < 192), that after multiples of 16 there are sharp upticks in the lower bound. It doesn’t make sense to me why this is. There are 16 multiprocessors, but I thought that only affected performance based on the number of blocks. There are 8 stream processors per multiprocessor, but that isn’t exactly sixteen. Can someone tell me what is going on here?

The other thing I am confused about is that for block sizes less than 192 there is a big difference between max and min performance, which eventually converges at 192. After 192, the gap between max and min is very tight regardless of the number of blocks. Why is that the case?

Thanks for the help.

What is the X axis for the second plot? Threads per block? Something else?

Yes, the x axis is threads per block.

So 16 threads = 1/2 warp, which is interesting. You would get these kinds of jumps if there were a fixed cost to add another half-warp as the block size grows past 16, 32, 48, etc., which is then amortized as you keep growing the block, until you hit the next multiple of 16. But threads are scheduled in blocks of 32, so I’m not sure about the steps in between.

Assuming your code reads or writes gmem, the upticks are influenced by the coalescing rules. For example, when an 8-thread block reads gmem and satisfies the coalescing conditions, the amount of data fetched is the same as it would be for a 16-thread block (coalescing reads/writes blocks of certain fixed sizes). So a 17-thread block is moving nearly twice the amount of memory necessary.
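
To illustrate, a coalesced access looks like the following. This is a generic sketch, not your kernel:

```cpp
// Generic sketch of a coalesced gmem access (not the original poster's code).
// Consecutive threads read consecutive, aligned 32-bit words, so each
// half-warp's reads are serviced as one fixed-size transfer; as noted above,
// an 8-thread block then fetches the same amount of data as a 16-thread block.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive, aligned index
    if (i < n)
        out[i] = in[i];                             // coalesced read and write
}
```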

In general, I recommend block sizes of 128 and higher. Looking at larger blocks, you should see that the “upticks” are much less significant. For example, consider the overhead of reading 32 words instead of 17, versus, say, 272 instead of 257.
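
You can work out the fetched-versus-needed ratio directly; a quick back-of-the-envelope sketch:

```cpp
#include <stdio.h>

// Back-of-the-envelope: if data moves in half-warp (16-word) granules, a block
// of b threads reading one word each pulls in ceil(b/16)*16 words.
int main(void)
{
    int sizes[] = { 17, 32, 128, 257 };
    for (int k = 0; k < 4; ++k) {
        int b = sizes[k];
        int fetched = ((b + 15) / 16) * 16;  // rounded up to 16-word granules
        printf("block %3d: fetches %3d words for %3d needed (%.2fx overhead)\n",
               b, fetched, b, (double)fetched / b);
    }
    return 0;
}
```

For a 17-thread block that is 32 words for 17 needed (about 1.9x), while for 257 threads it is 272 for 257 (about 1.06x), which is why the upticks fade at larger block sizes.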

Paulius