I’m having trouble interpreting the counters of the CUDA profiler, especially coalesced stores/writes: these seem to fluctuate between measurements of the same program with the same data size. Even in a simple 1-dimensional memory-fill program, counters like gst64 fluctuate from run to run.
I have already ensured that my occupancy is 1, that my grid size is aligned to the number of SMs, and I’m sure that the writes are completely coalesced. What could be causing these fluctuations?
You need to keep in mind that the profiler only instruments a couple of multiprocessors and then scales the counts up to the total number of multiprocessors on the card. So if there is variation in the workload between blocks, the same kernel run on the same data can return different profiling results. I also suspect (this is just a guess) that the profiling mechanism samples MP activity at a fixed frequency, so there will be some aliasing effects which produce variation in output for the same input, depending on timing.
It is probably best to take a statistical, rather than literal, interpretation of the counters.
Thank you for your reply,
I believe you’re right; the “jitter” in the measurements may well be caused by the profiler sampling at a different frequency than the kernel execution. I plotted all the profiler values for gst64 and the plot is linear with only troughs, no spikes, which fits the interval-sampling theory you suggested.
What I still do not understand is the exact relationship between grid size and the gst count. I would expect the measured value to be somewhere near:
(grid_size * block_size * sizeof(float)) / 64
This formula fits the first case of my results (I use only 1-dimensional arrays for this test): (240 * 256 * sizeof(float)) / 64 = 3840 transactions, and I measure ~3500 transactions in the profiler.
However, when I repeat the test with increasing grid sizes (in increments of 240 blocks), the transaction count should increase by 3840 per increment as well. What I see instead is an increase of only ~1720 transactions per increment.
This means that by the 13th increment, the number of transactions I theoretically need is twice the number I actually measure. Where did the other half go, I wonder?
Either the profiler is blind as a bat or my formula is wrong (I suspect the latter).
Do you (or any other reader) have any insights, suggestions, or ideas on this matter?
There is empirical evidence to suggest that the MP scheduling mechanism on GT200 and earlier cards (don’t know about Fermi yet) is very simple: MPs get “filled” with blocks to capacity (so the scheduling unit is active blocks per multiprocessor), which they run until all of those blocks are finished, then they get filled again. So when you look at profiling and performance results versus grid size, there can be some scheduling-related surprises which might not look all that intuitive at first inspection.
For the reasons I mentioned in my first reply, you can expect the profiler counters to be “curious” on cards with a high MP count whenever the total number of blocks in the grid is not an even multiple of (active blocks per MP) * (number of MPs on the card), because otherwise the number of blocks run on an instrumented MP is not guaranteed to be the same from run to run. For the same reason, run times tend to form a series of plateaus with increasing block count, with jumps at grid sizes just larger than those which evenly fill all MPs on the card.
I think this probably explains what you are seeing: 240 might not be the right increment to get predictable scaling in the MP counters; 30 * (active blocks per MP) should be. You can use the occupancy calculator spreadsheet to work out what your kernel is doing.