I used the CUDA profiler and looked at counters such as gld 32b, gld 64b, etc.
The problem is that those numbers did not seem to be correct.
I tested a very simple program so that I could predict the number of global loads for each segment size.
My program has perfectly coalesced accesses with an array element size of 4 bytes. Each half-warp of 16 threads then reads 16 * 4 = 64 bytes, so there should be only 64-byte loads.
The block size is 16x16, so there should be 16 64-byte global memory loads per block, right?
That way, the total gld 64b should be 16 * grid size (the number of blocks).
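To make the expectation concrete, here is a minimal sketch of the arithmetic above in plain C++. The function names are hypothetical, and it assumes one 64-byte load per 16-thread half-warp with 4-byte elements, which is my assumption about how the accesses are coalesced, not something the profiler documents.

```cpp
#include <cassert>

// Assumption: loads are coalesced per half-warp of 16 threads, and each
// thread reads one 4-byte element, so one half-warp = one 64-byte load.
const int kHalfWarpSize = 16;

// Hypothetical helper: expected 64-byte loads issued by a single block.
long long expected_gld64_per_block(int block_dim_x, int block_dim_y) {
    assert(block_dim_x > 0 && block_dim_y > 0);
    long long threads = static_cast<long long>(block_dim_x) * block_dim_y;
    // Round up in case the block is not a multiple of the half-warp size.
    return (threads + kHalfWarpSize - 1) / kHalfWarpSize;
}

// Hypothetical helper: expected total over the whole grid.
long long expected_gld64_total(int block_dim_x, int block_dim_y,
                               long long num_blocks) {
    assert(num_blocks >= 0);
    return expected_gld64_per_block(block_dim_x, block_dim_y) * num_blocks;
}
```

For a 16x16 block, `expected_gld64_per_block(16, 16)` gives 16, and the total is that value times the number of blocks in the grid. This is the number I compare against the profiler output.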
However, the result from the CUDA profiler did not match this estimate. The reported values were far too small.
Of course, I also tested with varying input sizes. The results were the same: not correct.
How does the profiler obtain those numbers? Is there any possibility that those numbers are not correct?