I recently reordered some data in my CUDA program so that my reads from global memory would be coalesced. I believe nearly all of the reads are now coalesced properly, as evidenced by the profile output below:
And here is the profile output for the old version:
So my question is: why hasn’t the gputime gone down? Was the memory latency simply hidden that well in the original version?
Thanks in advance.