I recently reordered some data in my CUDA program so that my reads from global memory would be coalesced. I believe nearly all reads are now coalesced properly, as evidenced by the profiler output below:
gputime: 43422.2
gld_incoherent: 746
gld_coherent: 304708
gst_incoherent: 0
gst_coherent: 215656
branch: 1451145
divergent_branch: 8397
And here is the profile output for the old version:
gputime: 43259.3
gld_incoherent: 398499
gld_coherent: 778
gst_incoherent: 55134
gst_coherent: 14728
branch: 1450768
divergent_branch: 8395
So my question is: why hasn’t gputime gone down? Was the memory latency simply being hidden that well in the original version?
Thanks in advance.