No speedup from coalescing global reads? Surprising profile results

I recently reordered some data in my CUDA program so that my reads from global memory would be coalesced. I believe nearly all of the reads are now coalesced properly, as evidenced by the profile output below:

gputime: 43422.2
gld_incoherent: 746
gld_coherent: 304708
gst_incoherent: 0
gst_coherent: 215656
branch: 1451145
divergent_branch: 8397

And here is the profile output for the old version:

gputime: 43259.3
gld_incoherent: 398499
gld_coherent: 778
gst_incoherent: 55134
gst_coherent: 14728
branch: 1450768
divergent_branch: 8395

So my question is: Why hasn’t the gputime gone down? Was the latency simply hidden that well in the original version?

Thanks in advance.

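It looks like the new version reports many more global stores (about 215k versus about 70k, counting coherent and incoherent together), which may be one reason.
Another possible reason is that your kernel is compute-bound rather than memory-bound, in which case speeding up the memory accesses would not shorten the runtime much.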
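
One rough way to test the compute-bound idea is to time a stripped-down kernel while varying the amount of arithmetic per element. The sketch below is not your kernel: `probe`, `timeProbe`, the `work` parameter and the array size are all made up for illustration, and it assumes a single GPU with enough free memory for two float arrays. If the time stays flat as `work` grows, memory traffic dominates; once the time starts scaling with `work`, the arithmetic dominates and coalescing will no longer show up in gputime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: one coalesced load, `work` dependent
// multiply-adds, and one coalesced store per thread.
__global__ void probe(const float* in, float* out, int n, int work)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];
    for (int k = 0; k < work; ++k)
        x = x * 1.0001f + 0.5f;   // arithmetic that depends on the load
    out[i] = x;                   // the store keeps the loop from being optimized away
}

// Times one launch of probe() with CUDA events and returns milliseconds.
static float timeProbe(const float* in, float* out, int n, int work)
{
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    probe<<<blocks, threads>>>(in, out, n, work);   // warm-up launch, not timed

    cudaEventRecord(start);
    probe<<<blocks, threads>>>(in, out, n, work);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                     // wait for the timed launch

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 22;                          // arbitrary size for the experiment
    float *in = 0, *out = 0;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));           // defined input values

    // Sweep the arithmetic per element: 0, 1, 4, 16, 64, 256, 1024.
    for (int work = 0; work <= 1024; work = work ? work * 4 : 1)
        printf("work=%4d  time=%.3f ms\n", work, timeProbe(in, out, n, work));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Built with something like `nvcc probe.cu -o probe` (the file name is arbitrary), this prints one timing per `work` value. Doing the same kind of experiment on the real kernel, for example by temporarily removing most of the arithmetic while keeping the loads and stores, should give a more direct answer than the synthetic probe.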