no speedup from coalescing global reads?! Surprising profile results

btrapnel · March 7, 2008, 5:06am

I recently reordered some data in my CUDA program so that my reads from global memory would be coalesced. I believe nearly all reads are now coalesced properly, as evidenced by the profile output below:

gputime: 43422.2
gld_incoherent: 746
gld_coherent: 304708
gst_incoherent: 0
gst_coherent: 215656
branch: 1451145
divergent_branch: 8397

And here is the profile output for the old version:

gputime: 43259.3
gld_incoherent: 398499
gld_coherent: 778
gst_incoherent: 55134
gst_coherent: 14728
branch: 1450768
divergent_branch: 8395

So my question is: Why hasn’t the gputime gone down? Was the latency simply hidden that well in the original version?

Thanks in advance.

DenisR · March 7, 2008, 6:25am

It looks like you have many more global stores in the new version (about 215k vs. about 70k in total), which may be one reason.
Another possible reason is that your kernel is compute bound, not memory bound.
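
To make the comparison concrete, here is a rough sketch that totals the counters from the two profiles posted above. The counter values are copied from the first post; the dictionary names and the `totals` helper are made up for illustration. It assumes coherent and incoherent events can simply be added together, which may not hold if the profiler counts them in different units, so treat the ratios as indicative only.

```python
# Counter values copied from the two profiler dumps in the first post.
# "new" is the reordered (coalesced) version, "old" is the original one.
new = {"gputime": 43422.2, "gld_incoherent": 746, "gld_coherent": 304708,
       "gst_incoherent": 0, "gst_coherent": 215656}
old = {"gputime": 43259.3, "gld_incoherent": 398499, "gld_coherent": 778,
       "gst_incoherent": 55134, "gst_coherent": 14728}


def totals(counters):
    """Return (loads, stores) by summing coherent and incoherent events.

    Assumption: both kinds of event are counted in comparable units.
    """
    loads = counters["gld_incoherent"] + counters["gld_coherent"]
    stores = counters["gst_incoherent"] + counters["gst_coherent"]
    return loads, stores


new_loads, new_stores = totals(new)
old_loads, old_stores = totals(old)

# Ratio of new to old for stores and for gputime.
store_ratio = new_stores / old_stores
time_ratio = new["gputime"] / old["gputime"]

print(f"loads:   old={old_loads}  new={new_loads}")
print(f"stores:  old={old_stores}  new={new_stores}  ratio={store_ratio:.2f}")
print(f"gputime: old={old['gputime']}  new={new['gputime']}  ratio={time_ratio:.4f}")
```

Under that assumption, the reported store events roughly triple while gputime stays almost flat, which fits either explanation: the extra stores offset the gain from coalescing, or the kernel was not limited by memory traffic in the first place.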
