Coalescing the Global memory load/store not giving any speed-up

sidxavier · March 7, 2009, 6:58am

I am working on an image processing application that uses 8-bit loads/stores from global memory. Since 8/16-bit loads do not get coalesced, the profiler reported all the loads/stores as uncoalesced even though I was reading consecutive memory locations. With that I was getting a timing of 1.5 msec.

Hoping to improve this time, I converted all the loads and stores to 32-bit by packing 4 pixels together. Now the profiler reports all the loads/stores as coalesced, but the performance remained almost the same. What could be the reason for this?
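
For readers following along, the 32-bit conversion described above might look roughly like the sketch below. It is not the original kernel: `processPixel`, `processQuads`, and `processImage` are hypothetical names, the inversion is only a stand-in for the real per-pixel work, and the pixel count is assumed to be a multiple of 4.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-pixel operation (inversion); a stand-in for the real processing.
__device__ inline unsigned char processPixel(unsigned char p)
{
    return static_cast<unsigned char>(255 - p);
}

// Each thread loads four consecutive 8-bit pixels as one 32-bit uchar4,
// so consecutive threads access consecutive 32-bit words.
__global__ void processQuads(const uchar4* in, uchar4* out, int numQuads)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numQuads) {
        uchar4 p = in[i];            // single 32-bit load
        p.x = processPixel(p.x);
        p.y = processPixel(p.y);
        p.z = processPixel(p.z);
        p.w = processPixel(p.w);
        out[i] = p;                  // single 32-bit store
    }
}

// Hypothetical host helper. Assumes pixels.size() is a non-zero multiple of 4.
// Error checking is omitted for brevity.
std::vector<unsigned char> processImage(const std::vector<unsigned char>& pixels)
{
    const int numQuads = static_cast<int>(pixels.size() / 4);
    const size_t bytes = static_cast<size_t>(numQuads) * sizeof(uchar4);

    uchar4* dIn = nullptr;
    uchar4* dOut = nullptr;
    cudaMalloc(&dIn, bytes);         // cudaMalloc memory is suitably aligned for uchar4
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, pixels.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 64;          // block size mentioned in this thread
    const int blocks = (numQuads + threads - 1) / threads;
    processQuads<<<blocks, threads>>>(dIn, dOut, numQuads);

    std::vector<unsigned char> result(bytes);
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling `processImage` on a buffer of 8-bit pixels returns the processed pixels in the same order. The image width is ignored here; a real 2D kernel would also need each row to start on a 4-byte boundary.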

My application uses a lot of shared memory per thread, so I am forced to use only 64 threads/block to fit 8 blocks per SM, which gives a maximum occupancy of 0.67.

Also, what exactly does it mean when it is said that warp execution hides global memory latency if enough warps or threads are present? I currently have only 16 warps running at a time out of a possible 24, so could this be the reason for no visible speed-up even after coalescing?
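
The warp and occupancy figures quoted above can be reproduced with a little host-side arithmetic. The sketch below assumes the per-SM limits of compute capability 1.0/1.1 hardware (24 resident warps, 8 resident blocks), which match the numbers in this post; the constant and function names are illustrative only.

```cuda
// Assumed per-SM limits for compute capability 1.0/1.1 hardware.
const int kWarpSize = 32;
const int kMaxWarpsPerSM = 24;   // 768 resident threads / 32 threads per warp
const int kMaxBlocksPerSM = 8;

// Warps resident on one SM for a given block size and block count.
int residentWarpsPerSM(int threadsPerBlock, int blocksPerSM)
{
    // The hardware never keeps more than kMaxBlocksPerSM blocks resident.
    if (blocksPerSM > kMaxBlocksPerSM) blocksPerSM = kMaxBlocksPerSM;
    int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    int warps = warpsPerBlock * blocksPerSM;
    return warps > kMaxWarpsPerSM ? kMaxWarpsPerSM : warps;
}

// Occupancy = resident warps / maximum resident warps.
float occupancy(int threadsPerBlock, int blocksPerSM)
{
    return static_cast<float>(residentWarpsPerSM(threadsPerBlock, blocksPerSM))
           / kMaxWarpsPerSM;
}
```

With 64 threads/block and 8 blocks per SM this gives 2 warps per block, 16 resident warps, and an occupancy of 16/24, or about 0.67. Under these assumed limits, 64-thread blocks cannot go higher than that, because no more than 8 blocks are resident on an SM at once.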

Kindly Help!

sidxavier · March 7, 2009, 4:56pm

hello? any one? bounce!

Ailleur · March 7, 2009, 6:33pm

This means that if, on a given multiprocessor, some warps are waiting for a global memory transaction while other warps are ready to execute, the scheduler can set aside the waiting warp and run one that is ready, thus hiding the memory latency.

Are you saying you have 16 warps total on the card or 16 warps per multiprocessor? If it's the former, then you're not stressing the video card a whole lot.

As for not gaining any performance from coalescing reads and writes, could it be that you are not memory bound in the first place?

If you are doing enough operations on your pixels, as you seem to imply by saying that you make heavy use of shared memory, it could be that you are compute bound.
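
One rough way to check this is to time a copy-only kernel that uses the same 32-bit access pattern but does no arithmetic, and compare it with the full kernel's time. The sketch below is only an illustration under that assumption; `copyQuads` and `timeCopyKernelMs` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Copy-only kernel: same 32-bit loads and stores, no computation.
__global__ void copyQuads(const uchar4* in, uchar4* out, int numQuads)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numQuads) out[i] = in[i];
}

// Returns the elapsed time in milliseconds of one copy-only launch
// over numQuads 32-bit words (numQuads must be positive).
float timeCopyKernelMs(int numQuads)
{
    const size_t bytes = static_cast<size_t>(numQuads) * sizeof(uchar4);
    uchar4* dIn = nullptr;
    uchar4* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemset(dIn, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 64;
    const int blocks = (numQuads + threads - 1) / threads;

    copyQuads<<<blocks, threads>>>(dIn, dOut, numQuads);  // warm-up launch, not timed

    cudaEventRecord(start);
    copyQuads<<<blocks, threads>>>(dIn, dOut, numQuads);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                           // wait for the timed launch

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dIn);
    cudaFree(dOut);
    return ms;
}
```

If the copy-only time for the same image size is much smaller than the 1.5 msec of the full kernel, memory traffic is probably not the bottleneck, which would be consistent with the kernel being compute bound.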
