Coalescing global memory loads/stores not giving any speed-up

I am working on an image processing application that uses 8-bit loads/stores from global memory. Since 8/16-bit loads do not get coalesced, the profiler reported all the loads/stores as uncoalesced even though I was reading consecutive memory locations. The kernel took 1.5 ms in that configuration.

Hoping to improve this time, I converted all the loads and stores to 32 bits by packing the data of 4 pixels together. The profiler now reports all the loads/stores as coalesced, but the performance remained almost the same. What could be the reason for this?
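For reference, a minimal sketch of the kind of packed 32-bit access described above. The names `invertPixel`, `process4` and `launchProcess4` are hypothetical, the per-pixel operation is a stand-in for the real processing, and the sketch assumes the pixel count is a multiple of 4 and that the buffers come from `cudaMalloc` (so they are suitably aligned).

```cuda
#include <cuda_runtime.h>

// Hypothetical per-pixel operation (simple inversion); a placeholder
// for whatever the real application does with each pixel.
__host__ __device__ inline unsigned char invertPixel(unsigned char p)
{
    return static_cast<unsigned char>(255 - p);
}

// Each thread reads four consecutive 8-bit pixels as one 32-bit uchar4,
// processes them, and writes them back as one 32-bit store. Consecutive
// threads therefore access consecutive 32-bit words.
__global__ void process4(const uchar4* in, uchar4* out, int nVec)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nVec) {
        uchar4 v = in[i];          // one 32-bit load
        v.x = invertPixel(v.x);
        v.y = invertPixel(v.y);
        v.z = invertPixel(v.z);
        v.w = invertPixel(v.w);
        out[i] = v;                // one 32-bit store
    }
}

// Launch helper. d_in and d_out are device pointers holding nPixels bytes
// each; nPixels is assumed to be a multiple of 4.
inline void launchProcess4(const unsigned char* d_in, unsigned char* d_out,
                           int nPixels, int threadsPerBlock = 64)
{
    int nVec = nPixels / 4;
    int blocks = (nVec + threadsPerBlock - 1) / threadsPerBlock;
    process4<<<blocks, threadsPerBlock>>>(
        reinterpret_cast<const uchar4*>(d_in),
        reinterpret_cast<uchar4*>(d_out), nVec);
}
```

The default of 64 threads per block only mirrors the block size mentioned in this thread; it is not a recommendation.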

My application uses a lot of shared memory per thread, so I am forced to use only 64 threads per block in order to fit 8 blocks per SM and get a maximum occupancy of 0.67.

Also, when it is said that warp execution hides the global memory latency if enough warps or threads are present, what exactly does that mean? I currently have only 16 warps running out of 24 at a time, so could this be the reason for no visible speed-up even after coalescing?
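For reference, the occupancy figure follows from the numbers above, assuming 32 threads per warp and a limit of 24 resident warps per SM (the limit implied by the "16 out of 24" figure):

```latex
\begin{align*}
\text{warps per block} &= \frac{64\ \text{threads}}{32\ \text{threads/warp}} = 2 \\
\text{resident warps per SM} &= 8\ \text{blocks} \times 2\ \text{warps/block} = 16 \\
\text{occupancy} &= \frac{16}{24} \approx 0.67
\end{align*}
```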

Kindly Help!

hello? any one? bounce!

This means that if, on a given multiprocessor, some warps are waiting on a global memory transaction while other warps are ready to execute, the scheduler can set aside the waiting warp and run one that is ready, thus hiding the memory latency.

Are you saying you have 16 warps total on the card or 16 warps per multiprocessor? If it's the former, then you're not stressing the video card a whole lot.

As for not gaining any performance from coalescing reads and writes, could it be that you are not memory bound in the first place?

If you are doing enough operations on your pixels, as you seem to imply by saying that you make heavy use of shared memory, it could be that you are compute bound.
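One rough way to check this is to time a kernel that performs the same loads and stores but none of the arithmetic, and compare it with the 1.5 ms of the full kernel. The sketch below is only an illustration under assumptions: `copyOnly`, `timeCopyOnly` and `effectiveBandwidthGBs` are hypothetical names, the data is assumed to be packed as `uchar4`, and error checking is omitted for brevity.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Same memory traffic as a packed-pixel kernel, but no computation.
__global__ void copyOnly(const uchar4* in, uchar4* out, int nVec)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nVec) out[i] = in[i];
}

// Times one copyOnly launch over nVec 32-bit elements; returns milliseconds.
float timeCopyOnly(int nVec, int threadsPerBlock)
{
    uchar4 *d_in = 0, *d_out = 0;
    size_t bytes = static_cast<size_t>(nVec) * sizeof(uchar4);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blocks = (nVec + threadsPerBlock - 1) / threadsPerBlock;
    copyOnly<<<blocks, threadsPerBlock>>>(d_in, d_out, nVec);  // warm-up run

    cudaEventRecord(start);
    copyOnly<<<blocks, threadsPerBlock>>>(d_in, d_out, nVec);  // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ms;
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that moved
// bytesRead + bytesWritten bytes in ms milliseconds.
double effectiveBandwidthGBs(size_t bytesRead, size_t bytesWritten, float ms)
{
    return static_cast<double>(bytesRead + bytesWritten) / (ms * 1.0e6);
}
```

If the copy-only kernel runs much faster than the full kernel on the same data, the arithmetic or shared memory work is likely the bottleneck rather than global memory. If the two times are close, the kernel is probably memory bound, and the effective bandwidth can be compared against the card's peak.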