I am working on an image processing application that uses 8-bit loads and stores from global memory. Since 8- and 16-bit accesses do not get coalesced, the profiler reported all loads and stores as uncoalesced even though I was reading consecutive memory locations. The run time with that version was 1.5 ms.
Hoping to improve this time, I converted all the loads and stores to 32-bit by packing the data of 4 pixels together. The profiler now reports all loads and stores as coalesced, but the performance is almost unchanged. What could be the reason for this?
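To make the change concrete, here is a minimal sketch of the kind of conversion I mean. It is not my actual kernel: the names `process8`, `process32` and `processPixel` are made up for illustration, the per-pixel operation (a simple inversion) is only a placeholder, and it assumes the pixel count is a multiple of 4 and the buffers come from `cudaMalloc`, so they are 4-byte aligned.

```cuda
#include <cuda_runtime.h>

// Placeholder per-pixel operation (inversion). Hypothetical: the real
// application does more work per pixel than this.
__device__ __forceinline__ unsigned char processPixel(unsigned char p)
{
    return static_cast<unsigned char>(255 - p);
}

// 8-bit version: each thread issues one 8-bit load and one 8-bit store.
__global__ void process8(const unsigned char* in, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = processPixel(in[i]);
}

// 32-bit version: each thread loads 4 packed pixels as one uchar4 (a single
// 32-bit access), processes them in registers, and writes them back with one
// 32-bit store. n4 is the number of uchar4 elements, i.e. pixels / 4.
__global__ void process32(const uchar4* in, uchar4* out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        uchar4 p = in[i];
        p.x = processPixel(p.x);
        p.y = processPixel(p.y);
        p.z = processPixel(p.z);
        p.w = processPixel(p.w);
        out[i] = p;
    }
}
```

With 64 threads per block, the packed version would be launched along the lines of `process32<<<(n4 + 63) / 64, 64>>>(d_in4, d_out4, n4)`, where `d_in4` and `d_out4` are the same device buffers reinterpreted as `uchar4*`. Note that each thread now handles 4 pixels, so the grid has a quarter as many threads as the 8-bit version.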
My application uses a lot of shared memory per thread, so I am forced to use only 64 threads per block in order to fit 8 blocks per SM, which gives a maximum occupancy of 0.67.
Also, when it is said that warp execution hides the global memory latency if enough warps (or threads) are present, what exactly does that mean? I currently have only 16 warps resident at a time out of a possible 24, so could this be the reason for no visible speed-up even after coalescing?
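For reference, this is how I arrive at the 16-of-24 figure. It is a small host-side sketch of the arithmetic only; the helper names `residentWarps` and `occupancy` are made up, the inputs are just the numbers quoted above, and the warp size of 32 is the standard CUDA value.

```cuda
// Host-side arithmetic only; nothing here queries the device.
const int kWarpSize = 32;  // threads per warp

// Warps resident on one SM for a given block size and number of
// simultaneously resident blocks (block size rounded up to whole warps).
int residentWarps(int threadsPerBlock, int blocksPerSM)
{
    int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    return warpsPerBlock * blocksPerSM;
}

// Occupancy = resident warps / maximum warps the SM can hold.
float occupancy(int threadsPerBlock, int blocksPerSM, int maxWarpsPerSM)
{
    return static_cast<float>(residentWarps(threadsPerBlock, blocksPerSM))
           / static_cast<float>(maxWarpsPerSM);
}
```

With my configuration, `residentWarps(64, 8)` gives 2 warps per block times 8 blocks = 16 warps, and `occupancy(64, 8, 24)` gives 16 / 24, which is about 0.67.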
Kindly help!