I have an algorithm that used to have 100 % uncoalesced accesses. After reading the Performance Chapter in the programming guide, I was able to turn that into 100 % coalesced accesses.
Based on the following statement in the Programming Guide (for CC 1.1), I expected a speed-up of 16x, provided that my algorithm is purely bandwidth-limited, which seems to be the case.
Therefore, I had 16 different memory accesses when the program was still uncoalesced. Now that it is coalesced, the number of memory accesses should have been reduced by a factor of 16.
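A minimal sketch of the counting behind this expectation, written as plain host-side C++ rather than CUDA. It assumes a simplified version of the CC 1.0/1.1 coalescing rule for 4-byte words: a half-warp of 16 threads gets one 64-byte transaction only if thread k reads word k of a 64-byte-aligned segment, and 16 separate transactions otherwise. The function names and the address helper are hypothetical and only serve the illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model of the CC 1.0/1.1 coalescing rule for 4-byte words.
// `byte_addr` is a hypothetical input: the byte address read by each of the
// 16 threads of one half-warp. Returns the number of memory transactions the
// model predicts (1 if coalesced, 16 if not, -1 for an unsupported input).
int transactions_per_half_warp(const std::vector<std::uint64_t>& byte_addr) {
    const std::size_t kHalfWarp = 16;
    const std::uint64_t kWordBytes = 4;
    const std::uint64_t kSegmentBytes = 64;

    if (byte_addr.size() != kHalfWarp) return -1;   // model covers full half-warps only

    const std::uint64_t base = byte_addr[0];
    if (base % kSegmentBytes != 0) return 16;       // segment start is misaligned

    for (std::size_t k = 0; k < kHalfWarp; ++k) {
        // Thread k must read exactly the k-th word of the segment.
        if (byte_addr[k] != base + k * kWordBytes) return 16;
    }
    return 1;
}

// Hypothetical helper: thread k reads the 4-byte element at index k * stride
// of an array that starts at byte address `base`.
std::vector<std::uint64_t> strided_addresses(std::uint64_t base, std::uint64_t stride) {
    std::vector<std::uint64_t> addr(16);
    for (std::uint64_t k = 0; k < 16; ++k) {
        addr[k] = base + k * stride * 4;
    }
    return addr;
}
```

Under this model, `transactions_per_half_warp(strided_addresses(0, 1))` gives 1, while a stride of 2 or a base address of 4 gives 16. That is where the factor of 16 in the transaction count comes from; the model says nothing about how transaction count translates into run time.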
However, in practice the execution time only decreased by about 33 %. Since each memory load takes around 400 cycles, I assumed that the loads dominate the run time. How can this mere 33 % be explained? What did I miss?
Btw: for these measurements, I launched only a single block with 32 threads. Although that of course cannot give good performance, it should at least have reflected the 16x speed-up, or so I assumed until now. (Later on I tried different execution configurations, with roughly the same “lame” speed-up.)
Any feedback is greatly appreciated!