Try using a float4 structure; that way you only need one load: tmp=srcData[ind]; Also try rewriting the loop. Your code handles the general case, but you could unroll the loop for a few particular cases.
I think the Visual Profiler is giving me false feedback.
I ran a test and adjusted the input data size so that the portions for the blocks are 128-bit aligned (I changed nothing in the kernel code). In that case the Profiler shows 100% Load Efficiency, as expected.
However, this was a change from 1532 floats to 1536 floats per portion, and I haven't observed any speed difference. In both cases the application run time was 31 seconds, yet the Profiler shows 100% vs. 2.3% Load Efficiency for a memory-bound kernel that accounts for about 90% of the application run time.
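For anyone who wants to check the alignment arithmetic behind those two sizes, here is a minimal host-side sketch. It assumes the portions are packed back to back in a single float array whose base address is itself aligned, which may not match the actual layout; the function names are made up for illustration. Note that 1532 floats (6128 bytes) is already a multiple of 16 bytes (128 bits) but not of 128 bytes, while 1536 floats (6144 bytes) is a multiple of both, so the change only matters at 128-byte granularity.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

// Byte offset at which portion `index` starts, ASSUMING portions are packed
// back to back in one float array (hypothetical layout, not from the post).
constexpr std::size_t portion_offset_bytes(std::size_t floats_per_portion,
                                           std::size_t index) {
    return floats_per_portion * sizeof(float) * index;
}

// True if the byte offset falls exactly on the given boundary.
constexpr bool is_aligned(std::size_t offset_bytes, std::size_t boundary_bytes) {
    return offset_bytes % boundary_bytes == 0;
}

// Smallest float count >= floats_per_portion whose byte size is a multiple
// of boundary_bytes (i.e. the padded portion size).
std::size_t padded_floats(std::size_t floats_per_portion,
                          std::size_t boundary_bytes) {
    assert(boundary_bytes % sizeof(float) == 0);
    const std::size_t floats_per_boundary = boundary_bytes / sizeof(float);
    return (floats_per_portion + floats_per_boundary - 1)
           / floats_per_boundary * floats_per_boundary;
}
```

With these helpers, `padded_floats(1532, 128)` gives the 1536 used in the test above, and `is_aligned(portion_offset_bytes(1532, 1), 128)` is false while the same check against a 16-byte boundary is true.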
It is strange to me, but it seems that the Visual Profiler is giving me wrong feedback.
This happens to me very often. The bandwidth figures seem to have the same problem. Sometimes I get a very low bandwidth, which shouldn't be the case, and sometimes it is above the bandwidth that is possible with that hardware.
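One way to cross-check a reported bandwidth is to compute it by hand from the bytes the kernel is supposed to move and the measured kernel time, then compare it with the theoretical peak derived from memory clock and bus width. The sketch below is only that arithmetic; the function names and parameters are hypothetical, and the byte counts and timing have to come from your own kernel and your own timer.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s: bytes the kernel reads plus bytes it writes,
// divided by the measured kernel time in seconds.
double effective_bandwidth_gbs(double bytes_read, double bytes_written,
                               double seconds) {
    assert(seconds > 0.0);
    return (bytes_read + bytes_written) / seconds / 1e9;
}

// Theoretical peak in GB/s from the memory clock (Hz), the bus width (bits)
// and the transfers per clock (2 for double data rate memory).
double theoretical_bandwidth_gbs(double memory_clock_hz,
                                 std::size_t bus_width_bits,
                                 double transfers_per_clock) {
    assert(bus_width_bits % 8 == 0);
    return memory_clock_hz * (bus_width_bits / 8.0) * transfers_per_clock / 1e9;
}
```

If the hand-computed value disagrees strongly with the profiler, or the profiler reports more than the theoretical peak, it may be worth checking which counters the profiler figure is based on and whether the timing covers only the kernel.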