I recently attempted to speed up my application by using 16-bit integers to store my data instead of 32-bit floats. The application is completely bandwidth bound, so this should provide a significant speedup (and the loss of accuracy is acceptable). However, the speedup is quite small (<15%). Looking at the generated PTX makes me wonder if the right code is being generated.
In the original version:

texture<float2, 2, cudaReadModeElementType> LUT;
float2 val1 = tex2D(LUT, loc.x, loc.y);
The generated PTX load instruction is:
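Something roughly like this (register numbers other than $f86 and $f87 are illustrative, not exact):

mov.f32 $f86, 0f00000000;     // 0
mov.f32 $f87, 0f00000000;     // 0
tex.2d.v4.f32.f32 {$f88,$f89,$f90,$f91}, [LUT, {$f84,$f85,$f86,$f87}];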
It puts 0 in both $f86 and $f87, and the PTX manual suggests one only needs to specify two coordinates for a 2D texture, so I don't quite understand this. What really worries me, though, is that there are 4 destination registers when there should only be 2. (Is it using twice the bandwidth it should?)
In the new version (which should read half the total number of bytes from memory, yet the performance increase is <15%):
texture<ushort2, 2, cudaReadModeNormalizedFloat> LUT;
float2 val1 = tex2D(LUT, loc.x, loc.y);
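For completeness, the host-side binding for this texture is the usual cudaArray setup, something like the following (width, height, and hostData are placeholders):

// Hypothetical host-side setup for the ushort2 texture.
cudaChannelFormatDesc desc = cudaCreateChannelDesc<ushort2>();
cudaArray* arr = 0;
cudaMallocArray(&arr, &desc, width, height);
cudaMemcpyToArray(arr, 0, 0, hostData, width * height * sizeof(ushort2),
                  cudaMemcpyHostToDevice);
cudaBindTextureToArray(LUT, arr, desc);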
and the generated instruction is:
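Roughly (only $r88 is certain, since it feeds the moves shown below; the other register numbers and the .u32 result type are my best guesses):

tex.2d.v4.u32.f32 {$r88,$r89,$r90,$r91}, [LUT, {$f84,$f85,$f86,$f87}];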
which seems weird, because with cudaReadModeNormalizedFloat the result should be loaded into floating-point registers as well. Further inspection reveals the following code later on:
mov.s32 $r92, $r88;
mov.b32 $f107, $r92;
sub.f32 $f108, $f56, $f107;
which just seems really inefficient (the mov.b32 is only a bit-for-bit move between register files, not a conversion), but maybe it gets optimized away when the PTX is compiled to machine code.
I then thought that perhaps cudaReadModeNormalizedFloat is the same speed as reading floats, since it has to write the same amount of data into registers even if it only reads half as many bytes from main memory.
Therefore, I tried reading the values as true ushort2s (cudaReadModeElementType) to see if the load instruction would use u16. It did not: it looked identical to the load instruction above, and it had identical performance to the cudaReadModeNormalizedFloat version.
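For reference, that variant was:

texture<ushort2, 2, cudaReadModeElementType> LUT;
ushort2 val1 = tex2D(LUT, loc.x, loc.y);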
Does the hardware always read at least 32 bits no matter what? Any ideas?
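In case it helps anyone reproduce this, one way to take textures out of the picture and compare raw bandwidth for the two element sizes would be a pair of streaming kernels like these (just a sketch; kernel names, launch shapes, and sizes are placeholders):

// Sketch: compare raw global-memory bandwidth for ushort2 vs float2 reads.
__global__ void stream_ushort2(const ushort2* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int i = tid; i < n; i += gridDim.x * blockDim.x) {
        ushort2 v = in[i];               // one 32-bit load per element
        acc += v.x * (1.0f / 65535.0f);  // mimic the normalized-float conversion
        acc += v.y * (1.0f / 65535.0f);
    }
    out[tid] = acc;                      // keep the loads from being optimized out
}

__global__ void stream_float2(const float2* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int i = tid; i < n; i += gridDim.x * blockDim.x) {
        float2 v = in[i];                // one 64-bit load per element
        acc += v.x + v.y;
    }
    out[tid] = acc;
}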