Is the compiler generating the correct PTX texture load? A ushort texture generates a u32 load

I recently tried to speed up my application by storing my data as 16-bit integers instead of 32-bit floats. The application is completely bandwidth bound, so this should provide a significant speedup (and the loss of accuracy is acceptable). However, the speedup is quite small: less than 15%. Looking at the generated PTX makes me wonder whether the right code is being generated.

Previous version:

texture<float2, 2, cudaReadModeElementType> LUT;

float2 val1 = tex2D(LUT, loc.x, loc.y );

The generated PTX load instruction is:

tex.2d.v4.f32.f32 {$f88,$f89,$f90,$f91},LUT,{$f84,$f85,$f86,$f87};

It puts 0 in both f86 and f87, and the PTX manual suggests that one only needs to specify two coordinates for a 2D texture, so I don’t quite understand this. What really worries me, though, is that there are 4 destination registers when there should only be 2. (Is it using twice the bandwidth it should?)

In the new version (which should be reading half the total number of bytes from memory, but the performance increase is < 15%):

texture<ushort2, 2, cudaReadModeNormalizedFloat> LUT;

float2 val1 = tex2D(LUT, loc.x, loc.y);

and the generated instruction is:

tex.2d.v4.u32.f32 {$r88,$r89,$r90,$r91},LUT,{$f84,$f85,$f86,$f87};

which seems weird, because with cudaReadModeNormalizedFloat it should be loading the result into floating-point registers as well. Further inspection reveals the following code later on:

mov.s32 $r92, $r88;

mov.b32 $f107, $r92;

sub.f32 $f108, $f56, $f107;

which just seems really inefficient, but maybe it gets optimized away when the ptx is compiled.
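For what it's worth, mov.b32 is an untyped 32-bit move, so the sequence above copies bits and does not convert a number. My reading (an assumption, not something the PTX confirms) is that the registers written by the texture load already hold float bit patterns and only the PTX register typing is odd. A minimal host-side C++ sketch of the difference between the two operations; the function names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the 32 bits of an integer as a float, in the spirit of
// PTX mov.b32. The bit pattern is copied unchanged; nothing is converted.
inline float bits_to_float(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Numeric conversion, shown for contrast (in the spirit of PTX
// cvt.rn.f32.u32). The integer value is rounded to the nearest float.
inline float convert_to_float(std::uint32_t value) {
    return static_cast<float>(value);
}
```

With the bit pattern 0x3F800000, the first function gives 1.0f while the second gives the integer's numeric value as a float. A move like the first one costs nothing more than a register copy, which is why it can plausibly disappear when the PTX is compiled.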

I then thought that perhaps cudaReadModeNormalizedFloat is the same speed as reading floats, since it has to write the same amount of data into a register even if it only reads half as many bytes from main memory.

Therefore, I tried reading the values as true ushort2s (cudaReadModeElementType) to see whether the load instruction would use u16. It did not: it looked identical to the load instruction above, and it had identical performance to the cudaReadModeNormalizedFloat version.

Does the hardware always read at least 32 bits no matter what? Any ideas?
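To make the expectation explicit, here is the arithmetic behind "half the bytes" as a small host-side C++ sketch. Float2 and UShort2 are stand-in structs I made up so the snippet compiles without the CUDA headers, and ideal_speedup is a hypothetical helper. It assumes the kernel is purely bandwidth bound and ignores the texture cache and latency entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins for the CUDA vector types float2 and ushort2 (assumption:
// same layout, two tightly packed components each).
struct Float2  { float x, y; };
struct UShort2 { std::uint16_t x, y; };

// Upper bound on the speedup from shrinking each texel, valid only if
// run time is proportional to the number of bytes fetched.
constexpr double ideal_speedup(std::size_t old_bytes, std::size_t new_bytes) {
    return static_cast<double>(old_bytes) / static_cast<double>(new_bytes);
}
```

Under that assumption, ideal_speedup(sizeof(Float2), sizeof(UShort2)) is 2.0, which is the gap between what I expected and the less than 15% I measured.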

I don’t have an answer to your problem, just two thoughts:

  1. You could try decuda (http://www.cs.rug.nl/~wladimir/decuda/) to disassemble the compiled PTX and look at what’s really happening. It’s fairly easy to use and proves very useful when dealing with CUDA awkwardness.
  2. Are you sure your problem is bandwidth bound and not latency bound?

Thanks for pointing out decuda. Its output makes a lot more sense than the PTX.

All of the texture load instructions now look like this:

tex.2d.b32.f32 {$r0,$r1,_,_}, $tex2, {$r0,$r1}

So it actually doesn’t have more destination registers than it should. But in all cases, even when loading the ushort, the destination type is b32. In this case I would guess it reads 16 bits from memory and just puts them in the lower part of the register, based on later arithmetic operations.

We definitely were memory bound. However, it now seems likely that after reducing the memory traffic we became latency bound.

Indeed, texture loads always return 32-bit quantities, even if you read from a short/ushort texture. The texture’s internal format is shorts, as it should be, so that’s no problem.
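As a rough illustration of what "always returns 32-bit quantities" means for a 16-bit texel, here is a host-side C++ sketch of the two read modes. The function names are hypothetical. The mapping of an unsigned 16-bit texel onto [0.0, 1.0] is the documented behaviour of cudaReadModeNormalizedFloat, but the exact rounding done by the texture hardware is not modelled here.

```cpp
#include <cassert>
#include <cstdint>

// cudaReadModeElementType, sketched: the 16-bit texel lands in the low
// half of a 32-bit register and the upper 16 bits are zero.
inline std::uint32_t widen_u16(std::uint16_t texel) {
    return static_cast<std::uint32_t>(texel);
}

// cudaReadModeNormalizedFloat, sketched: the full unsigned 16-bit range
// is mapped onto [0.0, 1.0], so 0 becomes 0.0f and 0xFFFF becomes 1.0f.
inline float normalize_u16(std::uint16_t texel) {
    return static_cast<float>(texel) / 65535.0f;
}
```

Either way the result occupies one 32-bit register per component, so the saving from a ushort texture is in the bytes fetched from memory, not in the register traffic.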