compiler generating correct ptx texture load? ushort generates u32 load

eelsen · December 13, 2007, 10:59pm

I recently attempted to speed up my application by using 16-bit integers to store my data instead of 32-bit floats. The application is completely bandwidth bound, so this should provide a significant speedup (and the loss of accuracy is acceptable). However, the speedup is quite small, under 15%. Looking at the generated ptx makes me wonder if the right code is being generated.

Previous version:

texture<float2, 2, cudaReadModeElementType> LUT;

float2 val1 = tex2D(LUT, loc.x, loc.y );

generated ptx load instruction is:

tex.2d.v4.f32.f32 {$f88,$f89,$f90,$f91},LUT,{$f84,$f85,$f86,$f87};

It puts 0 in both f86 and f87, and the ptx manual would suggest one only needs to specify two coords for a 2d texture, so I don’t quite understand this. What really worries me though is that there are 4 destination registers when there should only be 2. (Is it using twice the bandwidth it should?)

In the new version (which should be reading half the total number of bytes from memory, but the performance increase is < 15%):

texture<ushort2, 2, cudaReadModeNormalizedFloat> LUT;

float2 val1 = tex2D(LUT, loc.x, loc.y);

and the generated instruction is:

tex.2d.v4.u32.f32 {$r88,$r89,$r90,$r91},LUT,{$f84,$f85,$f86,$f87};

which seems weird because it should be loading the result into floating registers as well. Further inspection reveals the following code later on:

mov.s32 $r92, $r88;

mov.b32 $f107, $r92;

sub.f32 $f108, $f56, $f107;

which just seems really inefficient, but maybe it gets optimized away when the ptx is compiled.
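For what it's worth, mov.b32 copies the 32 register bits unchanged and does not convert an integer value to a float, so a .u32 destination type on the texture load does not by itself say whether the payload is an integer or a float. A minimal host-side C++ sketch of that distinction follows. The helper names bits_as_float and value_as_float are made up for illustration, and this only mimics the semantics; it is not what the compiler emits.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the 32 bits of an integer register as a float, the way
// "mov.b32 $f107, $r92;" does: no numeric conversion takes place.
float bits_as_float(std::uint32_t r) {
    float f;
    static_assert(sizeof f == sizeof r, "float must be 32 bits");
    std::memcpy(&f, &r, sizeof f);
    return f;
}

// For contrast: a genuine integer-to-float value conversion
// (what a cvt.rn.f32.u32 instruction would do in PTX).
float value_as_float(std::uint32_t r) {
    return static_cast<float>(r);
}
```

With these, bits_as_float(0x3F800000u) gives 1.0f, whereas value_as_float(0x3F800000u) gives 1065353216.0f.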

I then thought that perhaps cudaReadModeNormalizedFloat is the same speed as reading floats, since it has to write the same amount of data into a register even if it only reads half as many bytes from main memory.

Therefore, I tried reading the values as true ushort2s (cudaReadModeElementType) to see if the load instruction would have u16. It did not. It looked identical to the above load instruction, and it had identical performance to the cudaReadModeNormalizedFloat version.

Does the hardware always read at least 32 bits no matter what? Any ideas?

seb · December 13, 2007, 11:36pm

I don’t have an answer to your problem just two thoughts:

You could try decuda (http://www.cs.rug.nl/~wladimir/decuda/) to disassemble the binary compiled from the PTX and look at what’s really happening. It’s fairly easy to use and proves very useful when dealing with CUDA awkwardness.
Are you sure your problem is bandwidth bound and not latency bound?

eelsen · December 14, 2007, 6:49pm

Thanks for pointing out decuda. Its output makes a lot more sense than the ptx.

All of the texture load instructions now look like this:

tex.2d.b32.f32 {$r0,$r1,_,_}, $tex2, {$r0,$r1}

So it actually doesn’t have more destination registers than it should. But in all cases, even when loading the ushort, the destination type is b32. In this case I would guess it reads 16 bits from memory and just puts them in the lower part of the register, based on later arithmetic operations.

We definitely were memory bound. However, it now seems likely that after reducing the memory traffic we became latency bound.
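A rough back-of-envelope way to reason about this, assuming a toy model in which a fraction f of the original runtime scales with the number of bytes read and the rest (latency, arithmetic) does not. Both function names are hypothetical, and the model ignores any overlap between latency and transfer, so treat it as a sketch only.

```cpp
#include <cassert>

// Assumed toy model: a fraction f of the original runtime scales with the
// bytes read; the remaining 1 - f does not change.
// Halving the bytes then gives a speedup of 1 / ((1 - f) + f / 2).
double speedup_when_bytes_halved(double f) {
    assert(f >= 0.0 && f <= 1.0);
    return 1.0 / ((1.0 - f) + 0.5 * f);
}

// Inverse of the above: the fraction f implied by an observed
// speedup s, valid for 1 <= s <= 2 in this model.
double implied_bandwidth_fraction(double s) {
    assert(s >= 1.0 && s <= 2.0);
    return 2.0 * (1.0 - 1.0 / s);
}
```

In this model a kernel whose time is entirely proportional to bytes read (f = 1) would speed up 2x, while a 1.15x speedup corresponds to f of about 0.26.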

wumpus · December 15, 2007, 1:57pm

Indeed, texture loads always return 32-bit quantities, even if you read from a short/ushort texture. The texture's internal format is as it should be, shorts, so that’s no problem.
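To make the stored-versus-returned distinction concrete, here is a small host-side C++ sketch. UShort2, Float2 and fetch_normalized are hypothetical stand-ins, not the CUDA types or API. It models a cudaReadModeNormalizedFloat fetch as mapping the full unsigned 16-bit range onto [0.0, 1.0], and makes no claim about the exact rounding the hardware uses.

```cpp
#include <cstddef>
#include <cstdint>

// Host-side stand-ins for the CUDA vector types (hypothetical names).
struct UShort2 { std::uint16_t x, y; };
struct Float2  { float x, y; };

// Bytes per texel as stored in texture memory versus as delivered
// to the 32-bit destination registers.
constexpr std::size_t kStoredBytes   = sizeof(UShort2);
constexpr std::size_t kReturnedBytes = sizeof(Float2);

// Model of a normalized-float fetch: each unsigned 16-bit component is
// scaled so that 0 maps to 0.0 and 65535 maps to 1.0.
Float2 fetch_normalized(UShort2 texel) {
    return { texel.x / 65535.0f, texel.y / 65535.0f };
}
```

On typical platforms kStoredBytes is 4 and kReturnedBytes is 8, so the memory traffic is halved even though each component still lands in a full 32-bit register.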
