Hi,

I’m trying to optimize a CUDA kernel, and while examining the PTX output I noticed the following issue. A snippet from my CUDA kernel looks something like this:

[codebox]

texture<float, 2> tex_ref; // bound to a 2D array

float sum = 0;

unsigned tex_x = func1();

unsigned tex_y = func2();

sum += tex2D(tex_ref, tex_x, tex_y);

[/codebox]

This translates roughly into the following PTX code:

[codebox]

shl.b32 %r40, %r39, 4; //
shr.u32 %r41, %r40, 29; //
cvt.rn.f32.u32 %f1, %r41; // convert X coordinate to float
shl.b32 %r42, %r39, 16; //
shr.u32 %r43, %r42, 29; //
cvt.rn.f32.u32 %f2, %r43; // convert Y coordinate to float
mov.f32 %f3, 0f00000000; // Z=0
mov.f32 %f4, 0f00000000; // W=0
tex.2d.v4.f32.f32 {%f5,%f6,%f7,%f8},[_tex_ref,{%f1,%f2,%f3,%f4}]; // texture lookup with float coords

[/codebox]

Since I’m not using texture filtering or normalized coordinates, converting the texture coordinates to float seems pointless?

I tried manually changing the tex.2d call in the PTX code above into a texture lookup with integer coordinates (%r41 = x, %r43 = y):

tex.2d.v4.f32.s32 {%f5,%f6,%f7,%f8},[_tex_ref,{%r41,%r43,%r0,%r0}];

This seems to compile fine (but I haven’t verified that it actually works). Is there any way to tell nvcc to use integer coordinates automatically and not add unnecessary float conversions when doing 2D texture lookups? If I understand the CUDA guide correctly, int->float conversions are fast (4 cycles), but still, if they aren’t needed, I’d like to get rid of them.

Btw, for 1D textures in linear memory, it seems like nvcc uses integer coordinates.
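In case it helps anyone, one possible workaround I’m considering (just a sketch, not verified): bind the same data as a 1D texture on linear memory and use tex1Dfetch(), which takes an integer index, so no int->float conversion should be emitted. The kernel name, the out/width parameters, and the manual 2D-to-1D indexing are my own assumptions; func1()/func2() are the coordinate functions from the snippet above.

[codebox]

// Assumed: same float data bound to a 1D texture reference on linear memory.
texture<float, 1, cudaReadModeElementType> tex_ref_1d;

__global__ void lookup_kernel(float *out, int width)
{
    float sum = 0.0f;

    unsigned tex_x = func1();  // same coordinate computation as before
    unsigned tex_y = func2();

    // Manual 2D -> 1D index; 'width' is the row pitch in elements.
    // tex1Dfetch() takes an integer index, so no conversion to float.
    sum += tex1Dfetch(tex_ref_1d, tex_y * width + tex_x);

    out[threadIdx.x] = sum;
}

[/codebox]

The downside is losing the 2D caching behavior of the texture hardware, so whether this is actually faster would need measuring.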

/Lars