Avoiding unecessary int->float conversion in tex2D() ?

Hi,

I’m trying to optimize a CUDA kernel and while examining the PTX output I discovered the following issue. A snippet from my CUDA kernel looks something like this:

[codebox]

texture<float, 2> tex_ref; // bound to a 2D array

float sum = 0;

unsigned tex_x = func1();

unsigned tex_y = func2();

sum += tex2D(tex_ref, tex_x, tex_y);

[/codebox]

This translates roughly into the following PTX code:

[codebox]

    shl.b32         %r40, %r39, 4;          // 

    shr.u32         %r41, %r40, 29;         // 

    cvt.rn.f32.u32  %f1, %r41;              //   convert X coordinate to float 

    shl.b32         %r42, %r39, 16;         // 

    shr.u32         %r43, %r42, 29;         // 

    cvt.rn.f32.u32  %f2, %r43;               //  convert Y coordinate to float

    mov.f32         %f3, 0f00000000;        // Z=0

    mov.f32         %f4, 0f00000000;        // W=0

    tex.2d.v4.f32.f32 {%f5,%f6,%f7,%f8},[_tex_ref,{%f1,%f2,%f3,%f4}];   // texture lookup with float coords

[/codebox]

Since I’m not using any texture filtering or similar, it seems like converting the texture coordinates to float is pointless?

I tried to manually change the tex.2d call in the ptx code above into a texture lookup with integer coordinates (%r41=x, %r42=y):

tex.2d.v4.f32.s32 {%f5,%f6,%f7,%f8},[_tex_ref,{%r41,%r42,%r0,%r0}];

This seems to compile fine (But I haven’t tried if it actually works). Is there any way to tell nvcc to use integer coordinates automatically and not add unnecessary conversions to floats when doing 2d texture lookups? If I understand the CUDA guide correctly, int->float conversions are fast (4 cycles), but still, if they aren’t needed, I’d like to get rid of them.

Btw, for 1d textures in linear memory, it seems like nvcc is using integer coordinates.

/Lars

Inspect the CUDA header files to see whether there is a definition or macro for tex2D()
Maybe it’s just as simple as creating another definition that accepts ints instead of floats.

Christian

Good idea! I had a look at the headers (texture_fetch_functions.h), and the tex2D() macro seems to be using the device builtin function __ftexfetch(), which is not defined for integer coordinates. Maybe I didn’t dig deep enough, but I wasn’t able to find out how to tweak the macros to support int coordinates. Instead, I decided to try modifying the ptx to see if it actually works. I started with the following simple kernel:

[codebox]

texture<float, 2> test_tex;

extern “C”

global void test_kernel(float *dst)

{

dst[threadIdx.x] = tex2D(test_tex, threadIdx.x, 1);

}

[/codebox]

which reads floats from a 64x64 2D texture and writes them to global memory.

This is the slightly modified ptx code for the kernel above:

[codebox]

.entry test_kernel

{

.reg .s32 %r<5>;

.reg .u64 %rd<6>;

.reg .f32 %f<11>;

.param .u64 __cudaparm_test_kernel_dst;

.loc    15  5   0

$LBB1_test_kernel:

cvt.s32.u16     %r1, %tid.x;        //

// ORIGINAL CODE: 2d texture lookup with float coordinates

// cvt.rn.f32.u32 %f1, %r1; //

// mov.f32 %f2, 0f3f800000; // 1

// mov.f32 %f3, 0f00000000; // 0

// mov.f32 %f4, 0f00000000; // 0

// tex.2d.v4.f32.f32 {%f5,%f6,%f7,%f8},[test_tex,{%f1,%f2,%f3,%f4}];

// NEW CODE: 2d texture lookup with integer coordinates

mov.s32 %r2, 2; // Y

mov.s32     %r3, 0;                 // Z,W

tex.2d.v4.f32.s32 {%f5,%f6,%f7,%f8},[test_tex,{%r1,%r2,%r3,%r3}];

mov.f32 %f9, %f5; //

ld.param.u64    %rd1, [__cudaparm_test_kernel_dst]; // id:24 __cudaparm_test_kernel_dst+0x0

cvt.u64.s32     %rd2, %r1;          // 

mul.lo.u64  %rd3, %rd2, 4;      // 

add.u64     %rd4, %rd1, %rd3;       // 

st.global.f32   [%rd4+0], %f9;  // id:25

exit;                           // 

$LDWend_test_kernel:

} // test_kernel

[/codebox]

I commented out the original, float based code and modified the kernel to pass integer coordinates to tex.2d.v4… Indeed, it does seem to work, with the exception that none of the addressing modes CLAMP, WRAP and MIRROR behaves as expected with int coordinates. A texture access for any int coordinates outside [0,63] simply returns 0.

I don’t know if getting rid of a few cvt instructions would ever make any noticable difference in reality, but it would be nice to be able to instruct nvcc to support it, just to be able to give it a try. It would get rid of quite a few instructions in my kernels.

/Lars