putting a large array of cudaTextureObject_t into a texture?

Hi everyone,

our next radio simulation kernel will require an array of several thousand CUDA texture objects (we’ve currently set the limit to 10000). At 8 bytes per object that is 80 KB, which won’t fit into the 64 KB of constant memory, and hence it cannot be passed as a kernel launch argument.

I was wondering if there is a good way of putting this array itself into a texture to speed up access. We’ll have several consecutive threads access the same texture object - hence the texture cache should help a lot.

The problem is that a cudaTextureObject_t is typedef’d as unsigned long long - so how does one create a texture capable of storing such a behemoth object? Split it into lo and hi parts maybe and use a two channel integer texture? Are there 32 bit integer textures even? Or maybe we could use a 2 channel 32bit float texture and use the __float_as_int() intrinsic followed by reassembly of lo and hi parts into a 64 bit integer… hmm…

Any thoughts?

Declare the texture as ‘int2’, then reinterpret the ‘int2’ as a ‘long long’. That is the same technique used to store ‘double’ data in textures. For the ‘double’ case one can use __hiloint2double() for the reinterpretation. No such predefined intrinsic exists for reading an ‘int2’ as a ‘long long’, but you could either use __double_as_longlong(__hiloint2double(hi, lo)) or create your own function with a tiny bit of inline PTX using the mov.b64 instruction.
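To make the intrinsic-based variant concrete, here is a minimal sketch of a round trip: 64-bit handles are stored in linear memory, bound to an ‘int2’ texture object, and reassembled on the device. The names fetch_handle, copy_handles and roundtrip_ok are made up for this illustration, and the stored values are stand-ins rather than real texture objects. It also assumes the usual little-endian layout, so .x holds the low word and .y the high word.

```cuda
#include <cuda_runtime.h>
#include <vector>

static_assert(sizeof(cudaTextureObject_t) == sizeof(int2), "handle must be 64 bits");

// Hypothetical helper: read entry i of the handle table through the texture path.
// Assumes little-endian storage, i.e. .x = low word, .y = high word.
__device__ __forceinline__ cudaTextureObject_t fetch_handle(cudaTextureObject_t table, int i)
{
    int2 v = tex1Dfetch<int2>(table, i);
    // __hiloint2double takes (hi, lo); both intrinsics only reinterpret bits.
    return (cudaTextureObject_t)__double_as_longlong(__hiloint2double(v.y, v.x));
}

// Hypothetical test kernel: copy every reassembled handle to a plain output array.
__global__ void copy_handles(cudaTextureObject_t table, unsigned long long *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fetch_handle(table, i);
}

// Host-side round trip; returns true if all values survive unchanged.
bool roundtrip_ok()
{
    const int n = 10000;
    const size_t bytes = n * sizeof(unsigned long long);
    std::vector<unsigned long long> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; ++i)
        h_in[i] = 0x0123456789ABCDEFull + i * 0x100000001ull;  // stand-in values

    unsigned long long *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Describe the linear buffer as a texture of two 32-bit signed integers.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_in;
    res.res.linear.desc = cudaCreateChannelDesc<int2>();
    res.res.linear.sizeInBytes = bytes;

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;  // raw integers, no normalization

    cudaTextureObject_t table = 0;
    bool ok = cudaCreateTextureObject(&table, &res, &td, nullptr) == cudaSuccess;
    if (ok) {
        copy_handles<<<(n + 255) / 256, 256>>>(table, d_out, n);
        ok = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        cudaDestroyTextureObject(table);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok && h_in == h_out;
}
```

In a real kernel the value returned by fetch_handle would be passed on to tex2D<...>() or a similar fetch instead of being copied out. Whether the extra indirection through the texture cache pays off compared with a plain global memory load is something to measure on the target GPU.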

You posted this helpful snippet earlier in an answer to a related Stack Overflow question:

__forceinline__ __device__ long long int int2_as_longlong (int2 a)
{
    long long int res;
    asm ("mov.b64 %0, {%1,%2};" : "=l"(res) : "r"(a.x), "r"(a.y));
    return res;
}