texunit numbering in the .cubin

I have implemented some fairly complex kernels and run into a problem with the numbering of texture units (texunits).

My code uses templates and instantiates my kernels on a variety of data types.
As I have been unable to pass textures of the corresponding format as arguments to the kernels, I have instead declared a lot of texture<'type', 1> variables (37, to be precise) globally in my .cu file. Using macros based on another template argument, I can then pass the correct texture variable name to tex1Dfetch() without any overhead.
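The dispatch trick described above might look roughly like this (a sketch only; the names, slot counts, and kernel are hypothetical). Since token pasting cannot see the value of a template parameter, one common approach is to use a macro to stamp out a trait specialization per texture reference, and let the kernel select through the trait at compile time:

```cuda
// Hypothetical sketch of per-slot texture dispatch with the legacy
// texture-reference API. Texture references must be file-scope globals.
template <int SLOT> struct TexFetch;

#define DEFINE_TEX_SLOT(slot)                                         \
    texture<float, 1, cudaReadModeElementType> tex_float_##slot;      \
    template <> struct TexFetch<slot> {                               \
        static __device__ float get(int i)                            \
        { return tex1Dfetch(tex_float_##slot, i); }                   \
    };

DEFINE_TEX_SLOT(0)
DEFINE_TEX_SLOT(1)
// ... one expansion per global texture reference ...

template <int SLOT>
__global__ void scaleKernel(float *out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * TexFetch<SLOT>::get(i);  // resolved at compile time
}
```

Because the slot is a compile-time constant, each instantiation of the kernel hard-codes its texture reference, with no runtime indirection.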

This compiles into many different kernels in the .cubin, but none of them accesses more than roughly eight textures each. Unfortunately, the 37 textures are referenced as texunit 0 through texunit 36 at the start of the .cubin. Consequently, the textures numbered above 31 cannot be read in the kernel: as far as I can tell they return 0, whereas everything works in emulation mode.

For now I can get by reordering my textures, since not all of them are actually used yet. But this will eventually cause problems, so could anyone suggest a good solution?

Otherwise, I am really impressed with the power of template programming available in CUDA. No more coding specifically for 1D, 2D, 3D, and 4D. Thanks NVIDIA!



You can group textures by type, i.e. define n_i textures for type i, where n_i is the maximum number of textures of type i used simultaneously. Normally this results in far fewer total textures. Then, with some even more complex template/macro fiddling, you can work out the texture ID in your kernels.
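A rough sketch of this per-type pooling (names and pool sizes are invented for illustration): each element type gets only as many texture references as are ever bound simultaneously, and trait specializations map a (type, index) pair to the right global:

```cuda
// Hypothetical per-type texture pools, sized by maximum concurrent use
// (here: at most two float textures and one int texture per kernel).
texture<float, 1, cudaReadModeElementType> tex_float_pool_0;
texture<float, 1, cudaReadModeElementType> tex_float_pool_1;
texture<int,   1, cudaReadModeElementType> tex_int_pool_0;

// Trait mapping (element type, pool index) -> the matching reference:
template <typename T, int N> struct TexPool;
template <> struct TexPool<float, 0> {
    static __device__ float fetch(int i) { return tex1Dfetch(tex_float_pool_0, i); }
};
template <> struct TexPool<float, 1> {
    static __device__ float fetch(int i) { return tex1Dfetch(tex_float_pool_1, i); }
};
template <> struct TexPool<int, 0> {
    static __device__ int fetch(int i) { return tex1Dfetch(tex_int_pool_0, i); }
};

// A kernel instantiated on T reads TexPool<T, N>::fetch(i), so the total
// number of texture references stays within the hardware limit.
```

The host side then binds whichever data a given launch needs into that type's small pool before the launch, instead of keeping 37 distinct references alive.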

Isn’t the Driver API designed for this sort of thing?

You can compile your .cu kernels into individual cubins and load the cubins separately. The texture references restart for each cubin, and there are functions to map and manage those as well.
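A minimal Driver API sketch of this, assuming a cubin named kernels_float.cubin containing a kernel scaleKernel and a texture reference tex_float_0 (all hypothetical names; error checking omitted):

```cuda
#include <cuda.h>

// Each cubin is loaded as its own module, so its texture references
// are numbered independently of any other module's.
CUmodule   mod;
CUfunction kern;
CUtexref   texref;

cuModuleLoad(&mod, "kernels_float.cubin");
cuModuleGetFunction(&kern, mod, "scaleKernel");
cuModuleGetTexRef(&texref, mod, "tex_float_0");

// Bind linear device memory (d_data, nbytes) to the module's reference:
cuTexRefSetFormat(texref, CU_AD_FORMAT_FLOAT, 1);
cuTexRefSetAddress(NULL, texref, d_data, nbytes);
```

Splitting kernels across modules this way keeps each module's texunit count well under the limit, at the cost of managing the modules and bindings by hand.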

Of course, I think the Runtime API should not have had this problem to begin with, but the Driver API is there for when automagicality lets you down.

Thanks. I guess loading cubins individually would be a solution. It would still be nice if the standard approach were just a bit more intelligent. Does anyone from NVIDIA have any comments?