Texture cache API, implications for multi-GPU

I tried to get a deeper understanding of how to make use of texture memory in CUDA, but the more I look at the documentation and into the header files, the more confused I get (getting it to work somehow is not the problem).

(1) Although there are some “low level” C API functions, it does not seem to be possible to get around some C++ API calls. Even the example in the Programming Guide (which also contains some errors…), although it claims to use the C API, first instantiates a texture<> template and then retrieves a textureReference* through some obscure function via the texture’s variable name.

(2) Another post in this forum indicated that it is necessary to put a texture<> object in the global namespace to make it accessible from a kernel; nvcc apparently does some magic behind the scenes (which might also be related to the C API name lookup mentioned in (1)). How does this work in the context of multi-GPU computing? Assume I have a kernel that makes use of texture memory and should be executed on several GPUs. As each (CUDA) host thread needs its own texture<>, I would also need different kernels, each of which references another texture<>.

(3) The usage of struct cudaChannelFormatDesc looks totally crazy to me. There is a struct cudaChannelFormatDesc member in struct textureReference (a base of texture<>), but in all examples another object is created temporarily. Obviously, most parameters (texture size, pitch) are configured neither through this temporary nor through the member of the textureReference base of texture<>, but are passed separately to the bind function (which might in turn change the textureReference’s members).
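To make (1)–(3) concrete, here is roughly the pattern I am talking about, sketched after the Programming Guide example (the names readKernel and bindInput are mine):

#include <cuda_runtime.h>

// File-scope texture reference so that the kernel can see it, see (2).
texture<float, 2, cudaReadModeElementType> texRef;

__global__ void readKernel(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
}

void bindInput(float *devPtr, size_t pitch, int width, int height)
{
    // A temporary channel descriptor, although textureReference already
    // has a channelDesc member, see (3).
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    // Size and pitch are not stored anywhere beforehand; they go straight
    // into the bind call.
    cudaBindTexture2D(NULL, texRef, devPtr, desc, width, height, pitch);
}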

So based on this my questions are:

(A) Is it possible to use texture memory through the C API only, perhaps by some undocumented approach?

(B) Is putting a texture<> into the scope of the kernel implementation file the only possibility? How can one run a kernel on multiple GPUs without artificially creating multiple texture objects and associated, otherwise identical kernel versions at compile time?

(C) Is there any sense in using the additional struct cudaChannelFormatDesc? Could one also pass a pointer to the member of the textureReference (or in this case texture<>) instead?

I heard there is already a CUDA 3.0 beta out for registered developers; will things get better there?

Best regards,
Markus

Of course. The C++ API is just template syntactic sugar on top of the C API. You might check the CUDA reference manual for the exact specifications of the C API texture functions. I like the syntactic sugar, so I’ve never done it the hard way.
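Paraphrased from memory (so check cuda_texture_types.h for the exact text), the texture<> template does little more than this:

// Paraphrased sketch of the header: texture<> derives from textureReference
// and merely fills in defaults in its constructor.
template<class T, int dim = 1,
         enum cudaTextureReadMode readMode = cudaReadModeElementType>
struct texture : public textureReference
{
    __host__ texture(int norm = 0,
                     enum cudaTextureFilterMode  fMode = cudaFilterModePoint,
                     enum cudaTextureAddressMode aMode = cudaAddressModeClamp)
    {
        normalized     = norm;
        filterMode     = fMode;
        addressMode[0] = aMode;
        addressMode[1] = aMode;
        addressMode[2] = aMode;
        channelDesc    = cudaCreateChannelDesc<T>();
    }
};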

Each CUDA context gets its own instantiation of each texture reference you declare (think automatic thread-local storage). Thus a texture reference is automatically independent among multiple host threads. You must bind it in each thread for use in independent kernel launches across multiple GPUs.
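In other words, the per-GPU setup can look roughly like this (a sketch, assuming a file-scope texture<float, 2, cudaReadModeElementType> texRef as in your example and one host thread per device):

#include <cuda_runtime.h>

// Called once per host thread, one host thread per GPU.
void bindAndRun(int device, int width, int height)
{
    cudaSetDevice(device);                  // select this thread's context/GPU

    float *devPtr = 0;
    size_t pitch  = 0;
    cudaMallocPitch((void **)&devPtr, &pitch, width * sizeof(float), height);
    // ... fill devPtr ...

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    // Same texRef name in every thread, but the binding is local to this
    // thread's context.
    cudaBindTexture2D(NULL, texRef, devPtr, desc, width, height, pitch);

    // ... launch the kernel, copy results back ...

    cudaUnbindTexture(texRef);
    cudaFree(devPtr);
}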

It is open to the public; download it from the CUDA 3.0 beta announcement forum post. There are no changes in the way texture references are handled, at least none that I noticed.

Sorry, I knew there was a 3.0 beta, but didn’t know it was publicly available.

I might be blind, but I don’t find anything regarding device code in the reference manual. As long as I need a global texture<> object on the kernel side, I also need one on the host side.

Please let me clarify: I was only talking about the C++ object. You are certainly right that cudaBindTexture (or any derivative) creates a thread-/GPU-local binding of the device memory area to a texture. My considerations are more about race conditions before actually performing that binding.

I used the nvcc -cuda option to have a look at the actual .cpp code that g++ gets:

The global texture reference really becomes a static object at global scope, and must consequently be defined together with the kernel and its calling host routine in a single file.

static texture< float, 2, cudaReadModeElementType> texref;

Now consider two host threads that want to execute a kernel using texref on “their” GPU. To configure the addressing modes (which are run-time configurable and should therefore be allowed to differ between the threads), they have to modify the texture<> object accordingly before performing the bind call to the runtime. So at least this is a critical operation. For other parameters a separate struct cudaChannelFormatDesc (obtained via cudaCreateChannelDesc()) is used (although there is also one as a member of textureReference), so this is not problematic.
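In code, the sequence I am worried about looks roughly like this (a hypothetical helper, assumed to live in the same .cu file as texref):

// Each host thread would call this with its own device pointer and its own
// addressing mode before launching the kernel on "its" GPU.
void configureAndBind(float *devPtr, size_t pitch, int w, int h,
                      cudaTextureAddressMode mode)
{
    texref.addressMode[0] = mode;   // writes to the one shared global object
    texref.addressMode[1] = mode;

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaBindTexture2D(NULL, texref, devPtr, desc, w, h, pitch);
}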

I really don’t understand why the struct cudaChannelFormatDesc member channelDesc of textureReference is not used. Neither the texture<> nor the struct cudaChannelFormatDesc argument of cudaBindTexture seems to be modified (at least the prototypes use pointers/references to const).
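For reference, the low-level prototype looks roughly like this (quoted from memory, so please check cuda_runtime_api.h for the exact form):

extern cudaError_t cudaBindTexture2D(size_t *offset,
                                     const struct textureReference *texref,
                                     const void *devPtr,
                                     const struct cudaChannelFormatDesc *desc,
                                     size_t width, size_t height, size_t pitch);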

As I see it now, calling a kernel that uses the texture cache on two GPUs in the same process should be possible, but setting up the binding parameters is partially a critical operation and must be protected by the programmer, typically with a mutex. Does this sound reasonable, or does someone disagree?
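To make that last point concrete, the protection I have in mind is nothing more than this (a sketch using pthreads and the hypothetical configureAndBind() helper from above):

#include <pthread.h>

static pthread_mutex_t texrefLock = PTHREAD_MUTEX_INITIALIZER;

void configureAndBindLocked(float *devPtr, size_t pitch, int w, int h,
                            cudaTextureAddressMode mode)
{
    pthread_mutex_lock(&texrefLock);   // serialize access to the global texref
    configureAndBind(devPtr, pitch, w, h, mode);
    pthread_mutex_unlock(&texrefLock);
}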