Weird behaviour while using textures: tex1Dfetch error

Hi,

I am running a kernel which accesses memory through textures, on two GPUs. I am using the standard way of using textures, that is, declaring the texture as a global variable and then doing a tex1Dfetch in the kernel. Since I am using multiple GPUs, and therefore multiple host threads, and the texture is a global variable, this leads to race conditions.
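
For reference, a minimal sketch of that standard pattern (the kernel and buffer names here are simplified stand-ins for my real code):

texture<int, 1, cudaReadModeElementType> tex;   // global texture reference

__global__ void readKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i);   // fetch through the global texture
}

void launch(int *d_in, int *d_out, int n)
{
    cudaBindTexture(0, tex, d_in, n * sizeof(int));  // bind device memory to the texture
    readKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(tex);
}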

Since I don’t know a way to do this without using a global variable, I declared an array of textures, which I index using a thread-local variable. Therefore the declaration

texture<int, 1, cudaReadModeElementType> tex;

became

texture<int, 1, cudaReadModeElementType> tex[2];

and the access changed from

tex1Dfetch(tex,p)

to

tex1Dfetch(tex[current_thread],p)

The problem is that this does not compile. I am getting the following error:

error: a value of type "_Z7textureIiLi1EL19cudaTextureReadMode0EE" cannot be assigned to an entity of type "__i1texture"

  iBeg = (((__T237 = ((tex[current_thread]))) , ((__T239 = (__itexfetchi(__T237, ((((((__T238.x) = i) , ((__T238.y) = 0)) , ((__T238.z) = 0)) , ((__T238.w) = 0)) , __T238)))) , ((__T239.x)))) - 1);

Thinking that there might be some weird problem with using arrays of textures, I put the texture in a structure and then indexed that with the thread index. But I am getting the same error …

Does anyone have an idea how to solve this? Or maybe a way to use textures without using a global variable?

Thanks,

Serban

I agree that the way textures are handled is very annoying, though I can think of two ways that I would expect to solve it:

  1. Have a separate CUDA context for each thread (I never tried that, but I would imagine it works)

  2. Use a global lock for each texture and do (see the code sketch after the steps):

acquire lock

bind texture

call kernel

release lock
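
In code, that would be something like the following (untested; pthreads and the kernel name are just my assumptions):

#include <pthread.h>
#include <cuda_runtime.h>

texture<int, 1, cudaReadModeElementType> tex;   // the shared global texture
pthread_mutex_t tex_lock = PTHREAD_MUTEX_INITIALIZER;

__global__ void myKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i);
}

void run(int *d_in, int *d_out, int n)
{
    pthread_mutex_lock(&tex_lock);                   // acquire lock
    cudaBindTexture(0, tex, d_in, n * sizeof(int));  // bind texture
    myKernel<<<(n + 255) / 256, 256>>>(d_out, n);    // call kernel (the launch returns asynchronously)
    pthread_mutex_unlock(&tex_lock);                 // release lock
}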

Yes, that reduces parallelism but should not matter since the hardware can execute only one kernel at a time currently anyway.

I have not tested any of this, but I would be interested to hear if any of it works. While I do not strictly need it for my project, I consider non-threadsafe code (especially if only for silly reasons) an abomination that really should be avoided.

Reimar, thanks for your quick reply.

I think I already have two contexts, since the first thing each thread does after being created is calling CudaSetDevice for its corresponding GPU.
I don’t think the second one would work either, since I am using 2 GPUs, not one, so I can execute two kernels at the same time. Using such a lock would eliminate all the benefit of using more than one GPU.
Very annoying indeed …
It would be so much easier if I could somehow get rid of the global variable.

Well, usually it should not, since the lock would only be held for the time it takes to bind the texture and launch the kernel. I do not know if it would work, though; how CUDA handles textures is a complete mystery to me, and some better documentation would be truly helpful.

But even if you cannot use arrays, you can of course still use multiple textures and a sequence of if/else to select the right one. To reduce overhead you could also have multiple kernels, one for each such texture. I do not know what the best way to do this without code duplication is; a macro would certainly work but is ugly (see the sketch below), and if you are lucky, “static inline __device__” functions work with texture arguments (since they are always inlined), though __global__ functions certainly do not.
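
For example, the macro version might look like this (untested; the kernel body is just an illustration):

texture<int, 1, cudaReadModeElementType> tex0;
texture<int, 1, cudaReadModeElementType> tex1;

// One kernel per texture reference, generated from a single macro.
#define DEFINE_FETCH_KERNEL(name, texref)              \
    __global__ void name(int *out, int n)              \
    {                                                  \
        int i = blockIdx.x * blockDim.x + threadIdx.x; \
        if (i < n)                                     \
            out[i] = tex1Dfetch(texref, i);            \
    }

DEFINE_FETCH_KERNEL(fetchKernel0, tex0)
DEFINE_FETCH_KERNEL(fetchKernel1, tex1)

// On the host, each thread launches the kernel matching its texture.
void launch(int gpu, int *d_out, int n)
{
    if (gpu == 0)
        fetchKernel0<<<(n + 255) / 256, 256>>>(d_out, n);
    else
        fetchKernel1<<<(n + 255) / 256, 256>>>(d_out, n);
}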

Just tried the if/else version, and of course it works. I had also thought of that, but I want to keep it as a last resort … I have many kernels, and if each one needs versions for 1/2/3/4 GPUs and I have if/elses everywhere … it makes me sick just thinking about it.

Trying now to use the lock, but I have a feeling that you cannot release the lock until the kernel finishes execution, since texture fetches are done throughout the execution of the kernel, and bad things may happen if I change the binding in the middle.

I have finished trying the locks.
As I was expecting, it only works if the unlock() comes after calling the kernel, so that the release happens after all execution is over. I tried releasing the lock from inside the kernel, so that I would have only the launch in the critical section, but that does not work.
I would try the static inline __device__ thing, but I don’t know where to put it. I still need to launch the kernel, which must be __global__.

Uh, yes, of course it has to be after calling the kernel, but usually (unless you have already submitted about 16 kernels that have not yet finished) the kernel call will return immediately. Only cudaThreadSynchronize, cudaMemcpy, etc. will wait for the execution to finish.
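
To illustrate the timing (the kernel name is made up):

__global__ void myKernel(int *out, int n) { /* ... work ... */ }

void demo(int *d_out, int n)
{
    // The launch is only enqueued; this call returns almost immediately.
    myKernel<<<(n + 255) / 256, 256>>>(d_out, n);

    // Host code here runs while the kernel executes on the GPU.

    // Only this call (or an implicit synchronization such as cudaMemcpy)
    // blocks until the kernel has actually finished.
    cudaThreadSynchronize();
}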

Well, what I thought of is something like this (you should probably use C++ templates for this):

static inline __device__ float2 mytex1D(texture<float2, 1> tex0, texture<float2, 1> tex1, ..., float pos, int gpu) {
  switch (gpu) {
    case 0:
      return tex1D(tex0, pos);
    case 1:
      return tex1D(tex1, pos);
    ...
  }
}

and use that instead of the ordinary tex1D. Saves you from splattering the texture selection logic all over the place.
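
For instance, with two GPUs the call site might look like this (assuming current_thread holds the per-thread GPU index, as in your code):

float2 v = mytex1D(tex0, tex1, pos, current_thread);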

Why are you certain that you need this? The global variable “tex” represents different textures in different CUDA contexts. No locking required. Just cudaBindTexture the same global variable tex to different device memory in each host thread. The CUDA driver knows which context it is in and will handle everything correctly.
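
That is, each worker thread can do something like this (untested sketch using pthreads; the names are made up):

#include <pthread.h>
#include <cuda_runtime.h>

texture<int, 1, cudaReadModeElementType> tex;   // one shared global texture reference

__global__ void readKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i);
}

void *worker(void *arg)
{
    int gpu = *(int *)arg;
    cudaSetDevice(gpu);   // first CUDA call in this thread creates its own context

    int n = 1024;
    int *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(int));

    // "tex" names a different texture in each context, so no lock is
    // needed even though the variable itself is shared between threads.
    cudaBindTexture(0, tex, d_in, n * sizeof(int));
    readKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaThreadSynchronize();
    return 0;
}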

Well, this is what I thought in the beginning, but almost all the time one of the two kernels, sometimes the one on the first GPU, sometimes the one on the second, failed with a “cannot fetch from a texture that is not bound” error. Since inserting a delay between the starts of the threads eliminated the problem, as did using two different texture variables instead of one, I concluded that this must be the cause.

After reading some older posts related to arrays of texture references, I understand that one cannot do much with textures, like passing pointers to them around and so on. So I started to implement the solution suggested by Reimar, basically a big if/else. Disgusting, but the only one working so far …

One thing I want to check is that I actually have two contexts and not one. I am starting the application and doing one CUT_DEVICE_INIT() in the main thread. Then I spawn two threads, and each one does a CukrSetDevice(gpu_id). Does this generate two contexts?

Sorry to have wasted your time.
After further debugging, I found out the error was actually caused by a bug in another place in the program, which led to one of the GPUs not being properly initialized (not always, only from time to time). Now that I have solved that problem, all seems to be OK.
Multi-threaded programming is difficult …

This should generate 3 contexts: one in the main thread and one in each of the spawned threads. Just for fun, try taking out the CUT_DEVICE_INIT in the main thread (and remove any CUDA calls from the main thread). The first CUDA call in any thread will create a context in that thread.

I’m not sure what would be causing the behavior you see with the “not bound” error. I’ll try a few tests here and see if I can reproduce it.

Half-ignore my previous post; it’s only half true. Here is how I think it actually works.

When compiling in emulation mode, which is what I was doing all this time while I kept getting errors, there is probably only one CUDA context or something like that, so the “cannot fetch from a texture that is not bound” error keeps coming. I get the same result whether I call cut_init from the main thread only, from the worker threads only, or from all of them. That’s why I have the feeling that only one context is actually created. The only way I could get around this (besides locks) is a big ugly if/else which chooses a different texture for different threads.

When compiling in device mode, all seems to be OK, again regardless of where CUT_DEVICE_INIT is called. So for convenience, I call init from the main thread and then set the device in the worker threads. At first I could not see this because of a different problem.

MisterAnderson42, I wonder if you get the same behavior.