Texture latency and strange sync behavior

Hi,

just found some strange behavior in CUDA 0.9. Maybe it’s a bug, but I’m not sure. It’s this (pseudo) code fragment:

__shared__ float buffer[BLOCKDIM];

// lots of stuff

__syncthreads();
buffer[threadIdx.x] = tex1Dfetch(myTex, position + threadIdx.x);
__syncthreads();

// some other stuff: inits, mem reads...

for (...)
{
    a = buffer[j];
    b = buffer[j+1];
    ...
}

This worked fine while there were some local memory reads in the “other stuff” (introducing some latency). When I changed the “other stuff” to registers to reduce latency at this point, the buffer showed strange behavior. The elements read in the first loop iteration are OK, but the following iterations get constantly changing garbage. It looks like the buffer isn’t completely initialized and the syncs don’t work, yet they are in the PTX and there is no weird compiler reordering. I found two ways to fix it:

  1. put a __syncthreads() in the for loop (before or after the buffer reads, both work); a sketch of this follows below

  2. fill the buffer from global memory instead of the 1D texture.
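A minimal sketch of fix 1, with placeholder names (BLOCKDIM, MAX, doSomethingWith) standing in for the real kernel, not the actual code:

__shared__ float buffer[BLOCKDIM];

buffer[threadIdx.x] = tex1Dfetch(myTex, position + threadIdx.x);
__syncthreads();                 // buffer should now be visible to all threads

int j = 0;
for (int k = 0; k < MAX; k++)
{
    float a = buffer[j++];       // all threads read the same element (broadcast)
    float b = buffer[j++];
    doSomethingWith(a, b);
    __syncthreads();             // fix 1: extra barrier inside the loop
}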

To me this looks very much like a bug. Or maybe I don’t understand this parallel, coalesced data fetch & cache thing at all :( But the texture read latency apparently can’t be hidden with a simple sync; it seems you need some memory latency or several syncs in between to get rid of it.

Has someone found similar problems with texture reads?

Could be I don’t completely understand it, but are you trying to both write to and read from a texture in the same program?

Hi Wumpus,

no, I try to read a part of my data from the texture into a shared buffer. Every thread reads one array item, so it should be a coalesced read. In the for-loop every thread reads the same array elements (a one-to-n read should be fine with shared memory: there is no bank conflict as long as all threads read the same address, it becomes a broadcast). Every thread then calculates something with its own parameters, but they all work on the same data.

This all works quite well if I use a global memory array. The texture should help with its caching if the same items are read several times…

Greetings Knax

Very sparse information. Can you provide more code?

buffer[threadIdx.x]=tex1Dfetch(myTex, position+threadIdx.x);

So blockDim.y == 1, i.e. a 1D block?

    a=buffer[j];

     b=buffer[j+1];

How is j computed?

Peter

Hi, I just found out that reading from the shared buffer is the problem, but I don’t understand why.

I kept the snippet to the problematic parts; the kernel is too cluttered to post.

No, it’s 2D. I use this all the time:

int tid = threadIdx.x + (blockDim.x * threadIdx.y);
...
buffer[tid] = tex1Dfetch(myTex, position + tid);
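For context, the launch uses a 2D block, roughly like this (e.g. 8×4 = 32 threads as mentioned further down; grid size, kernel name and arguments are placeholders):

dim3 block(8, 4);                    // 2D block, 8*4 = 32 threads = one warp
dim3 grid(numBlocks);                // numBlocks is a placeholder
myKernel<<<grid, block>>>(d_data);   // placeholder kernel name and argument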

It works more like this:

int j = 0;
for (int k = 0; k < MAX; k++)
{
   a = buffer[j++];
   b = buffer[j++];
   ...
   doSomethingWith(a, b, ...);
}

The problem also came back when using a global memory array. But it works this way:

int pos;
for (int k = 0; k < MAX; k++)
{
   pos = SIZE * k;
   a = buffer[pos + OFFSET_A];
   b = buffer[pos + OFFSET_B];
   ...
}

I don’t understand why this helps. I used an 8×8 block before and now use only 8×4, so 32 threads in the block, only one warp?

All 32 threads should execute the same instruction, so all 32 read the same shared memory position, like in Figure 5-4 (left side, p. 52) of the CUDA Programming Guide 0.9. Or am I wrong?

Many thanks

Knax

Just looked into the PTX to see the difference between the two versions:

with post increment:

ld.shared.f32  $f72, [$r142+0];
ld.shared.f32  $f73, [$r142+4];
[...]
ld.shared.f32  $f79, [$r142+28];
add.u32  $r142, $r142, 36;
ld.shared.f32  $f80, [$r142+-4];

with constant offset:

ld.shared.f32  $f72, [$r144+0];
ld.shared.f32  $f73, [$r144+4];
[...]
ld.shared.f32  $f79, [$r144+28];
ld.shared.f32  $f80, [$r144+32];

This “shift the base address and then use a negative offset” thing the compiler does is a little odd, but I don’t understand why it would hurt. :wacko:
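In C-like terms (just my reading of the PTX above, simplified to three of the nine loads per iteration), the post-increment version advances the base pointer once per iteration and reaches the last element through a negative offset:

float *p = buffer;                // $r142 starts at the buffer base
for (int k = 0; k < MAX; k++)
{
    float a = p[0];               // ld.shared.f32 [$r142+0]
    float b = p[1];               // ld.shared.f32 [$r142+4]
    p += 9;                       // add.u32 $r142, $r142, 36  (36 bytes = 9 floats)
    float c = p[-1];              // ld.shared.f32 [$r142+-4]
    doSomethingWith(a, b, c);
}

The constant-offset version addresses everything with positive immediate offsets from $r144 and shows no update of the base register between the loads.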

Bye

Knax

We have also found a similar texture access problem using 0.9. I’m not sure this was a problem in 0.8, but this is our pseudo-code snippet…

#define  DELAY_CNT 100

__global__ void Kernel(int16 *piOut)
{
    ...
    for (uX = 0; uX < 16; uX++)
    {
        for (uWaste = 0; uWaste < DELAY_CNT; uWaste++)
            temp += 1;

        piOut[uBase + uX] = tex2D(m_pInYTx, iX+uX, iY);
        ...
    }
}

The little uWaste loop makes it work properly. If we don’t have the uWaste loop, then we get varying results reading from the texture:

  1. Sometimes we get a CUDA_ERROR_LAUNCH_FAILED

  2. Sometimes consistent erroneous data read from the texture (the correct data in the wrong position)

  3. Sometimes the stars align, and everything works as it should!

I don’t feel like going back to 0.8, but we don’t think we saw it there.

We are fairly confident it is timing related, as we can change DELAY_CNT and get varying results. Using __syncthreads() gives us either effect #2 or #3 described above, but the kernel always launches and finishes properly.

Any ideas?

Hi, I don’t fully understand your kernel configuration. Is uBase calculated from the thread and block IDs? Does every thread copy 16 texels to global memory? I would not expect that to be a coalesced write.
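For a coalesced write I would expect something roughly like this (just a sketch, reusing piOut, uBase, m_pInYTx, iX and iY from your snippet, with one texel per thread instead of 16):

int x = blockIdx.x * blockDim.x + threadIdx.x;   // one output element per thread
piOut[uBase + x] = tex2D(m_pInYTx, iX + x, iY);  // consecutive threads write consecutive addresses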

Another point about texture access in CUDA that is quite unclear to me: there are fewer texture units per multiprocessor than ALUs, aren’t there? So if all threads issue a texture read at the same instruction, there should be some latency, even if every texel is already in the texture cache?

The pseudo code was a simple demonstration of what the problem is; the other performance issues are known and not really important until texture reads work properly in the simple case.

I have also verified that this is not a problem with 0.8 and is new in 0.9. I’ll try to submit an official bug report.