I just found some strange behavior in CUDA 0.9. Maybe it's a bug,
but I’m not sure. It’s this (pseudo) code fragment:
__shared__ float buffer[BLOCKDIM];

// lots of stuff
__syncthreads();
buffer[threadIdx.x] = tex1Dfetch(myTex, position + threadIdx.x);
__syncthreads();

// some other stuff: inits, mem reads...
for (...)
{
    a = buffer[j];
    b = buffer[j + 1];
    ...
}
This worked fine while there were some local mem reads in the "other stuff" (introducing some latency). When I changed my "other stuff" to use registers to reduce the latency at this point, the buffer showed strange behavior: the elements read in the first loop iteration are OK, but the following iterations return constantly changing garbage. It looks like the buffer isn't completely initialized and the syncs don't work. But they are present in the PTX, so there is no weird compiler reordering. I found two ways to fix it (a sketch of the first one is below):
- put a sync in the for loop (before or after reading the buffer, both work)
- use global memory instead of the 1D texture to fill the buffer
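For what it's worth, a minimal sketch of the first workaround, assuming the texture reference API and a kernel shaped like the pseudo code above; BLOCKDIM's value, demoKernel, and the per-iteration math are placeholders, not my real code:

#define BLOCKDIM 256

texture<float, 1, cudaReadModeElementType> myTex;

__global__ void demoKernel(float *out, int position)
{
    __shared__ float buffer[BLOCKDIM];

    // each thread fetches one texel into shared memory
    buffer[threadIdx.x] = tex1Dfetch(myTex, position + threadIdx.x);
    __syncthreads();

    float acc = 0.0f;
    for (int j = 0; j < BLOCKDIM - 1; ++j)
    {
        float a = buffer[j];
        float b = buffer[j + 1];
        acc += a * b;      // stand-in for the real per-iteration work
        __syncthreads();   // the extra sync that makes the reads come back correct
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}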
To me this looks very much like a bug. Or maybe I don't understand this parallel, coalesced data fetch & cache thing at all :( But it seems the texture read latency can't be hidden with a simple sync; apparently you need some memory latency or several syncs in between to get rid of the problem.
Has anyone seen similar problems with texture reads?
I don't understand why this helps. I used an 8x8 block and now use only 8x4, so 32 threads in the block, i.e. only one warp? All 32 threads should execute the same instruction, so all 32 read the same shared memory position, as in Figure 5-4, left side, p. 52 of the CUDA programming guide 0.9. Or am I wrong?
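To make the configuration concrete, the launch looks roughly like this; myKernel, numBlocks, d_out and position are placeholders for my actual setup:

dim3 block(8, 4);       // 8x4 = 32 threads per block, i.e. exactly one warp
dim3 grid(numBlocks);   // grid size depends on the problem
myKernel<<<grid, block>>>(d_out, position);

Inside the kernel the index into the shared buffer is the same for all 32 threads of the warp, so the read should be a conflict-free broadcast.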
The little uWaste loop makes it work properly. If we don't have the uWaste loop, then we get varying results when reading from the texture:
1. Sometimes we get a CUDA_ERROR_LAUNCH_FAILED.
2. Sometimes we get consistently erroneous data read from the texture (the correct data in the wrong position).
3. Sometimes the stars align, and everything works as it should!
I don't feel like going back to 0.8, but we don't think we saw this behavior there.
We are fairly confident it is timing related, as we can change DELAY_CNT and get varying results. Adding __syncthreads() gives us either effect #2 or #3 described above, but the kernel always launches and finishes properly. (A rough sketch of the delay loop follows.)
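For reference, the uWaste loop is essentially just a busy-wait dropped in right after the texture fetch; something along these lines (our real code differs, and DELAY_CNT is a compile-time constant we tune):

// crude delay loop: spin for DELAY_CNT iterations so the texture fetch
// has time to complete before the shared buffer is read back;
// volatile keeps the compiler from optimizing the loop away
volatile unsigned int uWaste = 0;
for (unsigned int i = 0; i < DELAY_CNT; ++i)
    uWaste += i;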
Hi, I don't understand your kernel configuration. Is uBase calculated from the thread and block IDs? Do you mean that every thread copies 16 texels to global memory? I would not expect that to be a coalesced write (see the sketch below).
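To illustrate what I mean by coalescing (I'm only guessing at your indexing; g is a hypothetical global output pointer, and uBase and the 16-texel count are taken from your description):

// (a) each thread writes its own contiguous chunk of 16 texels:
//     neighbouring threads are 16 floats apart -> not coalesced
for (int i = 0; i < 16; ++i)
    g[uBase + threadIdx.x * 16 + i] = tex1Dfetch(myTex, uBase + threadIdx.x * 16 + i);

// (b) interleaved: in each loop iteration consecutive threads write
//     consecutive addresses -> coalesces per half-warp
for (int i = 0; i < 16; ++i)
    g[uBase + i * blockDim.x + threadIdx.x] = tex1Dfetch(myTex, uBase + i * blockDim.x + threadIdx.x);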
Another point about texture access in CUDA that is quite unclear to me: are there fewer texture units per multiprocessor than ALUs? If so, when all threads read a texel at the same instruction there should still be some latency, even if every texel is already in the texture cache?
The pseudo code was just a simple demonstration of the problem; the other performance issues are known and not really important until texture reads work properly in the simple case.
I have also verified that this is not a problem with 0.8 and is new in 0.9. I’ll try to submit an official bug report.