Hi,
I am trying to write a kernel which will process an image row by row; if the image has 1000 rows, then I will launch 1000 blocks which have 512 threads (maximum thread per block count in my GTS 450). Each block will process a row of the image, for example if the image size is 1500x1500 then I will launch the kernel with <<<1500,512>>> and each block will process its according row (indicated by the blockIdx.x variable ) in a sliding window fashion. In the first pass of a block the pixels [0,511] in the row[blockId.x] are processed, then the pixels [512,1023] are processed and finally [1023,1499]. I tried to implement the algorithm fully but many strange errors occurred and I had to simplify it such that at the moment it only writes simple colors on the target image via surf2Dwrite function. (By the way, the kernel is directly writing into a D3D9 texture via a surface reference.)
Current state of the code is the following:
__shared__ unsigned int passCount;
__global__ void ShadowKernel2(RayMarchInitInfo* input)
{
unsigned int tid;
float dx;
float dy;
float dz;
//Initialization Step 1, global to shared memory transfer, done by thread 0.
if(threadIdx.x == 0)
{
passCount=0;
__threadfence();
}
__syncthreads();
dx=const_dx[0];
//dy=const_deltaY[0];
//dz=const_dz[0];
__syncthreads();
while(true)
{
__syncthreads();
tid=passCount*blockDim.x + threadIdx.x;
__syncthreads();
surf2Dwrite(0xFFFFFF00, surf_Default, (tid) * sizeof(unsigned int), blockIdx.x);
__syncthreads();
if(tid >= 750)
break;
__syncthreads();
if(threadIdx.x == 0)
{
passCount++;
__threadfence();
}
__syncthreads();
}
}
This kernel is only for debugging purposes; there is a shared memory variable called passCount, which I change via thread 0 in the current block. This variable controls the current position of the whole block on its according image row. Each thread of the block writes a color to the position tid=passCount*blockDim.x + threadIdx.x. After writing to the image, if a thread’s offset from the first pixel of the row is greater than 750(I made this number constant in order to simplify the debugging) the thread exits from its main loop. Since only thread 0 controls the shared variable passCount, I placed a __threadfence() whenever it changes the value of it. Moreover, I explicitly synchronized ALL the threads in the block to avoid any undefined behavior by placing a __syncthreads() after every line. const_dx[0] is a constant memory array which I used in the original algorithm but it doesn’t do anything meaningful in the current code.
When I run this kernel with the line dx=const_dx[0] open and with the call ShadowKernel2<<<1400,512>>>(inputList) it outputs the following image:
According to my observations, some of the threads in some blocks erroneously reads passCount as 2 where the correct value is 1 and they appear as shifted to right by 512 pixels. This happens despite all the __syncthreads() and __threadfence() calls after each line and I can’t find any meaningul explanation to that, because all threads are heavily synchronized!
Only clue I have is, If I remove or comment out the line dx=const_dx[0] the kernel works like expected:
So I am absolutely clueless what causes this error. The access to the shared memory is heavily synchronized so there can’t be any race condition between threads. So I desperately need any advice on this situation.
Thanks in advance.