syncthread race condition

norrathe · December 13, 2012, 5:13pm

Hi guys,

I'm using the following code, but the result turns out to be NaN for large inputs. Small inputs work fine.

__global__ void fastComputeActivation(const float *input, float *output, int width, int height, int depth, const int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int row = (i%(width*height))/width;
    int col = (i%(width*height))%width;
    int d = i/(width*height);

    if (i < size)
    {
        int idx = d+depth*row;
        int inIdx = d*height*width+row*width;
        if(col == 0)    output[idx] = 0;
        __syncthreads();
        atomicAdd(&(output[idx]),input[inIdx+col]);
        __syncthreads();
        if(col == 0)
        {
            output[idx] = sigmoid(output[idx]);
        }
    }

}

I am new to CUDA, so does __syncthreads() work across blocks?
I believe it does not, so I want something similar to __syncthreads() that acts as a global barrier across the whole grid.
Or are there other problems in my code?

Thanks,

Gorune · December 13, 2012, 11:19pm

Hi,

I want to note a few things:

__syncthreads() is a barrier that every thread of a block must reach before any of them can move on. If you put it inside an if statement that can split the threads of a block, the behavior is undefined. You can close the if clause before the synchronization point and open another clause with the same condition after it.
Why don't you use threads in the y and z directions (threadIdx.y and threadIdx.z)? They are there to help you think in the natural dimensions of the problem.
From what I understand of the code, you first add up the entries of each row and then apply some function to the sum. The second __syncthreads() looks like it is meant to guarantee that the sum is complete. But suppose 2 blocks have threads working on the same row. The first block adds its entries, synchronizes, moves on and applies the function to the current sum. Then the second block gets scheduled, adds its entries to the result of the function and reapplies the function, and you get unexpected results. The same goes for the zeroing step: the block that contains col == 0 can reset output[idx] after another block has already added to it. I suggest you assign a row to a block, or more than 1 row to a single block depending on the size of the input, and have the block do a reduction on the row, apply the function, and then move on to the next row. Multiple blocks can also be used for one reduction, but you'll have to be more careful.
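To make the last suggestion concrete, here is a minimal sketch of the one-block-per-row idea. It is not the original poster's code: the kernel name rowSumActivation, the launch parameters and the body of sigmoid() are my assumptions (the post never shows sigmoid()), and it assumes blockDim.x is a power of two. It keeps the input and output layout of the kernel in the question, uses blockIdx.x and blockIdx.y for row and depth instead of decoding a flat index, and keeps every __syncthreads() outside of divergent branches.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Assumed definition; the original post does not show sigmoid().
__device__ float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

// One block per (row, d) pair: blockIdx.x is the row, blockIdx.y is the depth slice.
// Assumes blockDim.x is a power of two and blockDim.x * sizeof(float) bytes of
// dynamic shared memory are passed at launch.
__global__ void rowSumActivation(const float *input, float *output,
                                 int width, int height, int depth)
{
    extern __shared__ float partial[];
    int row = blockIdx.x;
    int d   = blockIdx.y;
    int tid = threadIdx.x;
    const float *rowPtr = input + ((size_t)d * height + row) * width;

    // Each thread sums a strided slice of the row.
    float sum = 0.0f;
    for (int col = tid; col < width; col += blockDim.x)
        sum += rowPtr[col];
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory. The barrier sits outside the if,
    // so every thread of the block reaches it on every iteration.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // Only this block ever touches this output element, so no atomics are needed.
    if (tid == 0)
        output[d + depth * row] = sigmoid(partial[0]);
}

int main()
{
    const int width = 1000, height = 7, depth = 3, threads = 256;
    std::vector<float> hIn((size_t)width * height * depth, 0.001f);
    std::vector<float> hOut((size_t)height * depth);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, hIn.size() * sizeof(float));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), hIn.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 grid(height, depth);
    rowSumActivation<<<grid, threads, threads * sizeof(float)>>>(dIn, dOut, width, height, depth);
    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // Every row sums to roughly 1.0, so each output should be close to sigmoid(1).
    printf("output[0] = %f, reference = %f\n", hOut[0], 1.0f / (1.0f + expf(-1.0f)));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Because each output element is written by exactly one block, there is nothing left to synchronize between blocks. Note that gridDim.y is limited to 65535, so a very large depth would need a different mapping.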

Whichever way you choose to implement it, the idea is to make 100% sure that the sum is complete before the function is applied. You seem to realize this, of course, but the current implementation looks like it would fail to do it.
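If you would rather keep the atomicAdd approach with several blocks per row, one way to be sure the sum is complete is to split the work into separate launches. Operations issued to the same stream run in order, so the boundary between two kernels behaves like a grid-wide barrier. The sketch below is only an illustration under that assumption: the names accumulateRows, applySigmoid and computeActivation are hypothetical, and sigmoid() is again my guess at the missing function.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Assumed definition; the original post does not show sigmoid().
__device__ float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

// Step 1: every thread adds one input element to the sum of its row.
// The index math matches the kernel in the question (inIdx + col equals i).
__global__ void accumulateRows(const float *input, float *output,
                               int width, int height, int depth, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= size) return;
    int row = (i % (width * height)) / width;
    int d   = i / (width * height);
    atomicAdd(&output[d + depth * row], input[i]);
}

// Step 2: runs only after every block of accumulateRows has finished.
__global__ void applySigmoid(float *output, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        output[i] = sigmoid(output[i]);
}

// Hypothetical host wrapper; dIn and dOut are device pointers.
void computeActivation(const float *dIn, float *dOut, int width, int height, int depth)
{
    const int size = width * height * depth;
    const int count = height * depth;
    const int threads = 256;

    cudaMemset(dOut, 0, count * sizeof(float));   // all-zero bytes are 0.0f
    accumulateRows<<<(size + threads - 1) / threads, threads>>>(dIn, dOut, width, height, depth, size);
    applySigmoid<<<(count + threads - 1) / threads, threads>>>(dOut, count);
}

int main()
{
    const int width = 1000, height = 7, depth = 3;
    std::vector<float> hIn((size_t)width * height * depth, 0.001f);
    std::vector<float> hOut((size_t)height * depth);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, hIn.size() * sizeof(float));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), hIn.size() * sizeof(float), cudaMemcpyHostToDevice);

    computeActivation(dIn, dOut, width, height, depth);
    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // Every row sums to roughly 1.0, so each output should be close to sigmoid(1).
    printf("output[0] = %f, reference = %f\n", hOut[0], 1.0f / (1.0f + expf(-1.0f)));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

This costs extra launches and serializes the atomics on each row, so the in-block reduction is usually the better fit. It mainly shows where a global barrier can come from.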

All of this, of course, provided that I understood your code properly, haha.
