does this code have problem?

first look at this piece of code:

__global__ void kernel(int* input, int* output, int size)

{

    int tid = threadIdx.x;

    int bid = blockIdx.x;

   int index = BLOCK_SIZE * bid + tid;

   if (index >= size) return;

   __shared__ int tmp[BLOCK_SIZE];

   tmp[tid] = input[index];

   __syncthreads();

   output[index] = tmp[BLOCK_SIZE - tid - 1];

}

this is a simple demo, the idea is to cache BLOCK_SIZE of data into shared memory, and swap them. Question is:

we first check the calculated index, if it beyond the size, then this thread return, but if it is under the size, then try to read one data each thread, and the __syncthreads() is used to make sure all data are loaded, but how about the returned thread? does __syncthreads() wait for that thread? if so, then this code has problem.

anyone can explain something to me? thanks.

You already explained it yourself. This code will deadlock on the GPU and will report an incorrect syncthreads usage if you run it in the emulator.

I just expained what I guess, I don’t know if it is true. So from your reply, seems that’s not a correct coding pattern. so means we should align the array of input and make all the threads follow the same way.

so from my guess, the following code piece is wrong again:

__global__ void foo(...)

{

    int tid = threadIdx.x;

   if (tid > 3)

    {

        // do some job....

        

        __syncthreads();

    }

   __syncthreads();

   // write output

    output[tid] = ....

}

please help me to find out if this code piece is wrong also, so that to make me understand the thread more. thanks.

I think this code is wrong. Problem is that __syncthreads() waits for ALL threads in a block to arrive to some point in program and you have __syncthreads() inside if { } block. Not all threads from a block will execute code inside if { } block and this is why your code will deadlock.

IMO you can remove __synctreads() from inside of the if { } block and your code should be okay.

thanks very much for your help.

well, previously I used this pattern, because I didn’t align the input data, so sometimes the index calculated by the thread idx and block idx may beyond the input array’s max size, so I use if {} block, and I do need __syncthreads() inside if {} block.

so now I have to re-code my program, to align the input data first, so that each threads can not beyond the input’s max size, to through away the if {} block.

does everybody need to algin the input data first? when the input array maybe not the multiple of the block size?

It will always be better to align your input to the nearest multiple of 32. But even then, there may be boundary conditions to handle. There should be nothing wrong with

__global__ void kernel(float *in, float *out, int N)

    {

    float sdata[BLOCK_SIZE];

    if (threadIdx.x < N)

         sdata[threadIdx.x] = in[threadIdx.x];

    else

         sdata[threadIdx.x] = 0.0f;

   __syncthreads();

   // process data in sdata, possibly involving more __syncthreads();

    out[threadIdx.x] = result;

    }

You only need to keep all the extra thread running. Even if there is always an if there to prevent those threads from doing anything, they need to be running and participate in the syncthreads to avoid deadlocks.

Hey, first in your first post you have a return in the middle of the code after an if, that is dangerous, and indeed later you call syncthreads().

but for the current question, you can always take an

if (…)

{

syncthreads;

}

and split it into

if (…)

{

}

syncthreads;

if (…) // re enter the if condition !

{

}

in general in cuda, if you have situations were some threads don’t work u still have to keep them flowing along the code path until the function returns.

if you understand how the architecture works it makes sense.

hope this helps