GPU atomic division

Hi Everyone,

I have a race condition in the code below, so I wanted to resolve it with an atomic division, but no such atomic function exists. Does anybody have ideas? Thank you so much.

// do reduction in shared mem
for (unsigned int s = blockDim.x/2; s > 0; s >>= 1)
{
    if (tid < (blockSize - s))
    {
        // race condition problem starts
        myProduct *= sdata[tid + s];
        while (myProduct > 0x0000FFFF) {
            myProduct = myProduct / 256;
        }
        sdata[tid] = myProduct;
        // race condition problem ends
    }
    __syncthreads();
}

Thanks
Rock

If you are on compute capability 2.x, you can store the logarithm and use atomicAdd().

Hi Tera,

How can I convert the division product = product/256 into an addition? You mean product = product + (-product * 255/256)? But this would be slow, right?

Sorry, I didn’t look at your actual code.

Now that I have: this looks like you attempted a reduction but got the if-condition wrong, introducing a write-after-read hazard. No need for atomic operations, just fix the synchronization. The following code should work (assuming I’m not yet too tired…):

// do reduction in shared mem
for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
    if (tid < s) {
        myProduct *= sdata[tid + s];
        while (myProduct > 0x0000FFFF) {
            myProduct = myProduct / 256;
        }
        sdata[tid] = myProduct;
    }
    __syncthreads();
}

If I’ve been too tired and the code doesn’t work, try inserting a __syncthreads(); after the myProduct *= line.

Hi Tera,

Thank you for the quick response. Actually my requirement is not the same as a reduction. What I want is:

Say we have p1, p2, p3, p4; I need p1, p1p2, p1p2p3, p1p2p3p4,

so I need tid < (blockSize - s). I use N threads for N elements, and blockSize == blockDim.x.

How about your idea?

Ok, in that case try inserting the __syncthreads() to prevent the shared memory variables from being overwritten before they are read.

It works after inserting __syncthreads(). Thank you so much :)

Hi Tera,

For my requirement, can I also optimize with loop unrolling and template parameters like the reduction example in the CUDA SDK? Many thanks :)

Yes, of course you can.

If you try leaving out the __syncthreads() once only one warp is running, don’t forget to declare the shared mem variable as volatile.

Hi,

This looks like a prefix scan algorithm.

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf

Thank you, brano. :)

Well, atomic division can prove more difficult than changing prejudice.

Hi Tera,

I want to optimize the code according to kernel 5 in Nvidia parallel reduction example http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf, but I have a problem:

For example: if we have p1, p2, p3, … p8, this kernel only calculates the single total p1+p2+p3+…+p8; after the warp-synchronous steps, shared memory holds P12345678, P7531, p62, p51, p4, p3, p2, p1, and only the total is written out. But if we need all the results p1+p2+…+p8, p1+p2+…+p7, p1+p2+…+p6, …, p2+p1, p1, do you have any suggestion?

template <class T, unsigned int blockSize>
__global__ void
reduce5_bak(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;

    T mySum = (i < n) ? g_idata[i] : 0;
    if (i + blockSize < n)
        mySum += g_idata[i+blockSize];

    sdata[tid] = mySum;
    __syncthreads();

    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = mySum = mySum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = mySum = mySum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = mySum = mySum + sdata[tid +  64]; } __syncthreads(); }

#ifndef __DEVICE_EMULATION__
    if (tid < 32)
#endif
    {
        // now that we are using warp-synchronous programming (below)
        // we need to declare our shared memory volatile so that the compiler
        // doesn't reorder stores to it and induce incorrect behavior.
        volatile T* smem = sdata;
        if (blockSize >=  64) { smem[tid] = mySum = mySum + smem[tid + 32]; EMUSYNC; }
        if (blockSize >=  32) { smem[tid] = mySum = mySum + smem[tid + 16]; EMUSYNC; }
        if (blockSize >=  16) { smem[tid] = mySum = mySum + smem[tid +  8]; EMUSYNC; }
        if (blockSize >=   8) { smem[tid] = mySum = mySum + smem[tid +  4]; EMUSYNC; }
        if (blockSize >=   4) { smem[tid] = mySum = mySum + smem[tid +  2]; EMUSYNC; }
        if (blockSize >=   2) { smem[tid] = mySum = mySum + smem[tid +  1]; EMUSYNC; }
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}