Slow shared memory access and incorrect parallel reduction results

I have a kernel that calculates primes using the Sieve of Eratosthenes: on the host I compute the primes from 2 to sqrt(N), and then I use the GPU to find the primes from sqrt(N) to N. The version that uses global memory works fine, but I have some issues with the kernel that uses shared memory:

__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
        volatile unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
        __shared__ int s_flags[NTHREADS];

        if ( gid > sqrootN && gid < N )
        {
                s_flags[tid] = 1;
                for (unsigned int j=2; j<=sqrootN; j++)
                {
                        if ( gid % j == 0 )
                                s_flags[tid] = 0;
                }
        }
        else
                return;

        __syncthreads();

        //reduce
        reduce(s_flags, tid, o_flags);
}

The reduction code is as follows:

__device__ void warpReduce (volatile int* sdata, int tid)
{
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
}

__device__ void reduce ( volatile int *sdata, int tid, int *gdata)
{
        for (unsigned int s=blockDim.x/2; s>32; s>>=1)
        {
                if (tid < s)
                        sdata[tid] += sdata[tid + s];
        }

        if (tid < 32)
                warpReduce(sdata, tid);

        //write results of this block to the global memory
        if (tid == 0)
                gdata[blockIdx.x] = sdata[0];
}

There are two questions:

1. This kernel is very slow compared to the version that uses global memory. I guess this is because of the for loop (when I remove it, the time is acceptable). How would I rectify that?

2. This kernel gives incorrect results. When I did the reduction on the host side, I noticed that after a certain data size (100000) this kernel produced the same (incorrect) result for all subsequent data sizes, and even for data sizes less than 100000 the reduction results are incorrect. How do I fix this?

I am more concerned about #1.

Cross post to Stack Overflow: Parallel Reduction in CUDA for calculating primes.

Thanks.

It would appear your reduction code assumes your data to be a power of 2. Also, your sieve function won’t have all its threads hitting your __syncthreads() call, which is most likely not what you were looking for. More specifically, __syncthreads() is supposed to wait until all your threads in your block reach it. Or, as the programming guide puts it, “__syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed”. Not to bludgeon you with the obvious, but if your threads exit prematurely, they won’t be reaching your __syncthreads() call.
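For what it's worth, one way to keep the range check while making sure every thread reaches the barrier is to guard only the work rather than returning early, roughly like this (an untested sketch; it still assumes blockDim.x == NTHREADS and leaves the power-of-two assumption in the reduction untouched):

__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
        unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
        __shared__ int s_flags[NTHREADS];

        // every thread initializes its slot, so out-of-range threads contribute 0 to the sum
        s_flags[tid] = 0;

        if ( gid > sqrootN && gid < N )
        {
                s_flags[tid] = 1;
                for (unsigned int j=2; j<=sqrootN; j++)
                {
                        if ( gid % j == 0 )
                                s_flags[tid] = 0;
                }
        }

        // no early return, so every thread in the block reaches the barrier
        __syncthreads();

        reduce(s_flags, tid, o_flags);
}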

Thank you for responding. A Stack Overflow member warned me of the same thing, but I guess I didn't understand him at the time. I guess you are saying that because of the condition I have based on gid, some threads would not hit __syncthreads? So if I have the following code block (please do not consider correctness), then all the threads in a block hit the barrier?

__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
        unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
        volatile __shared__ int s_flags[NTHREADS];

        s_flags[tid] = 1;
        for (unsigned int j=2; j<=sqrootN; j++)
        {
                if ( gid % j == 0 )
                        s_flags[tid] = 0;
        }

        __syncthreads();

        //reduce
        reduce(s_flags, tid, o_flags);
}

I timed the above code after removing the conditions on gid, and noticed that it is still slow. So is it a good idea to use shared memory in my case, where every thread has to execute that loop? Thanks.

A suggestion to speed up this code is to test for j=2 separately and start j at 3, incrementing by 2. That would avoid testing all the even numbers you are currently checking but which were effectively tested by j=2.

In your for loop, once any of your mod ops sets the s_flags entry for that tid to zero, you might as well break, since you've already eliminated that candidate. That's a matter of style more than anything, since some of your gids will actually be prime and never be eliminated, so some of your threads will try all the values of j, meaning the eliminated threads will have to wait on the prime threads anyway (at the __syncthreads point). Maybe it'll save electricity is all.
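A rough sketch of the inner loop with both tweaks applied (test j=2 once, then step by 2, and break as soon as the flag is cleared); it assumes gid > 2, which holds here since gid > sqrootN, and is untested:

        s_flags[tid] = 1;
        if ( gid % 2 == 0 )
        {
                // an even gid > 2 can never be prime, no need to test further
                s_flags[tid] = 0;
        }
        else
        {
                for (unsigned int j=3; j<=sqrootN; j+=2)
                {
                        if ( gid % j == 0 )
                        {
                                s_flags[tid] = 0;
                                break;  // one divisor is enough to rule out gid
                        }
                }
        }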

bf