Slow shared memory access and incorrect parallel reduction results

I have a kernel that calculates primes using the Sieve of Eratosthenes: on the host I compute the primes from 2 to sqrt(N), and then I use the GPU to find the primes from sqrt(N) to N. The version that uses global memory works fine, but I have some issues with the kernel that uses shared memory:

__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
        volatile unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
        __shared__ int s_flags[NTHREADS];

        if ( gid > sqrootN && gid < N )
        {
                s_flags[tid] = 1;
                for (unsigned int j=2; j<=sqrootN; j++)
                {
                        if ( gid % j == 0 )
                                s_flags[tid] = 0;
                }
        }
        else
                return;

        __syncthreads();

        //reduce
        reduce(s_flags, tid, o_flags);
}

The reduction code is as follows:

__device__ void warpReduce (volatile int* sdata, int tid)
{
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
}

__device__ void reduce ( volatile int *sdata, int tid, int *gdata)
{
        for (unsigned int s=blockDim.x/2; s>32; s>>=1)
        {
                if (tid < s)
                        sdata[tid] += sdata[tid + s];
        }

        if (tid < 32)
                warpReduce(sdata, tid);

        //write results of this block to the global memory
        if (tid == 0)
                gdata[blockIdx.x] = sdata[0];
}

There are two questions:

1. This kernel is very slow compared to the version that uses global memory. I guess this is because of the for loop (when I remove it, the time is acceptable). How would I rectify that?

2. This kernel gives incorrect results. When I did the reduction on the host side, I noticed that after a certain data size (100000) this kernel produced the same (incorrect) result for all subsequent data sizes, and even for data sizes less than 100000 the reduction results are incorrect. How do I fix this?

I am more concerned about #1.

Cross post to Stack Overflow: Parallel Reduction in CUDA for calculating primes.

Thanks.

It would appear your reduction code assumes your data to be a power of 2. Also, your sieve function won’t have all its threads hitting your __syncthreads() call, which is most likely not what you were looking for. More specifically, __syncthreads() is supposed to wait until all your threads in your block reach it. Or, as the programming guide puts it, “__syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed”. Not to bludgeon you with the obvious, but if your threads exit prematurely, they won’t be reaching your __syncthreads() call.
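For what it's worth, one way to keep the range check while making sure every thread reaches the barrier is to guard only the work rather than returning early, roughly like this (an untested sketch; it still assumes blockDim.x == NTHREADS and leaves the power-of-two assumption in the reduction untouched):

__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
        unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
        __shared__ int s_flags[NTHREADS];

        // every thread initializes its slot, so out-of-range threads contribute 0 to the sum
        s_flags[tid] = 0;

        if ( gid > sqrootN && gid < N )
        {
                s_flags[tid] = 1;
                for (unsigned int j=2; j<=sqrootN; j++)
                {
                        if ( gid % j == 0 )
                                s_flags[tid] = 0;
                }
        }

        // no early return, so every thread in the block reaches the barrier
        __syncthreads();

        reduce(s_flags, tid, o_flags);
}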

Thank you for responding. A Stack Overflow member warned me of the same thing, but I guess I didn't understand him at the time. I guess you are saying that because of the condition I have based on gid, some threads would not hit __syncthreads? So if I have the following code block (please do not consider correctness), then all the threads in a block hit the barrier?

__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
        unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
        volatile __shared__ int s_flags[NTHREADS];

        s_flags[tid] = 1;
        for (unsigned int j=2; j<=sqrootN; j++)
        {
                if ( gid % j == 0 )
                        s_flags[tid] = 0;
        }

        __syncthreads();

        //reduce
        reduce(s_flags, tid, o_flags);
}

I timed the above code after removing the conditions on gid, and noticed that it is still slow. So is it a good idea to use shared memory in my case, where every thread has to execute that loop? Thanks.

A suggestion to speed up this code is to test for j=2 separately and start j at 3, incrementing by 2. That would avoid testing all the even numbers you are currently checking but which were effectively tested by j=2.

In your for loop, once any of your mod ops sets the s_flags entry for that tid to zero, you might as well break, since you've already eliminated that candidate. That's a matter of style more than anything, since some of your gids will actually be prime and never be eliminated, so some of your threads will try all the values of j, meaning the eliminated threads will have to wait on the prime threads anyway (at the __syncthreads point). Maybe it'll save electricity is all.
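A rough sketch of the inner loop with both tweaks applied (test j=2 once, then step by 2, and break as soon as the flag is cleared); it assumes gid > 2, which holds here since gid > sqrootN, and is untested:

        s_flags[tid] = 1;
        if ( gid % 2 == 0 )
        {
                // an even gid > 2 can never be prime, no need to test further
                s_flags[tid] = 0;
        }
        else
        {
                for (unsigned int j=3; j<=sqrootN; j+=2)
                {
                        if ( gid % j == 0 )
                        {
                                s_flags[tid] = 0;
                                break;  // one divisor is enough to rule out gid
                        }
                }
        }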

bf