 # Help with prime algorithmen

Hello,

I’m trying to calculate primenumbers on a GPU.

I tried it with opencl and cuda, but the appilcation always terminates when the number exceeds 10^7.

``````#define THREADSPERGPU 250000

__global__ void

testKernel( bool* g_odata, int gSize)

{

const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

unsigned int actualNumber = tid;

bool prime;

while(actualNumber<gSize)

{

prime = true;

for(unsigned int i=2;i*i<=actualNumber;i++)

{

if(!(actualNumber%i))

{

prime = false;

break;

}

}

g_odata[actualNumber] = prime;

}

}
``````

Till 10^6 I have no problems. CUDA exits with cudaError_enum at memory location. I tried this config in emudebug without any errors.

I also tried to write my own modulo function:

``````float r1 =actualNumber/(float)i;

int r2 = (int) r1;

if((r1-(float)r2)==0.0)

{

prime = false;

break;

}
``````

This does function, also above 10^7, but because of the bad rounding capability of the device, most of the results are wrong (but in emudebug there are no errors).

So what do I do wrong or are there any restrictions, that I don’t know?

Thanks

PS: My device is a GTS250 with 1024 MB RAM, that should be enough for 10^9

May b, you are hitting the “block” and “Grid” limits…

A block can have 512 threads max (total of all 3 dimensions)

The max total threads per kernel launch is still a very big number… So, i would not suspect the gridDim.

The programm always uses 65535 Threads at max.

``````unsigned int primeLimit= pow(10.0,7);

for(int l=512;l>0;l--)

{

if(!(primeLimit%l))

{

break;

}

}

``````
``````testKernel<<< gridDim, blockDim >>>(d_oData,primeLimit);
``````

Each thread increases its TID by 65535 until it exceeds the prime Limit.

When using OpenCL u don’t have to think about thread or grid limits, it’s all done automatically. And there I’ve the same problems.

I think the problem is the modulo function. When I delete the modulo part in my kernel, everything work fine. Could there be a modulo restriction?

Has anyone a usable alternative to the modulo function?

Thx

``````unsigned int primeLimit= pow(10.0,7);

for(int l=512;l>0;l--)

{

if(!(primeLimit%l))

{

break;

}

}

``````

I don’t think 65535 is a perfect square. It is (2^8*2^8 - 1).

255*255 = 65025

256*256 = 65536

I dont understand the algorithm that you are using to choose num_threads.

You are looking for a divisor of 10^7 that is closest to 512 (down to 0)

Well, I ran your code on my laptop.

And, “num_threads” value is 500 for 10^7 (as well as 10^6)… Which is very understandable… For any power of 10 greater than 2, 500 will always divide it…

So, effectively you are using 500 threads in a block with 500 blocks which totals to 250,000 threads…

So, what is 65535 all about…?

Sorry, my expressions are missleading and a little confused.

I first implemented this algorithmen in openCL. There u have a limit of 65536 threads per GPU. So when u try to test 10^7 numbers each thread has to do 10^7 / 65536 passes. So I take the thread ID and add 65535(0-65535=>65536) after each pass until it exceeds the primelimit (10^7). So each thread tests about 152 numbers.
After experiencing problems when reaching 10^7, I tried to do the same with CUDA.
The number of threads and blocks must be even or not? So I decided to find the maximum threads by divding the primelimit until I find an even divider for the grid size. The next even divider would be 500, so 500 threads and 20.000 blocks. But this didn’t work, so I took 500 threads and 500 blocks (250.000 threads) and divided 10^7 by this. Now the thread ID is incremented by 250.000 each pass (40 tests per thread). I forget to write it here, because I was still playing around with my opencl code.

I will edit my first post.