Help with prime algorithm

Hello,

I’m trying to calculate prime numbers on a GPU.

I tried it with OpenCL and CUDA, but the application always terminates when the number exceeds 10^7.

#define THREADSPERGPU 250000

// Each thread tests the numbers tid, tid + THREADSPERGPU, tid + 2*THREADSPERGPU, ...
// by trial division and writes one flag per number into g_odata.
__global__ void testKernel(bool* g_odata, int gSize)
{
	const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int actualNumber = tid;
	bool prime;

	while (actualNumber < gSize)
	{
		prime = true;
		for (unsigned int i = 2; i * i <= actualNumber; i++)
		{
			if (!(actualNumber % i))   // divisible -> not prime
			{
				prime = false;
				break;
			}
		}
		g_odata[actualNumber] = prime;
		actualNumber += THREADSPERGPU;
	}
}

Up to 10^6 I have no problems. Above that, CUDA exits with a cudaError_enum at a memory location. I tried the same configuration in emudebug without any errors.
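One way to see the concrete error instead of the raw cudaError_enum exception is to query the runtime status right after the launch. A minimal sketch, assuming the kernel is launched as shown further down (checkLaunch is just an illustrative helper name):

#include <stdio.h>

// Call this right after the kernel launch, e.g.
//   testKernel<<< gridDim, blockDim >>>(d_oData, primeLimit);
//   checkLaunch("testKernel");
void checkLaunch(const char* name)
{
	cudaError_t err = cudaGetLastError();   // invalid configuration / launch errors
	if (err == cudaSuccess)
		err = cudaThreadSynchronize();      // errors raised while the kernel ran
	if (err != cudaSuccess)
		printf("%s failed: %s\n", name, cudaGetErrorString(err));
}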

I also tried to write my own modulo function:

float r1 = actualNumber / (float)i;   // floating-point quotient
int   r2 = (int)r1;                   // truncated integer part

if ((r1 - (float)r2) == 0.0f)         // no fractional part -> divisible
{
	prime = false;
	break;
}

This does work, even above 10^7, but because of the device’s poor rounding behaviour in single precision, most of the results are wrong (in emudebug, however, there are no errors).

So what am I doing wrong, or are there restrictions that I don’t know about?

Thanks

PS: My device is a GTS 250 with 1024 MB of RAM, which should be enough for 10^9.

Maybe you are hitting the “block” and “grid” limits…

A block can have at most 512 threads (the product of all 3 block dimensions).

The maximum total number of threads per kernel launch is still a very big number… so I would not suspect gridDim.
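For reference, those limits can be queried at runtime rather than assumed; a minimal sketch using the CUDA runtime API (device 0 is an assumption):

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

	printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
	printf("max block dims:        %d x %d x %d\n",
	       prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
	printf("max grid dims:         %d x %d x %d\n",
	       prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
	return 0;
}

On compute capability 1.x devices such as the GTS 250 this reports 512 threads per block and a 65,535 x 65,535 x 1 grid.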

Thanks for the answer.

The program always uses 65,535 threads at most.

unsigned int primeLimit = pow(10.0, 7);
unsigned int num_threads;

// find the largest divisor of primeLimit that is <= 512
for (int l = 512; l > 0; l--)
{
	if (!(primeLimit % l))
	{
		num_threads = l;
		break;
	}
}

int gridDim  = num_threads;
int blockDim = num_threads;
testKernel<<< gridDim, blockDim >>>(d_oData, primeLimit);

Each thread increases its TID by 65,535 until it exceeds the prime limit.

When using OpenCL you don’t have to think about thread or grid limits; it’s all handled automatically. And there I have the same problems.

I think the problem is the modulo function. When I remove the modulo part from my kernel, everything works fine. Could there be a restriction on modulo?

Does anyone have a usable alternative to the modulo function?
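One alternative that stays entirely in integer arithmetic, and therefore avoids the float rounding issue, is to derive the remainder from an integer division. A minimal sketch (mod_u32 is just an illustrative name):

// Sketch: n mod d computed as n - (n / d) * d, using only integer operations.
// For unsigned operands this is equivalent to the % operator.
__device__ unsigned int mod_u32(unsigned int n, unsigned int d)
{
	return n - (n / d) * d;
}

// usage inside the trial-division loop:
//     if (mod_u32(actualNumber, i) == 0) { prime = false; break; }

Integer division on the GPU is not fast, but unlike the single-precision float division it is exact.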

Thx

unsigned int primeLimit = pow(10.0, 7);
unsigned int num_threads;

for (int l = 512; l > 0; l--)
{
	if (!(primeLimit % l))
	{
		num_threads = l;
		break;
	}
}

int gridDim  = num_threads;
int blockDim = num_threads;

So you have num_threads * num_threads threads per kernel launch.

I don’t think 65535 is a perfect square. It is (2^8*2^8 - 1).

255*255 = 65025

256*256 = 65536

I don’t understand the algorithm that you are using to choose num_threads.

You are looking for the divisor of 10^7 that is closest to 512, searching downward from 512.

Well, I ran your code on my laptop.

And the “num_threads” value is 500 for 10^7 (as well as for 10^6)… which is very understandable: 500 divides every power of 10 from 10^3 upwards…

So, effectively you are using 500 threads per block with 500 blocks, for a total of 250,000 threads…

So, what is 65535 all about…?

Sorry, my explanations are misleading and a little confused.

I first implemented this algorithm in OpenCL. There you have a limit of 65,536 threads per GPU. So when you try to test 10^7 numbers, each thread has to do 10^7 / 65,536 passes. So I take the thread ID and add 65,535 (IDs 0–65,535 => 65,536 threads) after each pass until it exceeds the prime limit (10^7). So each thread tests about 152 numbers.
After experiencing problems when reaching 10^7, I tried to do the same with CUDA.
The number of threads and blocks has to be even, doesn’t it? So I decided to find the maximum number of threads by dividing the prime limit until I found an even divisor for the grid size. The next even divisor would be 500, so 500 threads and 20,000 blocks. But this didn’t work, so I took 500 threads and 500 blocks (250,000 threads) and divided 10^7 by that. Now the thread ID is incremented by 250,000 each pass (40 tests per thread). I forgot to mention this here because I was still playing around with my OpenCL code.
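One way to avoid any mismatch between a hard-coded stride (THREADSPERGPU, 65,535, 250,000, …) and the actual launch configuration is to compute the stride inside the kernel from the launch itself. A sketch of that variant, with the same trial-division logic as the kernel above:

__global__ void testKernel(bool* g_odata, int gSize)
{
	// total number of threads in this launch, whatever <<<grid, block>>> was used
	const unsigned int stride = blockDim.x * gridDim.x;
	unsigned int actualNumber = blockIdx.x * blockDim.x + threadIdx.x;

	while (actualNumber < (unsigned int)gSize)
	{
		bool prime = true;
		for (unsigned int i = 2; i * i <= actualNumber; i++)
		{
			if (actualNumber % i == 0)   // divisible -> not prime
			{
				prime = false;
				break;
			}
		}
		g_odata[actualNumber] = prime;   // (0 and 1 are not special-cased, as in the original)
		actualNumber += stride;          // step by the real thread count, not a #define
	}
}

This keeps the kernel correct for any grid/block combination chosen on the host.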

I will edit my first post.

Thanks again for your replies.

Ok,

btw, which cudaError_enum are you getting?

Is it an “unspecified launch failure”? That would mean you are writing beyond what you allocated.

Check whether you are using the correct size for cudaMalloc(), and check whether the cudaMalloc() call succeeded in the first place.
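A minimal allocation sketch along those lines, assuming one bool per candidate number (since the kernel indexes g_odata with actualNumber up to gSize - 1) and the names d_oData / primeLimit from the snippets above (stdio.h assumed to be included):

bool* d_oData = 0;
size_t size = primeLimit * sizeof(bool);   // one flag per number 0 .. primeLimit-1,
                                           // because the kernel writes g_odata[actualNumber]

cudaError_t err = cudaMalloc((void**)&d_oData, size);
if (err != cudaSuccess)
{
	printf("cudaMalloc of %u bytes failed: %s\n",
	       (unsigned int)size, cudaGetErrorString(err));
	// bail out here instead of launching the kernel
}

For primeLimit = 10^7 that is only about 10 MB, well within the 1024 MB on the card, but an undersized or failed allocation would show up here.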

Good Luck!