I tried it with opencl and cuda, but the appilcation always terminates when the number exceeds 10^7.
#define THREADSPERGPU 250000
__global__ void
testKernel( bool* g_odata, int gSize)
{
const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int actualNumber = tid;
bool prime;
while(actualNumber<gSize)
{
prime = true;
for(unsigned int i=2;i*i<=actualNumber;i++)
{
if(!(actualNumber%i))
{
prime = false;
break;
}
}
g_odata[actualNumber] = prime;
actualNumber+=THREADSPERGPU;
}
}
Till 10^6 I have no problems. CUDA exits with cudaError_enum at memory location. I tried this config in emudebug without any errors.
I also tried to write my own modulo function:
float r1 =actualNumber/(float)i;
int r2 = (int) r1;
if((r1-(float)r2)==0.0)
{
prime = false;
break;
}
This does function, also above 10^7, but because of the bad rounding capability of the device, most of the results are wrong (but in emudebug there are no errors).
So what do I do wrong or are there any restrictions, that I don’t know?
Thanks
PS: My device is a GTS250 with 1024 MB RAM, that should be enough for 10^9
And, “num_threads” value is 500 for 10^7 (as well as 10^6)… Which is very understandable… For any power of 10 greater than 2, 500 will always divide it…
So, effectively you are using 500 threads in a block with 500 blocks which totals to 250,000 threads…
Sorry, my expressions are missleading and a little confused.
I first implemented this algorithmen in openCL. There u have a limit of 65536 threads per GPU. So when u try to test 10^7 numbers each thread has to do 10^7 / 65536 passes. So I take the thread ID and add 65535(0-65535=>65536) after each pass until it exceeds the primelimit (10^7). So each thread tests about 152 numbers.
After experiencing problems when reaching 10^7, I tried to do the same with CUDA.
The number of threads and blocks must be even or not? So I decided to find the maximum threads by divding the primelimit until I find an even divider for the grid size. The next even divider would be 500, so 500 threads and 20.000 blocks. But this didn’t work, so I took 500 threads and 500 blocks (250.000 threads) and divided 10^7 by this. Now the thread ID is incremented by 250.000 each pass (40 tests per thread). I forget to write it here, because I was still playing around with my opencl code.