In order to learn this stuff I decided to write a classic programming problem with Cuda, a function to check if a number is a prime number. I’ve written a C# version and a version using Cuda. The C# version works correctly and takes about 16 seconds on my system to tell me that 2147483647 is a prime number, in fact it is the largest prime signed 32bit integer and it is the number I am testing with for the upper range, even though I used unsigned integers in the function. The version written in C, for cuda, works on small prime numbers but it tells me that 2147483647 is divisible by 536870912, if you divide those two numbers you get 3.999999998137… Clearly there are some precision issues where ((num %) == 0) is not good enough, although it is in C#. I haven’t used C since college, what am I missing here, why is this not working on larger numbers?

extern "C" __global__ void
Parallel_isPrime(const float * input, unsigned long int * output)
{
int tid = threadIdx.x;
int dib = blockIdx.x;
unsigned long int max = rintf((input[0] / 2));
unsigned long int num = input[0];
for (unsigned long int i = (tid + 2) * (dib + 1); i < max; i += 256 * (dib + 1))
{
if ((num % i) == 0)
output[0] = i;
}
}

As you can see the function divides the iteration of the loop among the threads, I hard coded the number of threads at 256 because I don’t know how to get the thread count from the device code, other than passing that value to the device.

I would guess this has something to do with how modulo is implemented. If it is implemented as something like: prime - floor(prime / factor) * factor, then in single precision the result of that division on the card is exactly 4, so it doesn’t get floored to 3. Then the result is that the remainder is -1 instead of 536870911. How exactly the code then handles negative remainders, I don’t know…maybe it flushes them to 0 instead of just returning them?

You’re passing in your arguments as FLOATs. That’s bad… a float can only represent integers exactly up to about 8000000. So you’re passing in your number, but its not really 2147483647, it’s the floating point version of that, truncated to probably the top 23 bits with the low order bits 0.
That number isn’t prime.

Easy fix: don’t pass in your argument as a float! Pass it in as an unsigned integer.

With further refinement I’m able to check 10,000 numbers starting at the largest signed 32bit integer prime (2147483647) in 1076ms. Very impressive, I don’t think the CPU could even come close to this rate. I have verified the output with a starting number of 100,000 and it appears to be producing valid data.

extern "C" __global__ void
Parallel_isPrime(const unsigned long long int * input, unsigned long long int * output)
{
int tid = threadIdx.x;
int dib = blockIdx.x;
//input[0] = number to start with
//input[1] = number of threads
//input[2] = number of numbers to test for prime starting at input[0]
//input[3] = number of blocks
unsigned long long int max;
unsigned long long int num = input[0] + dib;
for (int y = dib; y < input[2]; y += input[3])
{
max = rint(sqrt((double)num));
for (unsigned long long int i = tid; i < max; i += input[1])
{
if (i > 1)
{
if ((num % i) == 0)
{
output[y] = i;
break;
}
}
}
num += input[3];
__syncthreads();
}
}