Speed up my program

Hey. I coded a program to find prime numbers, but it is really slow - slower than on the CPU. I wonder how I can speed it up.

The code:

[codebox]#include <conio.h>
#include <stdio.h>
#include <cuda.h>

__global__ void Prime(double* primes, double current)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int iRes;
	double dRes, diRes;

	//If we have not yet reached the end of the array
	if(primes[idx] != -1)
	{
		//If we are past the square root of the number - we don't need to search anymore, it is prime.
		if(pow(primes[idx], 2) > current)
			return;
		//Check whether it divides without remainder
		dRes = current / primes[idx];
		iRes = current / primes[idx];
		diRes = iRes;
		//If it divides without remainder - it is not prime
		while(diRes == dRes)
		{
			primes[0] = -5;	//Use primes[0] as a flag to show that the number isn't a prime
			break;
		}
	}else{
		if(idx != 0 && (primes[idx-1] == current || primes[idx-1] == -1))
			return;
		//If the number is a prime
		if(primes[0] != -5)
			primes[idx] = current;	//Add it to the array
		//Restore original value to primes[0] (the first prime number)
		primes[0] = 2;
	}
}

/** Checks for errors. */
void checkCUDAError(const char *msg)
{
	cudaError_t err = cudaGetLastError();
	if(cudaSuccess != err)
	{
		fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
	}
}

// main routine that executes on the host
int main(void)
{
	double *found_d, *found_h;
	int numbers = 200000;
	size_t size = numbers * sizeof(double);
	found_h = (double *)malloc(size);
	//Fill the array of primes with -1
	for(int i=0; i < numbers; i++)
		found_h[i] = -1;
	//Allocate memory for the array of primes on the DEVICE
	cudaMalloc((void **) &found_d, size);
	cudaMemcpy(found_d, found_h, size, cudaMemcpyHostToDevice);
	//Do calculation on device:
	int block_size = 50, n_blocks = 50;
	//Perform prime check for first number (2)
	Prime <<< n_blocks, block_size >>> (found_d, 2);
	for(int i=3; i < 200; i=i+2)
	{
		Prime <<< n_blocks, block_size >>> (found_d, i);
		checkCUDAError("Kernel error");
	}
	cudaMemcpy(found_h, found_d, size, cudaMemcpyDeviceToHost);
	checkCUDAError("Memcpy device to host");
	for(int i=0; i < numbers; i++)
	{
		if(found_h[i] != -1)
			printf("%lf\n", found_h[i]);
	}
	//Free memory
	cudaFree(found_d);
	free(found_h);
	//Wait for key to exit
	printf("Press any key to exit...\n");
	getch();
	return(0);
}
[/codebox]

Hi,

Looking at your code holistically, I see a few things that might slow it down.

  1. You are using double precision. If you have a GTX 280/260, that will be about 8 times slower than single precision. Bear in mind that anything older than the GT200 does not have hardware double-precision support at all, so it will be far slower than single precision, as each double operation is split into many single operations (and I do mean many).

  2. You read primes[idx] more than once. This causes several global memory reads and really slows down the code, even if the reads are coalesced. You want to do something like this:

double localPrime = primes[idx];

at the beginning, and then use localPrime instead of primes[idx]; that gives you a single coalesced read.
Also, use a register for primes[0], and only write the result back when needed. If I understand correctly, primes[0] is merely used as a flag, so you could declare a bool instead of using global memory.
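As a rough sketch of what the top of the kernel might look like with that change (localPrime is just an illustrative name, the divisibility part is elided, and I also swapped pow(x, 2) for a plain multiply, which is cheaper):

```cuda
__global__ void Prime(double* primes, double current)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	// Single coalesced global read; every later use hits a register instead.
	double localPrime = primes[idx];

	if(localPrime != -1)
	{
		// Past sqrt(current): no divisor can exist, so this thread is done.
		if(localPrime * localPrime > current)
			return;
		// ... divisibility test using localPrime instead of primes[idx] ...
	}
	// ...
}
```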

Another thing to consider is selecting the right block size. The best approach is to make it a multiple of 64, then make sure you are launching enough blocks to fully occupy your multiprocessors.

Keep us updated on how this impacts the performance of your code.

Actually, if you compile this kernel for a pre-GT200 card, the compiler will demote all the doubles to floats and the code will run at full speed. The compiler does not automatically generate code that emulates double precision with single-precision operations; you would have to do double-single emulation yourself, in which case, yes, it is very slow.

Thanks for your ideas. I did the thing with localPrime and there was an improvement.

I don’t know, though, how I can use a bool instead of primes[0]. I need all the threads to know that the number is not a prime, so that bool must be accessible from all threads.

Hm, do you need every thread to know that the number is not prime, or just the threads within a block? If a per-block answer is enough, you can use a shared bool. I also suspect possible synchronization errors in the primes[0] approach: a thread might update primes[0], but threads that read primes[0] before the update lands (a global write takes 300-400 cycles) will not “get the message”. The best way might be to find a totally different signaling mechanism.
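A minimal sketch of that per-block shared flag, assuming a block-local verdict is enough (notPrime is my name; the final write-back step is elided):

```cuda
__global__ void Prime(double* primes, double current)
{
	__shared__ bool notPrime;	// one flag per block, in fast on-chip memory

	if(threadIdx.x == 0)
		notPrime = false;
	__syncthreads();	// flag is initialized before anyone tests a divisor

	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	double localPrime = primes[idx];

	// Any thread that finds a divisor sets the flag; all writers store the same value.
	if(localPrime != -1 && localPrime * localPrime <= current
	   && fmod(current, localPrime) == 0.0)
		notPrime = true;
	__syncthreads();	// all divisor tests are done before the flag is read

	// ... decide, based on notPrime, whether to append current to the array ...
}
```

Note that __shared__ only communicates within one block; if the verdict has to cross blocks, you still need global memory or a separate kernel launch.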

Also, I see you’re using a block size of 50. That creates two warps (32 threads each), so you get 14 wasted threads per block, or about 22%. The best way is to set the block size to a multiple of 64.

Also, try moving the checkCUDAError call outside the for loop. It shouldn’t matter much, but you really don’t need to check for errors after every kernel launch. If one call succeeds, the rest will most likely succeed too, and vice versa. Please edit the code in your first post as you make changes to it.

Curious as to why a multiple of 64 is better than a multiple of 32. Do you always want more than one warp per block? Or is 64 just a better choice than 50 in this particular case?

Thanks.

64, not 32. This is to lower the impact of register memory bank conflicts.

CUDA Programming Guide 2.0, page 67

In reality, the situation is a little more complex. Ideally, you would look at the cubin file, get the shared memory and register usage, and use the occupancy calculator to find the block size with the highest occupancy (which will always be a multiple of 64, for the reason named above).