Speed up my program

Hey. I coded a program to find prime numbers, but it is really slow - slower than on the CPU. I wonder how I can speed it up.

The code:

[codebox]#include <conio.h>
#include <stdio.h>
#include <cuda.h>

__global__ void Prime(double* primes, double current)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int iRes;
	double dRes, diRes;

	//If we have not yet reached the end of the array
	if(primes[idx] != -1)
	{
		//If we are past the square root of the number - we don't need to search anymore, it is prime.
		if(pow(primes[idx], 2) > current)
			return;
		//Check whether it divides without remainder
		dRes = current / primes[idx];
		iRes = current / primes[idx];
		diRes = iRes;
		//If it divides without remainder - it is not prime
		while(diRes == dRes)
		{
			primes[0] = -5;	//Use primes[0] as a flag to show that the number isn't a prime
			break;
		}
	}else{
		if(idx != 0 && (primes[idx-1] == current || primes[idx-1] == -1))
			return;
		//If the number is a prime
		if(primes[0] != -5)
			primes[idx] = current;	//Add it to the array
		//Restore original value to primes[0] (the first prime number)
		primes[0] = 2;
	}
}

/** Checks for errors. */
void checkCUDAError(const char *msg)
{
	cudaError_t err = cudaGetLastError();
	if(cudaSuccess != err)
	{
		fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
	}
}

// main routine that executes on the host
int main(void)
{
	double *found_d, *found_h;
	int numbers = 200000;
	size_t size = numbers * sizeof(double);
	found_h = (double *)malloc(size);
	//Fill the array of primes with -1
	for(int i=0; i < numbers; i++)
		found_h[i] = -1;
	//Allocate memory for the array of primes on the DEVICE
	cudaMalloc((void **) &found_d, size);
	cudaMemcpy(found_d, found_h, size, cudaMemcpyHostToDevice);
	//Do calculation on device:
	int block_size = 50, n_blocks = 50;
	//Perform prime check for first number (2)
	Prime <<< n_blocks, block_size >>> (found_d, 2);
	for(int i=3; i < 200; i=i+2)
	{
		Prime <<< n_blocks, block_size >>> (found_d, i);
		checkCUDAError("Kernel error");
	}
	cudaMemcpy(found_h, found_d, size, cudaMemcpyDeviceToHost);
	checkCUDAError("Memcpy device to host");
	for(int i=0; i < numbers; i++)
	{
		if(found_h[i] != -1)
			printf("%lf\n", found_h[i]);
	}
	//Free memory
	cudaFree(found_d);
	free(found_h);
	//Wait for key to exit
	printf("Press any key to exit...\n");
	getch();
	return(0);
}
[/codebox]

Hi,

Looking at your code holistically, I see a few things that might slow it down.

  1. You are using double precision. If you have a GTX 280/260, that will be about 8 times slower than single precision. Bear in mind that anything older than the GT200 does not have hardware double-precision support at all, so it will be far slower than single precision, as each double operation is split into many single operations (and I do mean many).

  2. You read primes[idx] more than once. This causes several global memory reads and really slows down the code, even if the reads are coalesced. You want to do something like this:

double localPrime = primes[idx];

at the beginning, and then use localPrime instead of primes[idx]; that gives you a single coalesced read.
Also, use a register for primes[0], and only write the result back when needed. If I understand correctly, primes[0] is merely used as a flag, so you could declare a bool instead of using global memory.
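As a rough sketch of what the top of the kernel might look like with that change (localPrime is just an illustrative name, the divisibility part is elided, and I also swapped pow(x, 2) for a plain multiply, which is cheaper):

```cuda
__global__ void Prime(double* primes, double current)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	// Single coalesced global read; every later use hits a register instead.
	double localPrime = primes[idx];

	if(localPrime != -1)
	{
		// Past sqrt(current): no divisor can exist, so this thread is done.
		if(localPrime * localPrime > current)
			return;
		// ... divisibility test using localPrime instead of primes[idx] ...
	}
	// ...
}
```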

Another thing to consider is selecting the right block size. The best approach is to make it a multiple of 64, then make sure you are launching enough blocks to fully occupy your multiprocessors.

Keep us updated on how this impacts the performance of your code.

Actually, if you compile this kernel for a pre-GT200 card, the compiler will demote all the doubles to floats and the code will run at full speed. The compiler does not automatically generate code that emulates double precision with single-precision operations; you would have to do double-single emulation yourself, in which case, yes, it is very slow.

Thanks for your ideas. I did the thing with localPrime and there was an improvement.

I don’t know, though, how I can use a bool instead of primes[0]. I need all the threads to know that the number is not a prime, so that bool must be accessible from all threads.

Hm, do you need every thread to know that the number is not prime, or just the threads within a block? If a per-block answer is enough, you can use a shared bool. I also suspect possible synchronization errors in the primes[0] approach: a thread might update primes[0], but threads that read primes[0] before the update lands (a global write takes 300-400 cycles) will not “get the message”. The best way might be to find a totally different signaling mechanism.
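A minimal sketch of that per-block shared flag, assuming a block-local verdict is enough (notPrime is my name; the final write-back step is elided):

```cuda
__global__ void Prime(double* primes, double current)
{
	__shared__ bool notPrime;	// one flag per block, in fast on-chip memory

	if(threadIdx.x == 0)
		notPrime = false;
	__syncthreads();	// flag is initialized before anyone tests a divisor

	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	double localPrime = primes[idx];

	// Any thread that finds a divisor sets the flag; all writers store the same value.
	if(localPrime != -1 && localPrime * localPrime <= current
	   && fmod(current, localPrime) == 0.0)
		notPrime = true;
	__syncthreads();	// all divisor tests are done before the flag is read

	// ... decide, based on notPrime, whether to append current to the array ...
}
```

Note that __shared__ only communicates within one block; if the verdict has to cross blocks, you still need global memory or a separate kernel launch.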

Also, I see you’re using a block size of 50. That creates two warps (32 threads each), so you get 14 wasted threads per block, or about 22%. The best way is to set the block size to a multiple of 64.

Also, try moving the checkCUDAError call outside the for loop. It shouldn’t matter much, but you really don’t need to check for errors after every kernel launch. If one call succeeds, the rest will most likely succeed too, and vice versa. Please edit the code in your first post as you make changes to it.

Curious as to why a multiple of 64 is better than a multiple of 32. Do you always want more than one warp per block? Or is 64 just a better choice than 50 in this particular case?

Thanks.

64, not 32. This is to lower the impact of register memory bank conflicts.

CUDA Programming Guide 2.0, page 67

In reality, the situation is a little more complex. Ideally, you would look at the cubin file, get the shared memory and register usage, and use the occupancy calculator to find the block size with the highest occupancy (which will always be a multiple of 64, for the reason named above).