The Problem with Primes

boboclock367 · October 5, 2010, 8:31pm

Hey Cuda Community,

I have been running this code to generate a massive number of primes up to a certain number in a brute force manner. My goal is to get this code running on every machine in a computer lab (all machines are identical). The code I currently have runs correctly on a few machines, but I am encountering a number of errors on the rest of them:

The program prints all primes up to a certain number which is lower than the desired maximum
The program prints out many numbers, most of which not prime, up to the maximum (I find this one most odd, as when the program does this the values in PrimeList_h are all 0, however they still print even though printf is within if(PrimeList_h[i] != 0)
Combination of the last two

The code is as follows:

[codebox]

#include <stdio.h>

#include <math.h> 

#include <cuda.h>

#include <cstdio>

    #include <ctime>

__global__ void primalrage(float *PrimeList, int N)

{

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  int n = 1; 

  for(int j = 2; j < i; j++)

  {

  if (i % j == 0) 

  { n = 0; }

  }

  if (n == 1)

  {

  PrimeList[i] = i;

  }

  else

  PrimeList[i] = 0;

 }

int main(void)

{

double temp;

clock_t start;

double diff;

start = clock();

  float *PrimeList_h, *PrimeList_d;  // Pointer to host & device arrays

  const int N = 100000;  // Number of elements in arrays

  size_t size = N * sizeof(float);

  PrimeList_h = (float *)malloc(size);        // Allocate array on host

  cudaMalloc((void **) &PrimeList_d, size);   // Allocate array on device

  int block_size = 256;

  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);

dim3 dimBlock(block_size, block_size);

      dim3 dimGrid(n_blocks, n_blocks);

  primalrage <<< dimGrid, dimBlock >>> (PrimeList_d, N);

  cudaMemcpy(PrimeList_h, PrimeList_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

  for (int i=0; i<N; i++) 

  if (PrimeList_h[i] !=0){

     printf("%d %f\n", i, PrimeList_h[i]);

}

  free(PrimeList_h); cudaFree(PrimeList_d);

diff = (clock() - start) / (double)CLOCKS_PER_SEC;

      printf("Time: %f /n",diff);

}[/codebox]

This is simply an example to show a speed up over a serialized version of the code.

Any input would be greatly appreciated.

EDIT: I have updated this post to reflect a few of the changes I have made based on your suggestions

boboclock367 · October 5, 2010, 8:31pm

Hey Cuda Community,

I have been running this code to generate a massive number of primes up to a certain number in a brute force manner. My goal is to get this code running on every machine in a computer lab (all machines are identical). The code I currently have runs correctly on a few machines, but I am encountering a number of errors on the rest of them:

The program prints all primes up to a certain number which is lower than the desired maximum
The program prints out many numbers, most of which not prime, up to the maximum (I find this one most odd, as when the program does this the values in PrimeList_h are all 0, however they still print even though printf is within if(PrimeList_h[i] != 0)
Combination of the last two

The code is as follows:

[codebox]

#include <stdio.h>

#include <math.h> 

#include <cuda.h>

#include <cstdio>

    #include <ctime>

__global__ void primalrage(float *PrimeList, int N)

{

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  int n = 1; 

  for(int j = 2; j < i; j++)

  {

  if (i % j == 0) 

  { n = 0; }

  }

  if (n == 1)

  {

  PrimeList[i] = i;

  }

  else

  PrimeList[i] = 0;

 }

int main(void)

{

double temp;

clock_t start;

double diff;

start = clock();

  float *PrimeList_h, *PrimeList_d;  // Pointer to host & device arrays

  const int N = 100000;  // Number of elements in arrays

  size_t size = N * sizeof(float);

  PrimeList_h = (float *)malloc(size);        // Allocate array on host

  cudaMalloc((void **) &PrimeList_d, size);   // Allocate array on device

  int block_size = 256;

  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);

dim3 dimBlock(block_size, block_size);

      dim3 dimGrid(n_blocks, n_blocks);

  primalrage <<< dimGrid, dimBlock >>> (PrimeList_d, N);

  cudaMemcpy(PrimeList_h, PrimeList_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

  for (int i=0; i<N; i++) 

  if (PrimeList_h[i] !=0){

     printf("%d %f\n", i, PrimeList_h[i]);

}

  free(PrimeList_h); cudaFree(PrimeList_d);

diff = (clock() - start) / (double)CLOCKS_PER_SEC;

      printf("Time: %f /n",diff);

}[/codebox]

This is simply an example to show a speed up over a serialized version of the code.

Any input would be greatly appreciated.

EDIT: I have updated this post to reflect a few of the changes I have made based on your suggestions

Lev · October 5, 2010, 9:32pm

Block size 1 is anyway bad number, maybe it is limit of block number, 100000 looks too big.
Probably max x-numbers of block is 65535.

Lev · October 5, 2010, 9:32pm

Block size 1 is anyway bad number, maybe it is limit of block number, 100000 looks too big.
Probably max x-numbers of block is 65535.

boboclock367 · October 5, 2010, 10:17pm

Hey Lev,

I have had some interesting issues with block_size. cudaDeviceProp told me my max threads per block is 512. Oddly enough, when I run this code with block_size = 1 it runs fine (primes up to 100,000 in less than 1 second), and when I use a block_size > 512 it also runs fine, but any block_size lower than 512 and the program does something for about 10 seconds, then outputs the runtime and no primes.

Aside from that, I am trying to have this code up and running on a number of machines in a computer lab (all are identical). The machine I developed this code on runs fine most of the time. I am getting the errors I mentioned before when I run the exact same code on every other machine in the lab. The error itself isn’t even consistent.

boboclock367 · October 5, 2010, 10:17pm

Hey Lev,

I have had some interesting issues with block_size. cudaDeviceProp told me my max threads per block is 512. Oddly enough, when I run this code with block_size = 1 it runs fine (primes up to 100,000 in less than 1 second), and when I use a block_size > 512 it also runs fine, but any block_size lower than 512 and the program does something for about 10 seconds, then outputs the runtime and no primes.

Aside from that, I am trying to have this code up and running on a number of machines in a computer lab (all are identical). The machine I developed this code on runs fine most of the time. I am getting the errors I mentioned before when I run the exact same code on every other machine in the lab. The error itself isn’t even consistent.

tera · October 5, 2010, 11:44pm

Your code does out-of-bounds accesses if [font=“Courier New”]N[/font] is not a multiple of the blocksize. This might explain your trouble with blocksizes other than 1.

tera · October 5, 2010, 11:44pm

Your code does out-of-bounds accesses if [font=“Courier New”]N[/font] is not a multiple of the blocksize. This might explain your trouble with blocksizes other than 1.

boboclock367 · October 6, 2010, 12:49am

Hey everyone,

Thanks for all your help so far, but I’m not quite there yet.

I updated my code so it uses dim3:

[codebox]int block_size = 256;

int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);

dim3 dimBlock(block_size, block_size);

dim3 dimGrid(n_blocks, n_blocks);

primalrage <<< dimGrid, dimBlock >>> (PrimeList_d, N);[/codebox]

And now I am not having any issues when I change block_size .

Nevertheless, the code still doesn’t run correctly on all the machines in my lab. I still get different errors on different computers:

The program prints all primes up to a certain number which is lower than the desired maximum
The program prints out many numbers, most of which not prime, up to the maximum (I find this one most odd, as when the program does this the values in PrimeList_h are all 0, however they still print even though printf is within if(PrimeList_h[i] != 0)
Combination of the last two

I have updated the original post to reflect these changes.

boboclock367 · October 6, 2010, 12:49am

Hey everyone,

Thanks for all your help so far, but I’m not quite there yet.

I updated my code so it uses dim3:

[codebox]int block_size = 256;

int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);

dim3 dimBlock(block_size, block_size);

dim3 dimGrid(n_blocks, n_blocks);

primalrage <<< dimGrid, dimBlock >>> (PrimeList_d, N);[/codebox]

And now I am not having any issues when I change block_size .

Nevertheless, the code still doesn’t run correctly on all the machines in my lab. I still get different errors on different computers:

The program prints all primes up to a certain number which is lower than the desired maximum
The program prints out many numbers, most of which not prime, up to the maximum (I find this one most odd, as when the program does this the values in PrimeList_h are all 0, however they still print even though printf is within if(PrimeList_h[i] != 0)
Combination of the last two

I have updated the original post to reflect these changes.

tera · October 6, 2010, 8:18am

Check the return code for errors. In the new example it is guaranteed that your kernel does not launch at all, as a blocksize of 256x256 is far too large.

Insert[font=“Courier New”]
cudaThreadSynchronize();
printf(“%s\n”, cudaGetErrorString(cudaGetLastError()));[/font]
after the kernel call.

tera · October 6, 2010, 8:18am

Check the return code for errors. In the new example it is guaranteed that your kernel does not launch at all, as a blocksize of 256x256 is far too large.

Insert[font=“Courier New”]
cudaThreadSynchronize();
printf(“%s\n”, cudaGetErrorString(cudaGetLastError()));[/font]
after the kernel call.

Lev · October 6, 2010, 10:55am

Code is spoiled even more. need to write

dim3 dimBlock(block_size, 1);
dim3 dimGrid(n_blocks, 1);

and Terra is right, iif you launch 1000 blocks of 256, and N is not 1000*256, you get out of bound access.

Lev · October 6, 2010, 10:55am

Code is spoiled even more. need to write

dim3 dimBlock(block_size, 1);
dim3 dimGrid(n_blocks, 1);

and Terra is right, iif you launch 1000 blocks of 256, and N is not 1000*256, you get out of bound access.

Ken_g6 · October 10, 2010, 4:28pm

If you’re looking for suggestions on how to generate a lot of small (64-bits or less, say) primes with a GPU, allow me to point you to MersenneForum. :)

Ken_g6 · October 10, 2010, 4:28pm

If you’re looking for suggestions on how to generate a lot of small (64-bits or less, say) primes with a GPU, allow me to point you to MersenneForum. :)

Topic		Replies	Views
How to run with large number of blocks? CUDA Programming and Performance	6	675	June 1, 2011
[code review] my 1st CUDA program - need feedback - Sieve of Eratosthenes CUDA Programming and Performance	2	1565	July 17, 2014
[solved] Why can't I compute primes past 85 million? (sieve of eratosthenes) CUDA Programming and Performance	2	976	July 17, 2014
Speed up my program CUDA Programming and Performance	6	4452	November 21, 2008
Number of items that can be processed in CUDA CUDA Programming and Performance cuda , kernel	5	332	February 17, 2024
limit of computation CUDA Programming and Performance	44	32905	April 8, 2008
CUDA Program Issue CUDA Programming and Performance cuda	19	145	September 20, 2024
summation of prime no in CUDA CUDA Programming and Performance	3	2483	November 3, 2016
String search with many threads CUDA Programming and Performance	11	5940	November 5, 2010
Size limitation for 1D Arrays in CUDA? CUDA Programming and Performance	9	18290	October 17, 2013

The Problem with Primes

Related topics