The Problem with Primes

Hey Cuda Community,

I have been running this code to generate a massive number of primes up to a certain number in a brute force manner. My goal is to get this code running on every machine in a computer lab (all machines are identical). The code I currently have runs correctly on a few machines, but I am encountering a number of errors on the rest of them:

  • The program prints all primes up to a certain number which is lower than the desired maximum

  • The program prints out many numbers, most of which not prime, up to the maximum (I find this one most odd, as when the program does this the values in PrimeList_h are all 0, however they still print even though printf is within if(PrimeList_h[i] != 0)

  • Combination of the last two

The code is as follows:

[codebox]

#include <stdio.h>

#include <math.h> 

#include <cuda.h>

#include <cstdio>

    #include <ctime>

__global__ void primalrage(float *PrimeList, int N)

{

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  int n = 1; 

  for(int j = 2; j < i; j++)

  {

  if (i % j == 0) 

  { n = 0; }

  }

  if (n == 1)

  {

  PrimeList[i] = i;

  }

  else

  PrimeList[i] = 0;

 }

int main(void)

{

double temp;

clock_t start;

double diff;

start = clock();

  float *PrimeList_h, *PrimeList_d;  // Pointer to host & device arrays

  const int N = 100000;  // Number of elements in arrays

  size_t size = N * sizeof(float);

  PrimeList_h = (float *)malloc(size);        // Allocate array on host

  cudaMalloc((void **) &PrimeList_d, size);   // Allocate array on device

  int block_size = 256;

  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);

dim3 dimBlock(block_size, block_size);

      dim3 dimGrid(n_blocks, n_blocks);

  primalrage <<< dimGrid, dimBlock >>> (PrimeList_d, N);

  cudaMemcpy(PrimeList_h, PrimeList_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

  for (int i=0; i<N; i++) 

  if (PrimeList_h[i] !=0){

     printf("%d %f\n", i, PrimeList_h[i]);

}

  free(PrimeList_h); cudaFree(PrimeList_d);

diff = (clock() - start) / (double)CLOCKS_PER_SEC;

      printf("Time: %f /n",diff);

}[/codebox]

This is simply an example to show a speed up over a serialized version of the code.

Any input would be greatly appreciated.

EDIT: I have updated this post to reflect a few of the changes I have made based on your suggestions

Hey Cuda Community,

I have been running this code to generate a massive number of primes up to a certain number in a brute force manner. My goal is to get this code running on every machine in a computer lab (all machines are identical). The code I currently have runs correctly on a few machines, but I am encountering a number of errors on the rest of them:

  • The program prints all primes up to a certain number which is lower than the desired maximum

  • The program prints out many numbers, most of which not prime, up to the maximum (I find this one most odd, as when the program does this the values in PrimeList_h are all 0, however they still print even though printf is within if(PrimeList_h[i] != 0)

  • Combination of the last two

The code is as follows:

[codebox]

#include <stdio.h>

#include <math.h> 

#include <cuda.h>

#include <cstdio>

    #include <ctime>

__global__ void primalrage(float *PrimeList, int N)

{

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  int n = 1; 

  for(int j = 2; j < i; j++)

  {

  if (i % j == 0) 

  { n = 0; }

  }

  if (n == 1)

  {

  PrimeList[i] = i;

  }

  else

  PrimeList[i] = 0;

 }

int main(void)

{

double temp;

clock_t start;

double diff;

start = clock();

  float *PrimeList_h, *PrimeList_d;  // Pointer to host & device arrays

  const int N = 100000;  // Number of elements in arrays

  size_t size = N * sizeof(float);

  PrimeList_h = (float *)malloc(size);        // Allocate array on host

  cudaMalloc((void **) &PrimeList_d, size);   // Allocate array on device

  int block_size = 256;

  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);

dim3 dimBlock(block_size, block_size);

      dim3 dimGrid(n_blocks, n_blocks);

  primalrage <<< dimGrid, dimBlock >>> (PrimeList_d, N);

  cudaMemcpy(PrimeList_h, PrimeList_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

  for (int i=0; i<N; i++) 

  if (PrimeList_h[i] !=0){

     printf("%d %f\n", i, PrimeList_h[i]);

}

  free(PrimeList_h); cudaFree(PrimeList_d);

diff = (clock() - start) / (double)CLOCKS_PER_SEC;

      printf("Time: %f /n",diff);

}[/codebox]

This is simply an example to show a speed up over a serialized version of the code.

Any input would be greatly appreciated.

EDIT: I have updated this post to reflect a few of the changes I have made based on your suggestions

Block size 1 is anyway bad number, maybe it is limit of block number, 100000 looks too big.
Probably max x-numbers of block is 65535.

Block size 1 is anyway bad number, maybe it is limit of block number, 100000 looks too big.
Probably max x-numbers of block is 65535.

Hey Lev,

I have had some interesting issues with block_size. cudaDeviceProp told me my max threads per block is 512. Oddly enough, when I run this code with block_size = 1 it runs fine (primes up to 100,000 in less than 1 second), and when I use a block_size > 512 it also runs fine, but any block_size lower than 512 and the program does something for about 10 seconds, then outputs the runtime and no primes.

Aside from that, I am trying to have this code up and running on a number of machines in a computer lab (all are identical). The machine I developed this code on runs fine most of the time. I am getting the errors I mentioned before when I run the exact same code on every other machine in the lab. The error itself isn’t even consistent.

Hey Lev,

I have had some interesting issues with block_size. cudaDeviceProp told me my max threads per block is 512. Oddly enough, when I run this code with block_size = 1 it runs fine (primes up to 100,000 in less than 1 second), and when I use a block_size > 512 it also runs fine, but any block_size lower than 512 and the program does something for about 10 seconds, then outputs the runtime and no primes.

Aside from that, I am trying to have this code up and running on a number of machines in a computer lab (all are identical). The machine I developed this code on runs fine most of the time. I am getting the errors I mentioned before when I run the exact same code on every other machine in the lab. The error itself isn’t even consistent.

Your code does out-of-bounds accesses if [font=“Courier New”]N[/font] is not a multiple of the blocksize. This might explain your trouble with blocksizes other than 1.

Your code does out-of-bounds accesses if [font=“Courier New”]N[/font] is not a multiple of the blocksize. This might explain your trouble with blocksizes other than 1.

Hey everyone,

Thanks for all your help so far, but I’m not quite there yet.

I updated my code so it uses dim3:

[codebox]int block_size = 256;

int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);

dim3 dimBlock(block_size, block_size);

dim3 dimGrid(n_blocks, n_blocks);

primalrage <<< dimGrid, dimBlock >>> (PrimeList_d, N);[/codebox]

And now I am not having any issues when I change block_size .

Nevertheless, the code still doesn’t run correctly on all the machines in my lab. I still get different errors on different computers:

  • The program prints all primes up to a certain number which is lower than the desired maximum

  • The program prints out many numbers, most of which not prime, up to the maximum (I find this one most odd, as when the program does this the values in PrimeList_h are all 0, however they still print even though printf is within if(PrimeList_h[i] != 0)

  • Combination of the last two

I have updated the original post to reflect these changes.

Hey everyone,

Thanks for all your help so far, but I’m not quite there yet.

I updated my code so it uses dim3:

[codebox]int block_size = 256;

int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);

dim3 dimBlock(block_size, block_size);

dim3 dimGrid(n_blocks, n_blocks);

primalrage <<< dimGrid, dimBlock >>> (PrimeList_d, N);[/codebox]

And now I am not having any issues when I change block_size .

Nevertheless, the code still doesn’t run correctly on all the machines in my lab. I still get different errors on different computers:

  • The program prints all primes up to a certain number which is lower than the desired maximum

  • The program prints out many numbers, most of which not prime, up to the maximum (I find this one most odd, as when the program does this the values in PrimeList_h are all 0, however they still print even though printf is within if(PrimeList_h[i] != 0)

  • Combination of the last two

I have updated the original post to reflect these changes.

Check the return code for errors. In the new example it is guaranteed that your kernel does not launch at all, as a blocksize of 256x256 is far too large.

Insert[font=“Courier New”]
cudaThreadSynchronize();
printf(“%s\n”, cudaGetErrorString(cudaGetLastError()));[/font]
after the kernel call.

Check the return code for errors. In the new example it is guaranteed that your kernel does not launch at all, as a blocksize of 256x256 is far too large.

Insert[font=“Courier New”]
cudaThreadSynchronize();
printf(“%s\n”, cudaGetErrorString(cudaGetLastError()));[/font]
after the kernel call.

Code is spoiled even more. need to write

dim3 dimBlock(block_size, 1);
dim3 dimGrid(n_blocks, 1);

and Terra is right, iif you launch 1000 blocks of 256, and N is not 1000*256, you get out of bound access.

Code is spoiled even more. need to write

dim3 dimBlock(block_size, 1);
dim3 dimGrid(n_blocks, 1);

and Terra is right, iif you launch 1000 blocks of 256, and N is not 1000*256, you get out of bound access.

If you’re looking for suggestions on how to generate a lot of small (64-bits or less, say) primes with a GPU, allow me to point you to MersenneForum. :)

If you’re looking for suggestions on how to generate a lot of small (64-bits or less, say) primes with a GPU, allow me to point you to MersenneForum. :)