CUDA - Sieve of Eratosthenes division into parts

I'm writing an implementation of the Sieve of Eratosthenes (Sieve of Eratosthenes - Wikipedia) on the GPU, but not something like this - Developer Resource: Cuda - Sieve of Eratosthenes

Method:

  1. Create an n-element array with default values 0/1 (0 - prime, 1 - not prime) and copy it to the GPU (I know this could be done directly in the kernel, but that is not the issue at the moment).
  2. Each thread in a block checks the multiples of a single number. Each block checks sqrt(n) possibilities in total. Each block == a different interval.
  3. Mark multiples as 1 and copy the data back to the host.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define THREADS 1024

__global__ void kernel(int *global, int threads) {
	extern __shared__ int cache[];

	// Thread tid (1-based) in this block handles the number offset + tid.
	int tid = threadIdx.x + 1;
	int offset = blockIdx.x * blockDim.x;
	int number = offset + tid;

	// Stage this block's interval of the flag array in shared memory.
	cache[tid - 1] = global[number];
	__syncthreads();

	// Each thread scans the block's interval and marks every number
	// (other than tid itself) that is divisible by tid as composite.
	int start = offset + 1;
	int end = offset + threads;

	for (int i = start; i <= end; i++) {
		if ((i != tid) && (tid != 1) && (i % tid == 0)) {
			cache[i - offset - 1] = 1;
		}
	}
	__syncthreads();

	// Write the marked interval back to global memory.
	global[number] = cache[tid - 1];
}

int main(int argc, char *argv[]) {
	int *array, *dev_array;
	int n = atol(argv[1]);
	int n_sqrt = floor(sqrt((double)n));

	size_t array_size = n * sizeof(int);
	array = (int*) malloc(n * sizeof(int));
	array[0] = 1;
	array[1] = 1;
	for (int i = 2; i < n; i++) {
		array[i] = 0;
	}

	cudaMalloc((void**)&dev_array, array_size);
	cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);

	int threads = min(n_sqrt, THREADS);
	int blocks = n / threads;
	int shared = threads * sizeof(int);
	kernel<<<blocks, threads, shared>>>(dev_array, threads);
	cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);

	int count = 0;
	for (int i = 0; i < n; i++) {
		if (array[i] == 0) {
			count++;
		}
	}
	printf("Count: %d\n", count);
	return 0;
}
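
Compile (assuming the source file is saved as sieve.cu):
nvcc -o sieve sieve.cu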

Run:
./sieve 10240000

It works correctly for n = 16, 64, 1024, 102400… but for n = 10240000 I get an incorrect result. Where is the problem?

cross posted:

c - CUDA - Sieve of Eratosthenes division into parts - Stack Overflow

Not sure what you are trying to do in that code, but here is a kindergarten-level implementation which is only a few times faster than a CPU version:

http://pastebin.com/jg9FBUZa

I have a much faster version but that should get you started.
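
In outline, a simple global-memory version of that idea looks something like the sketch below (a minimal illustration, not the code at the link; the kernel name mark_multiples, the 256-thread launch configuration, and the one-thread-per-divisor scheme are just one way to do it): each thread takes one candidate divisor p in [2, sqrt(n)] and marks its multiples.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// One thread per candidate divisor p in [2, sqrt(n)].
// Each thread marks every multiple of its p as composite (1).
// Composite values of p mark redundantly, which wastes some work
// but still yields a correct prime table.
__global__ void mark_multiples(int *flags, int n, int n_sqrt)
{
	int p = blockIdx.x * blockDim.x + threadIdx.x + 2;
	if (p > n_sqrt) return;
	for (long long m = (long long)p * p; m < n; m += p)
		flags[m] = 1;
}

int main(int argc, char *argv[])
{
	if (argc < 2) return 1;
	int n = atoi(argv[1]);                      // assumes n >= 3
	int n_sqrt = (int)floor(sqrt((double)n));
	size_t bytes = n * sizeof(int);

	int *flags = (int*)malloc(bytes);
	for (int i = 0; i < n; i++) flags[i] = 0;
	flags[0] = flags[1] = 1;

	int *dev_flags;
	cudaMalloc((void**)&dev_flags, bytes);
	cudaMemcpy(dev_flags, flags, bytes, cudaMemcpyHostToDevice);

	int threads = 256;
	int blocks = (n_sqrt + threads - 1) / threads;   // covers divisors up to n_sqrt
	mark_multiples<<<blocks, threads>>>(dev_flags, n, n_sqrt);
	cudaMemcpy(flags, dev_flags, bytes, cudaMemcpyDeviceToHost);

	int count = 0;
	for (int i = 0; i < n; i++)
		if (flags[i] == 0) count++;
	printf("Count: %d\n", count);

	cudaFree(dev_flags);
	free(flags);
	return 0;
}

Every thread only ever writes the value 1, so overlapping writes to the same flag all produce the same result in this particular variant.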

When compared to an overclocked 4.5 GHz i7 for primes up to 2^29:

CPU solution timing: 4938
Capable!
CUDA timing: 1294

Success! CPU and GPU primes cache of all numbers up to 536870912 (inclusive) were equal.

The Hybrid CUDA implementation was 3.81607x faster than the serial CPU implementation.

@txbob, yes, but what's the problem?

@CudaaduC, thanks, but I must do it using shared memory. See my updated post. It almost works, but not for all n. If you could test it, I would be grateful.

I didn't say there was a problem. I pointed out the cross-posting so that others who wish to comment here may also review the comments on the cross-posting.

@txbob, I understood you wrong, sorry.

Unless you have some particular insight into this problem, I am not sure that using dynamically allocated shared memory is necessary.
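
For example, since THREADS is already a compile-time constant in the posted code, a statically sized shared array would do. A minimal sketch of just that change (kernel body abbreviated, name kept from the original post):

#define THREADS 1024

// Statically sized shared memory: the extern declaration and the third
// launch parameter are no longer needed. Assumes blockDim.x <= THREADS.
__global__ void kernel(int *global, int threads) {
	__shared__ int cache[THREADS];   // replaces: extern __shared__ int cache[];
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	cache[threadIdx.x] = global[idx];
	__syncthreads();
	// ... marking logic from the original kernel would go here ...
	global[idx] = cache[threadIdx.x];
}

// Launch without the dynamic shared-memory size:
// kernel<<<blocks, threads>>>(dev_array, threads);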

The global update may need to be atomic, and you should run the code through the racecheck tool to verify:

http://docs.nvidia.com/cuda/cuda-memcheck/index.html#axzz3dow395LC
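
For example, assuming the binary is named ./sieve as in the original post:

cuda-memcheck --tool racecheck ./sieve 10240000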