CUDA - Sieve of Eratosthenes division into parts

I’m writing an implementation of the Sieve of Eratosthenes on the GPU, but not the classic one; rather something like this:


  1. Create an n-element array with default values 0/1 (0 = prime, 1 = composite) and pass it to the GPU (I know this can be done directly in the kernel, but that’s not the problem at the moment).
  2. Each thread in a block checks the multiples of a single number. Each block checks sqrt(n) possibilities in total. Each block == a different interval.
  3. Mark the multiples as 1 and pass the data back to the host.


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define THREADS 1024

__global__ void kernel(int *global, int threads) {
	extern __shared__ int cache[];

	int tid = threadIdx.x + 1;
	int offset = blockIdx.x * blockDim.x;
	int number = offset + tid;

	cache[tid - 1] = global[number];
	__syncthreads();	/* every slot is loaded before anyone marks */

	int start = offset + 1;
	int end = offset + threads;

	for (int i = start; i <= end; i++) {
		if ((i != tid) && (tid != 1) && (i % tid == 0)) {
			cache[i - offset - 1] = 1;
		}
	}
	__syncthreads();	/* all marks are visible before writing back */

	global[number] = cache[tid - 1];
}

int main(int argc, char *argv[]) {
	int *array, *dev_array;
	int n = atoi(argv[1]);
	int n_sqrt = floor(sqrt((double)n));

	size_t array_size = n * sizeof(int);
	array = (int*) malloc(array_size);
	array[0] = 1;
	array[1] = 1;
	for (int i = 2; i < n; i++) {
		array[i] = 0;
	}
	cudaMalloc((void**)&dev_array, array_size);
	cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);

	int threads = (n_sqrt < THREADS) ? n_sqrt : THREADS;
	int blocks = n / threads;
	int shared = threads * sizeof(int);
	kernel<<<blocks, threads, shared>>>(dev_array, threads);
	cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);

	int count = 0;
	for (int i = 0; i < n; i++) {
		if (array[i] == 0) {
			count++;
		}
	}
	printf("Count: %d\n", count);

	free(array);
	cudaFree(dev_array);
	return 0;
}
./sieve 10240000

It works correctly for n = 16, 64, 1024, 102400… but for n = 10240000 I get an incorrect result. Where is the problem?

cross posted:

Not sure what you are trying to do in that code, but here is a kindergarten-level implementation which is only a few times faster than a CPU version;

I have a much faster version but that should get you started.

When compared to an overclocked 4.5 GHz i7, for all primes up to 2^29:

CPU solution timing: 4938
CUDA timing: 1294

Success! CPU and GPU prime caches of all numbers up to 536870912 (inclusive) were equal.

The Hybrid CUDA implementation was 3.81607 times faster than the serial CPU implementation.

@txbob, yes but what’s the problem?

@CudaaduC, thanks, but I must do it using shared memory. See my updated post. It almost works, but not for all n. If you could test it I would be grateful.

I didn’t say there was a problem. I pointed out the cross-posting so that others who wish to comment here may also review the comments on the cross-post.

@txbob, I understood you wrong, sorry.

Unless you have some particular insight to this problem, I am not sure that using dynamically allocated shared memory is necessary.

The global update may need to be atomic, and you should run the code through racecheck to verify.
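A hedged sketch of what an atomic variant might look like: the kernel below mirrors the original indexing but marks the global array directly with `atomicOr`, so concurrent writes to the same slot are well-defined. The extra `n` parameter and bounds check are my additions, not from the post; the race itself can be confirmed with `cuda-memcheck --tool racecheck` (or `compute-sanitizer --tool racecheck` on newer toolkits).

```cuda
// Sketch, not the poster's code: mark composites straight in global
// memory with atomicOr instead of staging them in shared memory.
__global__ void kernel_atomic(int *global, int threads, int n) {
	int tid = threadIdx.x + 1;               // divisor tested by this thread
	int offset = blockIdx.x * blockDim.x;    // start of this block's interval

	// Each block scans its own interval of numbers.
	for (int i = offset + 1; i <= offset + threads && i < n; i++) {
		if ((i != tid) && (tid != 1) && (i % tid == 0)) {
			// atomicOr makes concurrent marks of the same slot well-defined
			atomicOr(&global[i], 1);
		}
	}
}
```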