I am still converting serial CPU algorithms to hybrid CUDA implementations, but I am not having much success with this one:

README:

https://github.com/OlegKonings/CUDA_Sieve_of_Eratosthenes/blob/master/README.md

CODE:

https://github.com/OlegKonings/CUDA_Sieve_of_Eratosthenes/blob/master/EXP3/EXP3/EXP3.cu

As usual, I posted both the CPU and GPU versions so people can see how I implemented it. This one was a bit disappointing, with only a 6x-10x speedup over the serial CPU version.
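For anyone who does not want to click through, the kind of serial baseline being compared against is the textbook Sieve of Eratosthenes. Below is a minimal C++ sketch of that general technique. It is not the code from the repo, and the names `serial_sieve` and `count_primes` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the linked repo): returns a table where
// is_prime[i] == 1 exactly when i is prime, for 0 <= i <= n.
std::vector<std::uint8_t> serial_sieve(std::size_t n) {
    std::vector<std::uint8_t> is_prime(n + 1, 1);
    is_prime[0] = 0;
    if (n >= 1) is_prime[1] = 0;

    // Only primes p with p * p <= n need to cross off multiples.
    for (std::size_t p = 2; p * p <= n; ++p) {
        if (!is_prime[p]) continue;
        // Smaller multiples of p were already crossed off by smaller primes,
        // so start at p * p.
        for (std::size_t m = p * p; m <= n; m += p) {
            is_prime[m] = 0;
        }
    }
    return is_prime;
}

// Hypothetical helper: counts the primes in [0, n] using the table above.
std::size_t count_primes(std::size_t n) {
    std::size_t count = 0;
    for (std::uint8_t flag : serial_sieve(n)) {
        count += flag;
    }
    return count;
}
```

The inner loop is a strided write over one large array, so it tends to be limited by memory traffic rather than arithmetic, which is worth keeping in mind when comparing against a GPU version.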

I am not interested in using a library; I would rather get some ideas on how to speed up this version.

Thanks!