Need help speeding up a CUDA hybrid version of the Sieve of Eratosthenes

Still converting CPU serial algorithms to hybrid CUDA implementations, but I'm not having much success with this one.



As usual I posted both the CPU and GPU versions, so people can see how I implemented it. This one was a bit disappointing, with only a 6x-10x speed increase over the serial CPU version.

I am not interested in using any library; rather, I would like some ideas on how to speed up this version.


Sorry, I don't have time to look at the code, but wouldn't this be largely limited by global memory bandwidth? That is about 3.5x higher on a K20 than on a four-channel Xeon platform. The sieve requires marking numbers, presumably represented by individual bits, across several GB of memory. One can of course use caching techniques, but with the relatively small amount of shared memory there is only so much one can do in that regard.

Last time I did this sieve I posted the code to this forum - just use the search feature. I remember using shared memory and bit masks to mark the divisible numbers.


Yes, the global memory bandwidth seems to be the issue in this case.

Thanks! I will look at your approach. For my first attempt I used the uchar1 type, so I can only hold about 1,073,741,824 numbers at this point.

Switching to a bit representation will of course reduce the bandwidth requirements, but it will do so equally for the GPU and the CPU implementations. Since the computational density of the sieve is just not very high, my expectation is that it will remain memory bound, meaning the GPU/CPU performance ratio will tend to follow the ratio of memory throughputs: about 7-8x against a host with two memory channels and 3.5-4x against a host platform with four memory channels. A bit higher if you turn off ECC on the GPU.