I was wondering how big the overhead of device emulation mode is for a piece of CUDA code, compared to code written directly for the CPU.
The reason I ask is that I have a piece of code I wrote for the GPU and no CPU version to benchmark against, so I use device emulation mode to compare CPU vs. GPU performance of the code.
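To make the comparison concrete, here is roughly the harness I use (my own simplified sketch, not SDK code); I build it once with plain nvcc and once with nvcc -deviceemu, the emulation flag from the older toolkits, and compare the reported times:

// Rough sketch of my timing harness (my own, not from the SDK). Built as
// "nvcc bench.cu -o bench_gpu" and "nvcc -deviceemu bench.cu -o bench_emu".
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float *x, int n)
{
    // Placeholder work; in my real code this is the kernel under test.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("kernel: %.3f ms\n", ms);

    cudaFree(d);
    return 0;
}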
I played with some examples from the CUDA_SDK and found that in the case of the N-body code, computing the mutual gravitational forces of N particles has very little overhead: the emulated performance lands somewhere between what the Intel C and gcc compilers produce for a similar piece of code written for the CPU.
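For reference, the kind of O(N^2) force loop I mean looks roughly like this (my own simplified global-memory version, not the actual SDK kernel):

// Pairwise gravitational forces; pos[i].w holds the particle mass.
__global__ void forces(const float4 *pos, float4 *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + 1e-9f;  // softened to avoid r = 0
        float inv_r = rsqrtf(r2);
        float s = pj.w * inv_r * inv_r * inv_r;          // m_j / r^3
        a.x += s * dx; a.y += s * dy; a.z += s * dz;
    }
    acc[i] = make_float4(a.x, a.y, a.z, 0.0f);
}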
Yet when I compared RadixSort from the particles CUDA_SDK example, I found that in device emulation mode the code is about four orders of magnitude slower than the same code running directly on the GPU.
I am quite curious what generally produces the most overhead in device emulation mode: shared memory, thread parallelization, something else?
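My naive guess is that barriers are the worst case. As far as I understand, in emulation mode every CUDA thread becomes a host thread, so a kernel like this toy scan (my own example, assuming blockDim.x <= 256), which hits __syncthreads() in every iteration, would turn each barrier into an OS-level rendezvous of hundreds of host threads:

// Toy Hillis-Steele scan of one block, heavy on shared memory and barriers.
__global__ void scan_block(float *data)
{
    __shared__ float buf[256];
    int t = threadIdx.x;
    buf[t] = data[blockIdx.x * blockDim.x + t];
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float v = (t >= stride) ? buf[t - stride] : 0.0f;
        __syncthreads();
        buf[t] += v;
        __syncthreads();  // in emulation, every barrier reschedules all host threads
    }
    data[blockIdx.x * blockDim.x + t] = buf[t];
}

Is that the right mental model, or does something else (e.g. shared-memory access checking) dominate?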
Thank you all for your help,