I was wondering how big the overhead of device emulation mode is for a piece of CUDA code, compared to code written directly for the CPU.
The reason I ask is that I have a piece of code which I wrote for the GPU, and I have no CPU version to benchmark against. Therefore, I use device emulation mode to benchmark CPU vs GPU performance of the code.
I played with some examples from the CUDA_SDK and found that in the case of the N-body code, computing the mutual gravitational forces of N particles, emulation has very little overhead: the performance falls somewhere between that of a similar piece of CPU code compiled with the Intel C compiler and with gcc.
Yet with the RadixSort from the particles CUDA_SDK example, I found that in device emulation mode the code is about 4 orders of magnitude slower than when it runs directly on the GPU.
I am quite curious what generally produces the most overhead in computations in device emulation mode: shared memory, thread parallelization, etc.
And what’s the problem with porting your code to the CPU and compiling it with a good compiler and reasonable optimizations? The GPU->CPU conversion shouldn’t be difficult, IMO.
Device emulation works by creating a CPU thread for each thread in your grid and running them one by one, so I doubt such an approach will give satisfactory performance (especially if the kernel running time is small). I definitely wouldn’t use device emulation for benchmarking CPU code.
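To make that concrete, here is roughly what that execution model amounts to. This is just a sketch of the idea, not nvcc’s actual emulation code; the kernel body, the thread counts, and all the names (fake_kernel_body, emu_thread, etc.) are made up for illustration:

#include <pthread.h>
#include <stdio.h>

/* Stand-in for a kernel body: ~100 flops per "device" thread. */
static void fake_kernel_body(int global_tid, float *out)
{
    float x = (float)global_tid;
    for (int i = 0; i < 100; ++i)
        x = x * 1.0001f + 0.5f;
    out[global_tid] = x;
}

struct emu_arg { int global_tid; float *out; };

static void *emu_thread(void *p)
{
    struct emu_arg *a = (struct emu_arg *)p;
    fake_kernel_body(a->global_tid, a->out);
    return NULL;
}

int main(void)
{
    enum { THREADS_PER_BLOCK = 128, NUM_BLOCKS = 100 };
    static float out[THREADS_PER_BLOCK * NUM_BLOCKS];

    /* one host thread per logical device thread, run one after another */
    for (int b = 0; b < NUM_BLOCKS; ++b) {
        for (int t = 0; t < THREADS_PER_BLOCK; ++t) {
            struct emu_arg arg = { b * THREADS_PER_BLOCK + t, out };
            pthread_t tid;
            pthread_create(&tid, NULL, emu_thread, &arg);
            pthread_join(tid, NULL);   /* joined before arg goes out of scope */
        }
    }
    printf("done, out[0] = %f\n", out[0]);
    return 0;
}

Compile with gcc -pthread. Even with only ~100 flops per thread, the create/join cost of each host thread (typically on the order of microseconds) dwarfs the body itself, which is why kernels with very short threads suffer the most in emulation.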
I’d say time. The code is quite big, and porting it to the CPU would be quite a time-consuming task without much return, especially if it is only going to be used for benchmarking purposes.
Well, this is in fact my question: under which circumstances, if anybody knows, is this overhead minimal or maximal? As I mentioned above, I ran into a few examples, such as N-body and the RadixSort in Particles from the NVIDIA_CUDA_SDK, where the overhead is minimal (N-body) or large (RadixSort). Nevertheless, I have failed to figure out the cause of this difference.
I have a few pieces of computation- and bandwidth-dominated CUDA code which are as fast in device emulation mode as their optimised CPU counterparts.
Moreover, some of my cuda codes which have no CPU version perform just 50-100x slower in the device emulation mode than when run directly on the device.
Therefore, the claim that device emulation mode is 2000x slower than the GPU is not general. But I do have at least one example of a CUDA code which is ridiculously slow in device emulation mode (2000-10000x slower), so I am curious under which circumstances this occurs.
I profiled the CUDA codes in device emulation mode, and there does not seem to be much CUDA overhead, but I am not sure whether the “-pg” flag is properly processed by the nvcc compiler. So I am quite puzzled about how to get to the bottom of this.
Those high factors occur if the occupancy is high and your code is very well suited to a parallel implementation. Creating a zillion threads on the CPU will be much slower.
(and this doesn’t mean an optimal CPU implementation will be slow, just that the nvidia emulator is slow in this case, so it’s a bad benchmark)
BTW, I remember someone wrote a more efficient C-level CUDA emulator, somewhere on the forums. Maybe it’s a good idea to look that up.
I’ve always thought that at most 768 threads can run in parallel, which means every batch carries the same creation overhead of at most 768 threads. In addition, I think it should not matter how many threads you have, as every thread always comes with its own start/exit overhead. In other words, I’d expect emulation time to be roughly (number of threads) x (per-thread overhead + per-thread work), so whether you run 100 threads or a zillion, the overhead should in theory be the same percentage of the total.
I’ve checked the CUDA codes, and all of them use a number of registers, an amount of shared memory, and a block size such that the occupancy is about 25-50%.
Could it be something much more subtle?
I’ll have a look through forums. Thanks for the tip!
One key kernel in my app:
GPU: ~2 ms
CPU: ~0.2 s
Emulation: ~1.0 s
So, in this particular case emulation is only 5 times slower than the CPU version.
Emulation mode is meant as a debugging tool, not as an efficient implementation for production code. I would guess that the examples of yours that perform similarly to optimized CPU code probably execute a very small grid, hence little thread overhead. In general, the emulator is going to be very slow compared to optimized CPU code, and the larger the grid, the greater the overhead.
I don’t think the profiler will show the thread scheduling overhead, but I might be wrong. And you will get varying behaviour on Linux and Windows. I have some threaded code that performs perfectly fine on Linux and Windows when threads = cores. For fun, if I increase the number of threads beyond the number of cores, the performance stays the same on Linux but degrades greatly on Windows, presumably due to inefficiencies in the Windows thread scheduler. With GPU emulation mode, hundreds to thousands of threads are being launched!
Just listen to us and never trust emulation mode as a reliable performance estimate. If writing CPU code just for benchmarking purposes is too time consuming, then don’t do it. You can tell how efficiently you are using the device by calculating your effective GFLOP/s and GB/s rates.
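For example, something along these lines with CUDA event timing. The kernel here is only a placeholder, and the 2 flops / 8 bytes per element are hand-counted for that placeholder; substitute your own kernel and its own counts:

#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel: 2 flops and 8 bytes of memory traffic per element
__global__ void my_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 22;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // hand-counted flops and bytes for this placeholder kernel
    double flops = 2.0 * n;
    double bytes = 8.0 * n;
    printf("kernel time: %.3f ms\n", ms);
    printf("effective rate: %.1f GFLOP/s, %.1f GB/s\n",
           flops / (ms * 1e6), bytes / (ms * 1e6));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Compare the printed numbers against the peak GFLOP/s and memory bandwidth of your card to see how close to the hardware limits you are.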
Funny! I was not aware of this Windows feature. For fun, I’ve just carried out a few tests and found that whether I have 2 threads (= cores) per block in device emulation mode or 128 (with 102656 threads in total), the performance is the same, while on the device the performance drops by a factor of 4 in the 2-threads-per-block case. The system is Debian, kernel 2.6.21, x86_64; I have no access to a Windows PC to check this there. So the thread hypothesis can probably be ruled out, for the reason I described in my previous post, at least on Linux and for threads with a long enough execution time: otherwise the dominant contribution would come not from the execution of a thread but from the overhead of creating it. Please correct me if I am wrong.
— Update —
I found a small piece of my CUDA code which executes 102656 threads in blocks of 128, and the device_emulation/GPU ratio is about 1000. Each thread consists of just ~100 flops!
It appears then, as was pointed out above, that thread launch/exit can be a huge overhead in device emulation mode if the thread execution time is too short.
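If anybody wants to double-check this on their own box, a quick-and-dirty host-side test like the following (plain pthreads; this has nothing to do with nvcc’s emulator internals, it just measures raw create/join cost) should give a feel for the per-thread overhead:

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

static void *empty_thread(void *arg) { (void)arg; return NULL; }

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    const int n = 102656;   /* same thread count as the kernel above */
    double t0 = wall_seconds();
    for (int i = 0; i < n; ++i) {
        pthread_t tid;
        pthread_create(&tid, NULL, empty_thread, NULL);
        pthread_join(tid, NULL);
    }
    double t1 = wall_seconds();
    printf("%d create/join pairs: %.3f s (%.1f us per thread)\n",
           n, t1 - t0, 1e6 * (t1 - t0) / n);
    return 0;
}

If one create/join pair costs on the order of 10 microseconds, then 102656 of them already account for about a second of pure overhead, while 100 flops per thread is negligible on the host, which would be consistent with the factor of ~1000 above.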
Oh, I get about 100 GFLOP/s and 20-30 GB/s, so I am quite happy.