device emulation mode C

Dear All,

I was wondering how big the overhead of device emulation mode is for a piece of CUDA code, compared to code written directly for the CPU.

The reason I ask is that I have a piece of code which I wrote for the GPU, and I have no CPU version to benchmark against. Therefore, I use device emulation mode to benchmark CPU vs. GPU performance of the code.

I played with some examples from the CUDA_SDK and found that in the case of the N-body code, computing mutual gravitational forces of N particles has very little overhead: the emulated performance falls somewhere between the Intel C and gcc compilers when a similar piece of code is written for the CPU.

Yet comparing the RadixSort from the particles CUDA_SDK example, I found that in device emulation mode the code is about four orders of magnitude slower than the code which runs directly on the GPU.

I am quite curious what generally produces the most overhead in computations in device emulation mode: shared memory, thread parallelization, etc.

Thank you all for your help,
Evghenii

And what’s the problem with porting your code to the CPU and compiling it with a good compiler and reasonable optimizations? It won’t be difficult to do the GPU->CPU conversion, IMO.

Device emulation works by creating a CPU thread for each thread in your grid and running them one by one. So, I doubt such an approach will give satisfactory performance (especially if the kernel running time is small). I definitely wouldn’t use device emulation for benchmarking CPU code.
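Just to illustrate the point (this is not the actual nvcc emulation code, only a sketch of the idea): the GPU launches the per-thread work below once for the whole grid, while an emulator that creates and joins a host thread per CUDA thread pays that overhead for every single logical thread, which easily dwarfs a handful of flops.

// Illustrative sketch only -- not the real nvcc emulator, just the idea that
// per-host-thread overhead swamps a kernel doing very little work per thread.
#include <cuda_runtime.h>
#include <thread>
#include <cstdio>

__global__ void scale(float *x, float a, int n)           // the GPU version
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                                  // ~1 flop of real work
}

void scale_one(float *x, float a, int i) { x[i] *= a; }    // work of one "thread"

// What emulation conceptually does: one host thread per CUDA thread,
// run one by one.  Creating and joining each host thread costs far more
// than the single multiply it performs.
void scale_emulated(float *x, float a, int n)
{
    for (int i = 0; i < n; ++i) {
        std::thread t(scale_one, x, a, i);
        t.join();
    }
}

int main()
{
    const int n = 1 << 16;
    float *x = new float[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    scale_emulated(x, 2.0f, n);                            // host-side "emulation"
    std::printf("x[0] = %f\n", x[0]);
    delete[] x;
    return 0;
}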

I’d say time. The code is quite big, and porting it to the CPU would be quite a time-consuming task without much return, especially if it is going to be used for benchmarking purposes only.

Well, this is in fact my question: under which circumstances, if anybody knows, is this overhead minimal or maximal? As I mentioned above, I ran into a few examples, such as N-body & RadixSort in Particles from the NVIDIA_CUDA_SDK, where the overhead is minimal (N-body) or large (RadixSort). Nevertheless, I have failed to figure out the cause of this difference.

Any help on this will be most welcome.

The overhead is very, very large, like 2000x or more slower than the GPU. I wouldn’t recommend it as a CPU benchmark.

This is not quite true.

I have a few pieces of compute- and bandwidth-dominated CUDA code which are as fast in device emulation mode as their optimised CPU counterparts.

Moreover, some of my CUDA codes which have no CPU version are only 50-100x slower in device emulation mode than when run directly on the device.

Therefore, the claim that device emulation mode is 2000x slower than the GPU is not general. But I do have at least one example of a CUDA code which is ridiculously slow in device emulation mode (2000-10000x slower), so I am curious under which circumstances this occurs.

I profiled the CUDA codes in device emulation mode, and it does not seem that there is much CUDA overhead, but I am not sure whether the “-pg” flag is properly processed by the nvcc compiler. So I am quite puzzled about finding a solution to this problem.

Evghenii

Those high factors occur when the occupancy is high and your code is very well suited to a parallel implementation. Creating a zillion threads on the CPU will be much slower.
(And this doesn’t mean an optimal CPU implementation would be slow, just that the NVIDIA emulator is slow in this case, so it’s a bad benchmark.)

BTW, I remember someone did a more efficient C-level CUDA emulator, somewhere on the forums. Maybe it’s a good idea to look that up.

I’ve always thought that at most 768 threads can run in parallel, which means every batch will have the same overhead of creating at most 768 threads. In addition, I think it should not matter how many threads you have, as every thread will always come with its own start/exit overhead. So whether you are running 100 or a zillion threads, the performance impact should be the same percentage, in theory.

I’ve checked the CUDA codes, and all of them use a number of registers, an amount of shared memory, and a block size such that the occupancy is about 25-50%.

Could it be something much more subtle?

I’ll have a look through forums. Thanks for the tip!

Evghenii

One key kernel in my app:
GPU: ~2 ms
CPU: ~0.2 s
Emulation: ~1.0 s

So, in this particular case emulation is only 5 times slower than the CPU version (and about 500 times slower than the GPU).

Emulation mode is meant as a debugging tool, not as an efficient implementation for production code. I would guess that your examples that perform similarly to CPU-optimized code probably execute a very small grid => little thread overhead. In general, the emulator is going to be very slow compared to optimized CPU code. And the larger the grid size, the more overhead.

I don’t think the profiler will show the thread scheduling overhead, but I might be wrong. And you will get varying behavior on Linux and Windows. I have some threaded code that performs perfectly fine on Linux and Windows when threads = cores. For fun, if I increase the number of threads beyond the number of cores, the performance stays the same on Linux but decreases greatly on Windows: I presume due to inefficiencies in the Windows thread scheduler. With GPU emulation mode, hundreds to thousands of threads are being launched!

Just listen to us and never trust emulation mode as a reliable performance estimate. If writing CPU code just for benchmarking purposes is too time-consuming, then don’t do it. You can tell how efficiently you are using the device by calculating your effective GFLOP/s and GB/s rates.
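For example, something along these lines (the kernel here is just a placeholder; plug in the flop and byte counts your own kernel actually performs):

// Rough sketch of measuring effective GFLOP/s and GB/s with CUDA events.
// my_kernel, FLOPS_PER_THREAD and BYTES_PER_THREAD are placeholders for
// whatever your own kernel really does.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f + 1.0f;   // 2 flops, 8 bytes moved
}

int main()
{
    const int n = 1 << 22;
    const double FLOPS_PER_THREAD = 2.0;       // count for the kernel above
    const double BYTES_PER_THREAD = 8.0;       // one float read + one float write

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double s = ms * 1e-3;                      // elapsed time in seconds
    printf("effective GFLOP/s: %.2f\n", n * FLOPS_PER_THREAD / s * 1e-9);
    printf("effective GB/s:    %.2f\n", n * BYTES_PER_THREAD / s * 1e-9);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}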

Are these numbers for Windows or for Linux?

Funny! I was not aware of this Windows behaviour. For fun, I’ve just carried out a few tests and found that whether I have 2 threads (= cores) per block in device emulation mode or 128 (the total number of threads is 102656), the performance is the same, while if run on the device the performance drops by a factor of 4 in the case of 2 threads per block. The system is Debian, kernel 2.6.21, x86_64; I have no access to a Windows PC to check this.

So the threads hypothesis can be safely ruled out for the reason I described in my previous post, at least on Linux and for threads with a long enough execution time. Otherwise, the dominant contribution would come not from the execution of a thread but from the overhead of creating it. Please correct me if I am wrong.

— Update —

I found a small piece of my CUDA code which executes 102656 threads in blocks of 128, and the device_emulation/GPU ratio is about 1000. Each thread consists of just 100 flops!

It appears then, as was pointed out above, that the thread launch/exit can be a huge overhead in device emulation mode if the thread execution time is too short.
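For reference, the offending kernel has roughly this shape (a representative stand-in only, with the same launch configuration and about 100 flops per thread, not my actual code):

// Stand-in kernel: 102656 threads in blocks of 128, ~100 flops per thread.
#include <cuda_runtime.h>

__global__ void tiny_kernel(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = x[i];
    for (int k = 0; k < 50; ++k)       // 50 iterations * 2 flops = ~100 flops
        v = v * 1.0001f + 0.0001f;     // one mul + one add per iteration
    x[i] = v;
}

int main()
{
    const int n = 102656;              // 802 blocks of 128 threads
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));
    tiny_kernel<<<n / 128, 128>>>(d_x);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}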

Oh, I get about 100 GFLOP/s and 20-30 GB/s, so I am quite happy.

Evghenii

My benchmark was on Linux. Maybe I’ll try it on Windows tomorrow for fun.