Multicore CPU to emulate CUDA device
Utilize multi-core CPU to speed up emulation?

I am interested in sometimes running CUDA code in emulation mode for debugging and other purposes when CUDA hardware is not available. So far it appears that emulation does not take advantage of my multi-core CPU (I have a quad-core Xeon, and it is only 25% busy while running my CUDA program). I also tried the nvcc --multicore option, but this does not seem to be implemented yet? (I am using CUDA version 2.2.)

Any thoughts on taking advantage of my multi-core CPU when emulating?

try adding -g to the command line?
need more information

i added -g to the nvcc command line, this made no difference. what does -g do?

i don’t know what more to tell you… i just want to compile an emulated CUDA program that takes advantage of my quad core processor. right now, only one core of my CPU is used for emulation, and the others remain idle. would be nice to have a 4X speedup in some cases.

The CUDA emulation mode was not really designed to be efficient anyway…

Even our instruction-level simulator is often faster on a single core, and it also scales nicely on multi-cores (around a 3.6x speedup on 4 cores). :)

Worse than not being efficient, its performance isn’t correlated with CUDA GPU execution performance, so an algorithm that is faster than another in emulation mode may be slower in real GPU execution.

Emulation mode was conceived just to enable development, debugging (not totally!), and testing on computers without CUDA hardware, not for benchmarking algorithms or running real-world CUDA applications.

For myself: on a desktop computer, if you can afford it, buy any CUDA-enabled GPU (even a GeForce 8400 will do better than any CPU in emulation mode!), and on a notebook, choose one carefully so it can run CUDA natively (knowing that the GeForce 9400M IGP is largely enough for tests and development!).

And as I am a Mac guy, if your budget is tight, the White MacBook is really sufficient to develop CUDA programs :-)

i’m quite aware of the emulation performance limitations. i am not interested in benchmarking anything in emulation mode. i sometimes must work over a remote connection to a Windows machine, where the GPU is disabled, and emulation is my only option at that point. the kernels i am working with take 30 seconds to run in emulation (on the GPU they run in 30 msec). it is just inconvenient to wait 30 seconds for a kernel. since i have the extra horsepower in my CPU, i’d like to use it. it would be nice if there were a way to keep the remote connection from disabling the GPU; that would be my first choice.

First, the default emulator is multithreaded. It is not efficient (it literally launches one pthread per CUDA thread), but you might already be getting a speedup from your multicore setup. Second, a 4x speedup isn’t going to help you. At best, you will get down to ~8 seconds, which is still dramatically slower than the 30ms GPU time.

Rather than multithreading the code, you could try turning on optimizations in your compiler (-O3) and turning off debugging symbols (remove -g), which add overhead. You can get anywhere from a 1x to 20x performance improvement by doing this.
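Concretely, with the CUDA 2.2 toolchain the two builds might look something like this (the file and output names are placeholders; check `nvcc --help` for the exact flag spellings on your version):

```shell
# Debug emulation build: -g adds host debug symbols, which slow things down
nvcc -deviceemu -g -o mykernel_debug mykernel.cu

# Faster emulation build: optimize the host code and drop debug symbols
nvcc -deviceemu -O3 -o mykernel_fast mykernel.cu
```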

Otherwise, just go buy a cheap video card and put it in your dev machine. Even a $50 CUDA-capable card will be significantly faster than emulation.

Actually it seems to use user-level threads, which don’t take advantage of multi-core hardware at all.

I suspect this was done to reduce the overhead of the context switches.

Context switches are what make the performance of the emulation mode terrible. Running hundreds of pthreads (per block) with synchronization barriers every few instructions inside the inner loop (as most CUDA apps have) is not really going to be efficient…

That is why it is almost always slower than instruction-level emulation at the warp/block level as we do.

I like to boast that Barra is faster than the emulation mode, but it would be more accurate to say that the emulation mode is slow… ;)

But this doesn’t really help when 95% of the time is spent inside the pthread context switch library call. :)

For some reason I seem to remember actually getting a speedup using the emulator on a multicore machine, but I just tested it now and you are absolutely right. Only one OS thread at a time.

Yes, this would only matter if the thread bodies were much more significant than the context-switch overhead.

VNC does not disable the GPU because it does not replace your video driver when active like Remote Desktop does.

Has anyone tried MCUDA?
