Multicore CPU to emulate CUDA device
Utilize multi-core CPU to speed up emulation?

I am interested in sometimes running CUDA code in emulation mode for debugging and other purposes when CUDA hardware is not available. So far it appears that emulation does not take advantage of my multi-core CPU (I have a quad-core Xeon, and it is only 25% busy while running my CUDA program). I also tried the nvcc --multicore option, but this does not seem to be implemented yet? (I am using CUDA version 2.2.)

Any thoughts on taking advantage of my multi-core CPU when emulating?

try adding -g to the command line?
need more information

i added -g to the nvcc command line, this made no difference. what does -g do?

i don’t know what more to tell you… i just want to compile an emulated CUDA program that takes advantage of my quad core processor. right now, only one core of my CPU is used for emulation, and the others remain idle. would be nice to have a 4X speedup in some cases.

The CUDA emulation mode was not really designed to be efficient anyway…

Even our instruction-level simulator is often faster on a single core, and it also scales nicely on multi-cores (around a 3.6x speedup on 4 cores). :)

Worse than not being efficient, its performance isn’t correlated with CUDA GPU execution performance, so an algorithm that is faster than another in emulation mode may be slower in real GPU execution.

Emulation mode was conceived just to enable development, debugging (not totally!), and testing on computers without CUDA hardware, not for benchmarking algorithms or running real-world CUDA applications.

For myself: on a desktop computer, if you can afford it, buy any CUDA-enabled GPU (even a GeForce 8400 will do better than any CPU in emulation mode!), and on a notebook, choose one carefully so it can run CUDA natively (knowing that the GeForce 9400M IGP is largely enough for tests and development!).

And as I am a Mac guy, if your budget is tight, the White MacBook is really sufficient to develop CUDA programs :-)

i’m quite aware of the emulation performance limitations. i am not interested in benchmarking anything in emulation mode. i sometimes must work over a remote connection to a Windows machine, where the GPU is disabled, and emulation is my only option at that point. the kernels i am working with take 30 seconds to run in emulation (on the GPU they run in 30 msec). it is just inconvenient to wait 30 seconds for a kernel. since i have the extra horsepower in my CPU, i’d like to use it. it would be nice if there were a way to keep the remote connection from disabling the GPU; that would be my first choice.

First, the default emulator is multithreaded. It is not efficient (it literally launches one pthread per CUDA thread), but you might already be getting a speedup from your multicore setup. Second, a 4x speedup isn’t going to help you. At best, you will get down to ~8 seconds, which is still dramatically slower than the 30ms GPU time.

Rather than multithreading the code, you could try turning on optimizations in your compiler (-O3) and turning off debugging symbols (remove -g), which add overhead. You can get anywhere from a 1x to 20x performance improvement by doing this.
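Concretely, with the CUDA 2.2 toolchain the two builds might look something like this (the file and output names are placeholders; check `nvcc --help` for the exact flag spellings on your version):

```shell
# Debug emulation build: -g adds host debug symbols, which slow things down
nvcc -deviceemu -g -o mykernel_debug mykernel.cu

# Faster emulation build: optimize the host code and drop debug symbols
nvcc -deviceemu -O3 -o mykernel_fast mykernel.cu
```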

Otherwise, just go buy a cheap video card and put it in your dev machine. Even a $50 CUDA-capable card will be significantly faster than emulation.

Actually it seems to use user-level threads, which don’t take advantage of multi-core hardware at all.

I suspect this was done to reduce the overhead of the context switches.

Context switches are what make the performance of the emulation mode terrible. Running hundreds of pthreads (per block) with synchronization barriers every few instructions inside the inner loop (as most CUDA apps have) is not really going to be efficient…

That is why it is almost always slower than instruction-level emulation at the warp/block level as we do.

I like to boast that Barra is faster than the emulation mode, but it would be more accurate to say that the emulation mode is slow… ;)

But this doesn’t really help when 95% of the time is spent inside the pthread context switch library call. :)

For some reason I seem to remember actually getting a speedup using the emulator on a multicore machine, but I just tested it now and you are absolutely right. Only one OS thread at a time.

Yes, this would only matter if the thread bodies were much more significant than the context-switch overhead.

VNC does not disable the GPU because it does not replace your video driver when active like Remote Desktop does.

Has anyone tried MCUDA?
