Is emulation multithreaded? I suspect it's single-threaded


While testing some of the CUDA 2.0 SDK examples (scalarProd, for instance) in emulation mode versus native execution, I noticed that my Core 2 Duo (2 cores) is only loaded at just over 50%.

For ease of implementation, it seems plausible that NVIDIA chose to execute kernels single-threaded under emulation.
Since I didn't find anything about this in the documentation, could anyone confirm or deny that emulation is single-threaded?

As far as I know, yes indeed. CUDA 2.1 will have a multi-core compiler target which allows you to run your CUDA code on a multi-core CPU (although it is possibly not the same as device emulation). Or at least this was promised at NVISION :)

Is there a summary page showing what else was promised at NVISION? Those who couldn’t go are dying to know…

When I get back home, I will try to write up a small report and put it up here.

There is a paper about it:


So we will see pretty decent performance. (Better than a native C implementation, but probably not better than an optimized library.)

I am not sure that this is the same thing. I believe NVIDIA did their own thing for this; MCUDA was mentioned in John Stone's talk, and if NVIDIA had incorporated MCUDA I would have expected David Kirk to say so.

Yeah, in order to sell CUDA to general developers (ie, games, etc), nvidia wants to be able to say “everything will work even if your user doesn’t have an NVIDIA card.” Hence CPU support is a priority. This won’t be like “emulation” mode where you can debug etc. This will be ordinary Release mode.

What I’m wondering is how far they’ll take it. Obviously they’ll do multicore. But in theory, they might be able to pull the same “Super-SIMD” trick on the CPU like they did on the GPU when they moved from 4-vectors to scalars. A 4-float SSE instruction would actually be applied to a warp with four threads. With four cores, you’d have 16 threads running simultaneously. It’d be optimized to the max and all you have to do is program to the CUDA model (which is way easier than messing with threads and SSE intrinsics!). At least… that’s what I hope.

The paper describes the approach pretty well it seems (I’ve only started reading it).

They indeed plan to use CPU SIMD instruction sets.

I had a chance to meet the guys behind MCUDA at NVISION. NVIDIA’s nvcc --multicore is very similar, but quoting John Stratton: “They found a few optimizations we missed”. So NVIDIA’s implementation should be even better.

MCUDA and nvcc --multicore were developed separately, but with a certain amount of collaboration between the university and NVIDIA. I know most of the people at NVIDIA would love to share all the details, but they are limited in what sort of information they can release, especially about upcoming features and proprietary information. Universities aren't so limited; thus we have the MCUDA paper :)

They talk about SSE and SIMD in the introduction (“Thread blocks often have very regular control flow patterns among constituent logical threads, making it likely that the SIMD instructions common in current x86 processors can be effectively used in many cases”), but it doesn’t sound like exactly what I’m saying. Sounds close, though. Anyway, they don’t talk about it again and the use of SSE in MCUDA right now is limited to the C compiler autovectorizing loops.

In fact I'm not sure their approach will be at all suited to my "four threads in one SSE instruction" idea. It would be incredibly difficult to do that starting from C code (you basically have to write your own compiler). Translating PTX assembly to SSE assembly, however, could be quite feasible.