Is emulation multithreaded? I suspect it is single-threaded

parallelis · August 26, 2008, 11:09pm

Hello,

While testing some CUDA 2.0 examples (scalarProd, for example) in emulation versus native mode, I noticed that my Core 2 Duo (2 cores) seems to be loaded at just over 50%.

It seems evident to me that, for ease of implementation, NVIDIA may have chosen to execute kernels on a single thread in emulation mode.
As I didn’t see this in the documentation, could anyone confirm or refute that emulation is single-threaded?

E.D_Riedijk · August 26, 2008, 11:57pm

As far as I know, yes, emulation is single-threaded. CUDA 2.1 will have a multi-core compiler target that lets you run your CUDA code on a multi-core CPU (although it may not be the same thing as device emulation). Or at least this was promised at NVISION :)

cbuchner1 · August 27, 2008, 10:38am

Is there a summary page showing what else was promised at NVISION? Those who couldn’t go are dying to know…

E.D_Riedijk · August 27, 2008, 2:48pm

When I get back home, I will try to write up a small report and put it up here.

Fuchs · August 27, 2008, 6:18pm

There is a paper about it:

http://www.crhc.uiuc.edu/IMPACT/ftp/report/impact-08-01-mcuda.pdf

So we will see pretty decent performance (better than a plain C implementation, but probably not better than an optimized library).

E.D_Riedijk · August 27, 2008, 9:09pm

I am not sure that this is the same thing. I believe NVIDIA did their own thing for this. MCUDA was mentioned in John Stone's talk, and if NVIDIA had incorporated MCUDA, I would have expected David Kirk to say so.

alex_dubinsky · August 28, 2008, 4:59am

Yeah, in order to sell CUDA to general developers (i.e., games, etc.), NVIDIA wants to be able to say “everything will work even if your user doesn’t have an NVIDIA card.” Hence CPU support is a priority. This won’t be like “emulation” mode, where you can debug, etc. This will be ordinary Release mode.

What I’m wondering is how far they’ll take it. Obviously they’ll do multicore. But in theory, they might be able to pull the same “Super-SIMD” trick on the CPU like they did on the GPU when they moved from 4-vectors to scalars. A 4-float SSE instruction would actually be applied to a warp with four threads. With four cores, you’d have 16 threads running simultaneously. It’d be optimized to the max and all you have to do is program to the CUDA model (which is way easier than messing with threads and SSE intrinsics!). At least… that’s what I hope.
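To make the “four threads in one SSE instruction” idea concrete, here is a minimal portable sketch. It is not NVIDIA's or MCUDA's implementation. The names `Lanes`, `mul`, `add`, and `saxpy_warps` are hypothetical, and the `Lanes` type only stands in for a 4-float SSE register; on x86 each `mul`/`add` call would correspond to a single packed instruction. The kernel body is written once, per thread, and executed for four logical threads at a time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one 4-float SIMD register.
// Lane l holds the value belonging to logical thread l of a 4-thread "warp".
constexpr std::size_t kWarpSize = 4;
using Lanes = std::array<float, kWarpSize>;

// One "instruction" applied to all four logical threads at once.
// On x86 this would map to a single packed SSE multiply.
inline Lanes mul(const Lanes& a, const Lanes& b) {
    Lanes r{};
    for (std::size_t l = 0; l < kWarpSize; ++l) r[l] = a[l] * b[l];
    return r;
}

// Same idea for a packed add.
inline Lanes add(const Lanes& a, const Lanes& b) {
    Lanes r{};
    for (std::size_t l = 0; l < kWarpSize; ++l) r[l] = a[l] + b[l];
    return r;
}

// Per-thread kernel body: out[i] = a * x[i] + y[i].
// Here it is executed warp by warp, four logical threads per step.
std::vector<float> saxpy_warps(float a,
                               const std::vector<float>& x,
                               const std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<float> out(n);
    const Lanes va = {a, a, a, a};  // scalar broadcast to every lane

    for (std::size_t base = 0; base < n; base += kWarpSize) {
        Lanes vx{}, vy{};
        // Load: lanes past the end of the data stay inactive (zero).
        for (std::size_t l = 0; l < kWarpSize; ++l) {
            if (base + l < n) { vx[l] = x[base + l]; vy[l] = y[base + l]; }
        }
        const Lanes r = add(mul(va, vx), vy);
        // Store: only active lanes write their result back.
        for (std::size_t l = 0; l < kWarpSize; ++l) {
            if (base + l < n) out[base + l] = r[l];
        }
    }
    return out;
}
```

With four cores each running such a loop over its own share of the warps, you would get the 16 simultaneous threads mentioned above. Divergent branches inside the kernel body would need per-lane masking, which this sketch leaves out.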

_Big_Mac · August 28, 2008, 12:57pm

The paper describes the approach pretty well it seems (I’ve only started reading it).

They indeed plan to use CPU SIMD instruction sets.

MisterAnderson42 · August 28, 2008, 1:46pm

I had a chance to meet the guys behind MCUDA at NVISION. NVIDIA’s nvcc --multicore is very similar, but quoting John Stratton: “They found a few optimizations we missed”. So NVIDIA’s implementation should be even better.

MCUDA and nvcc --multicore were developed separately, but with a certain amount of collaboration between the university and NVIDIA. I know most of the people at NVIDIA would love to share all the details, but they are limited in what sort of information they can release, especially about upcoming features and proprietary information. Universities aren’t so limited, thus we have the MCUDA paper :)

alex_dubinsky · August 28, 2008, 5:04pm

They talk about SSE and SIMD in the introduction (“Thread blocks often have very regular control flow patterns among constituent logical threads, making it likely that the SIMD instructions common in current x86 processors can be effectively used in many cases”), but it doesn’t sound like exactly what I’m saying. Sounds close, though. Anyway, they don’t talk about it again and the use of SSE in MCUDA right now is limited to the C compiler autovectorizing loops.
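For readers who have not opened the paper, here is a rough sketch of the kind of C code a CUDA-to-CPU translation could produce. It assumes a loop-based scheme in which the logical threads of a block become iterations of ordinary loops, split at each barrier. It is an illustration only, not code from MCUDA or nvcc. The kernel, `kBlockDim`, `run_block`, and `dot` are all hypothetical. The first loop is the kind of regular, branch-free loop a C compiler can autovectorize with SSE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlockDim = 8;  // hypothetical threads per block

// CPU version of one thread block of a hypothetical dot-product kernel.
// Per-thread CUDA logic being imitated:
//   shared[tid] = a[i] * b[i];
//   __syncthreads();
//   if (tid == 0) partial = shared[0] + ... + shared[blockDim - 1];
float run_block(std::size_t block_idx,
                const std::vector<float>& a,
                const std::vector<float>& b) {
    float shared[kBlockDim] = {};  // stands in for __shared__ memory

    // Thread loop 1: the code before the barrier, run for every logical thread.
    // Iterations are independent, so the compiler may vectorize this loop.
    for (std::size_t tid = 0; tid < kBlockDim; ++tid) {
        const std::size_t i = block_idx * kBlockDim + tid;
        shared[tid] = (i < a.size()) ? a[i] * b[i] : 0.0f;
    }
    // Finishing the loop plays the role of the barrier: every logical
    // thread has written its slot before any code after it runs.

    // Code after the barrier: only thread 0 does work, so no loop over
    // tid is needed here.
    float partial = 0.0f;
    for (std::size_t k = 0; k < kBlockDim; ++k) partial += shared[k];
    return partial;
}

// The grid becomes a loop over blocks. Blocks are independent, so a
// multi-core version could hand different block indices to different cores.
float dot(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t grid_dim = (a.size() + kBlockDim - 1) / kBlockDim;
    float total = 0.0f;
    for (std::size_t block = 0; block < grid_dim; ++block) {
        total += run_block(block, a, b);
    }
    return total;
}
```

Note that the vectorization here happens across loop iterations chosen by the C compiler, which is close to, but not the same as, explicitly packing four logical threads into each SSE instruction.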

alex_dubinsky · August 28, 2008, 5:10pm

In fact, I’m not sure their approach is suited at all to my “four threads in one SSE instruction” idea. It would be incredibly difficult to do that starting from C code (you basically have to write your own compiler). Translating PTX assembly to SSE assembly could be pretty feasible, however.
