Emulation/CPU = correct, Execution/GPU = incorrect

I have been developing some CFD code on a Tesla C870.

To get the parallelisation correct I developed the algorithm in C on a parallel system using OpenMP, then converted the C code into CUDA and tried to run it on the GPU.

The original parallel C/OpenMP code gives the same result as the sequential version, only much faster.

The CUDA code gives a load of rubbish.

So I tried the emulation of the GPU on the CPU and the emulation gives the correct results.

So why is the emulation on the CPU giving correct results while the execution on the GPU is not? And by "not" I mean the results are ridiculous.

It's unlikely that anyone can comment unless you provide a test app which reproduces the problem.

We have a Tesla C870 in-house…I found that the emulator performs sequential thread block execution…Maybe the Tesla execution has a race condition among thread blocks…

@netllama

I am reluctant to release the source code, and I asked for suggestions as to why the CPU emulation worked but the GPU execution didn’t.

@waswas

I was thinking that maybe a race condition existed between blocks, but the code only has one block, of size 400. So maybe a race between warps? Anyway, the code contains a few calls to __syncthreads() to synchronise all threads before execution of the next stage of the calculation.

Because you have a race condition. Because you’re running double precision on the CPU but not on the GPU. Because you’re not initializing the GPU properly. I could go on, but seriously it’s completely impossible to give you any meaningful suggestions without source code.

You may need to modify your kernel to dump/return intermediate results (and/or comment out logic within your kernel) in an attempt to pinpoint the logic that behaves inconsistently between the emulator and the hardware…
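For example, something along these lines (purely illustrative; the kernel, the debug buffer, and all names are made up, not the original CFD code):

    // Hypothetical debug pattern: stream an intermediate quantity out of the
    // kernel so it can be diffed against the CPU/OpenMP reference stage by stage.
    __global__ void stage1_kernel(float *pos, float *vel, float *dbg, int n)
    {
        int i = threadIdx.x;
        if (i < n) {
            float f = vel[i] * 0.5f;   // stand-in for the real stage-1 work
            pos[i] += f;
            dbg[i] = f;                // intermediate value we want to inspect
        }
    }

    // Host side (sketch): copy the debug buffer back and compare element-wise.
    //   stage1_kernel<<<1, 400>>>(d_pos, d_vel, d_dbg, 400);
    //   cudaMemcpy(h_dbg, d_dbg, 400 * sizeof(float), cudaMemcpyDeviceToHost);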

To add to this list:

  • Check the return codes from all CUDA functions. If something fails, for example a device-to-host memcpy, you'll just see whatever junk was already in host memory. A common issue is a problem with the CUDA device setup that prevents anything from running on the GPU at all; then every CUDA function returns immediately with an error. A minimal checking pattern is sketched below.
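Something like this catches those errors early (a minimal sketch; the helper and the kernel are placeholders, not anyone's actual code):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Illustrative helper: abort with a readable message if a CUDA call failed.
    static void check(cudaError_t err, const char *what)
    {
        if (err != cudaSuccess) {
            fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    int main()
    {
        float *d_buf = 0;
        check(cudaMalloc((void **)&d_buf, 400 * sizeof(float)), "cudaMalloc");

        // my_kernel<<<1, 400>>>(d_buf);                     // launch goes here
        check(cudaGetLastError(), "kernel launch");          // catches bad launch configs
        check(cudaThreadSynchronize(), "kernel execution");  // catches errors from the kernel itself

        check(cudaFree(d_buf), "cudaFree");
        return 0;
    }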

That’s the ticket. If there’s ever anything wrong in CUDA, /* */ everything till the problem goes away. Ahhh debugging in the 21st century B)

To add to the list: I've learned that integer right shift is treated differently in emu mode and on the GPU. (In emu, >>37 is the same as >>5.)
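A small repro along these lines shows it (just a sketch, not the code I hit it in; the shift amount is passed in at run time so the host compiler cannot fold the shift away):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Shifting a 32-bit int by 32 or more is undefined behaviour in C, so the
    // result depends on what the compiler/hardware actually does: a modulo-32
    // shifter turns >>37 into >>5, a clamping shifter shifts everything out.
    __global__ void shift_test(unsigned int x, int amount, unsigned int *out)
    {
        out[0] = x >> amount;
    }

    int main()
    {
        unsigned int *d_out, h_out;
        cudaMalloc((void **)&d_out, sizeof(unsigned int));

        shift_test<<<1, 1>>>(0xFFFFFFFFu, 37, d_out);
        cudaMemcpy(&h_out, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);

        // Emu mode (per the post above): 0x07FFFFFF, i.e. the same as >>5.
        // Hardware (per the post above): the shift amount is not taken modulo 32.
        printf("0xFFFFFFFF >> 37 = 0x%08X\n", h_out);

        cudaFree(d_out);
        return 0;
    }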

If you use the cutil.h macros from the SDK samples and run in [non-emu] Debug you get actual error messages when this occurs. It feels so decadent.

which depends entirely on your compiler, because you’re depending on undefined behavior…

yeah yeah. Next you’re gonna blame me for assuming CUDA uses two’s complement ints. Anyway, fact is fact. Emu and CUDA differ in this regard. No reason to have a discussion about it.

no, they differ if you’re using Visual Studio 2005. gcc and icc do what nvcc does in this regard. that you’re still harping about this “problem” is kind of ridiculous.

They differ IF YOU’RE USING NVIDIA HARDWARE.

Where exactly do you get the idea this is just about Visual Studio?

For those interested, I have some good news (for me anyway).

I reduced the number of particles to 32 (1 warp), and the answers from the GPU matched those from the C/OpenMP version, so I am confident that the CUDA version is fundamentally sound, but I now recognise that there may be a thread synchronisation problem somewhere.

The good news for NVIDIA (and me) is that the GPU was much faster than the C/OpenMP version for this 32-particle problem.

So the next question is, how can I find out where the synchronisation problem is occurring? The fact that the 32-particle GPU solution equals that of the OpenMP version tells me that my synchronisation is not completely wrong, and that it could be something to do with my grid spec. I was declaring 1 grid of 1 block with 400 threads (= 400 particles).

What happens in that case, when I am using 1 C870?

How would declaring multiples of warps perform?
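A sketch of what "multiples of warps" could look like (names are placeholders, not the actual CFD code):

    // Round the block size up to a whole number of 32-thread warps and guard
    // the extra threads inside the kernel.
    const int nParticles = 400;
    const int blockSize  = ((nParticles + 31) / 32) * 32;   // 416 threads = 13 warps

    // cfd_kernel<<<1, blockSize>>>(d_pos, d_vel, nParticles);
    //
    // ...and inside cfd_kernel:
    //     int i = threadIdx.x;
    //     if (i < nParticles) { /* do the per-particle work */ }
    //     __syncthreads();   // reached by every thread, padding threads included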

No, they differ between Visual Studio and gcc/icc already. Why are you so stubborn about not admitting that?

On NVIDIA hardware it always does the same thing, whether it be Linux, Mac or Windows. Or would you like NVIDIA to change it so that it works differently on Linux and Mac vs. Windows? Now there you will have people screaming!

When you only have 32 threads you do not need any synchronization…

So from what I can guess, you are definitely missing a __syncthreads() somewhere.

Not necessarily. At the moment I have only one thread perform one function while the rest wait (I haven't yet thought about how that function can be parallelised). I thought the lack of __syncthreads() would show up in the emulation anyway.

__syncthreads() synchronises the threads in a block, not just in a warp, is that correct? But threads are executed in warps, i.e. groups of 32?

So when I am working with 400 threads/particles in 1 block and I place a __syncthreads() call after a function call in the kernel, no thread should pass that __syncthreads() until all other 399 have reached that particular __syncthreads()? And with a maximum of 128 threads executing concurrently on a C870, i.e. 4 warps, with 400 threads not all threads are being computed concurrently.
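In other words, the pattern I'm relying on looks roughly like this (a sketch with a made-up shared-memory hand-off, not my actual kernel):

    // Typical block-wide barrier pattern: stage 1 writes, every thread waits at
    // the barrier, stage 2 reads what other threads wrote. __syncthreads() is a
    // barrier for ALL threads of the block (here 400), not just for one warp.
    __global__ void two_stage(float *out, int n)
    {
        __shared__ float buf[400];
        int i = threadIdx.x;

        if (i < n)
            buf[i] = i * 0.5f;            // stage 1: each thread writes its own slot

        __syncthreads();                  // no thread continues until all 400 got here

        if (i < n)
            out[i] = buf[(i + 1) % n];    // stage 2: safe to read a neighbour's slot
    }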

I’ll try multiples of warps on both machines. I expect something interesting to happen with 5 warps.

This is just getting silly. Let’s review.

vc++: right-shift is not modulo

gcc/icc: right-shift is modulo 32

Emu mode: right-shift is modulo 32

NVIDIA hardware: right-shift is not modulo

Yes, vc++ happens to have the same behavior as NVIDIA hardware. (Probably because NVIDIA chip architects and Microsoft both realize something… but that's completely beside the point.)

NVIDIA hardware differs from Emu mode in this regard, which is exactly the topic of this thread. I just don't understand why people came here to have this flame war over it. Are you trying to somehow back-justify this deficiency? It's real. It's gonna trip people up. Ironically, it's going to trip up gcc/icc programmers the most.

Btw, it's probably possible to fix Emu mode even if it continues to rely on gcc (or open64?) as its compilation engine. I think Emu could be fixed on both Windows and *nix, although it wouldn't be so bad to have it working on only one platform. Why would consistency matter at all, if that consistency is inconsistent with the thing that actually matters? Let me try to remind you: Emu mode is the only debugging environment in CUDA, and every bug had better be reproducible in it! How do you not understand that any discrepancies are deadly?

Are you saying that emu mode does the same thing on Linux and Windows? As far as I understand, emu mode really just does what the underlying compiler on the platform does. So that also means there is not much that can be done, as the underlying compiler is not under NVIDIA's control.

I know about debugging and the hell it currently is on CUDA; I bugged the NVIDIA people about it at NVISION with great pleasure.

Frankly, I don't know what Emu mode does. I'm using it on Vista, and it doesn't use the underlying compiler, it still uses open64. Bizarrely, it works one way in EmuDebug and another in EmuRelease. More bizarrely, the modes switched when I made a repro kernel.

In any case, the issue actually has a simple fix using a C++ feature that should work whatever the compiler: overload the >> operator so that it calls a hardware-correct shift function whenever compiling for Emu. This could be done on NVIDIA's side.
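Roughly like this (a rough sketch only; built-in ints cannot have their >> replaced, so a thin wrapper type has to stand in, and the clamping behaviour assumed here is the hardware behaviour described earlier in the thread):

    // Illustrative wrapper for emu-mode builds only. __DEVICE_EMULATION__ is
    // defined by nvcc when compiling with -deviceemu.
    #ifdef __DEVICE_EMULATION__
    struct emu_uint {
        unsigned int v;
        emu_uint(unsigned int x) : v(x) {}
        operator unsigned int() const { return v; }
    };

    // Mimic the hardware behaviour described above: shift counts of 32 or more
    // shift everything out instead of being taken modulo 32 by the host CPU.
    inline unsigned int operator>>(emu_uint a, int amount)
    {
        if (amount >= 32) return 0u;
        return a.v >> amount;
    }
    #else
    typedef unsigned int emu_uint;   // on the real device, a plain unsigned int
    #endif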