Emulation on CPU = correct, execution on GPU = incorrect

I have been developing some CFD code on a Tesla C870.

To get the parallelisation right, I developed the algorithm in C on a parallel system using OpenMP, then ported the C code to CUDA and tried to run it on the GPU.

The original parallel C/OpenMP code gives the same results as the sequential version, only much faster.

The CUDA code gives a load of rubbish.

So I tried the emulation of the GPU on the CPU and the emulation gives the correct results.

So, why is the emulation on the CPU giving correct results but the execution on the GPU is not, and by not I mean the results are ridiculous?

It's unlikely that anyone can comment unless you provide a test app which reproduces the problem.

We have a Tesla C870 in-house…I found that the emulator performs sequential thread block execution…Maybe the Tesla execution has a race condition among thread blocks…


I am reluctant to release the source code, and I asked for suggestions as to why the CPU emulation worked but the GPU execution didn’t.


I was thinking that maybe a race condition existed between blocks, but the code has only one block, of size 400. So maybe a race between warps? Anyway, the code contains a few calls to __syncthreads to synchronise all threads before execution of the next stage of the calculation.

Because you have a race condition. Because you’re running double precision on the CPU but not on the GPU. Because you’re not initializing the GPU properly. I could go on, but seriously it’s completely impossible to give you any meaningful suggestions without source code.

You may need to modify your kernel to dump/return intermediate results (and/or comment away logic within your kernel) in an attempt to pinpoint the logic that has emulator/hardware inconsistency…

To add to this list:

  • Check the return codes from all CUDA functions. If there is a problem, for example in a device-to-host memcpy, then you'll see whatever junk was already in memory. A common issue is a problem with the CUDA device setup that prevents anything from running on the GPU; then all the CUDA functions return immediately with errors.
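A minimal sketch of that return-code checking, for illustration. The macro name `CUDA_CHECK` is my own; the SDK's cutil.h provides similar macros (mentioned later in the thread), and this uses only the standard CUDA runtime error API:

```cpp
// Hypothetical error-checking wrapper; requires the CUDA toolkit to build.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage sketch: wrap every runtime call, and check kernel launches too.
//   CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost));
//   myKernel<<<grid, block>>>(args);
//   CUDA_CHECK(cudaGetLastError());        // catches launch failures
//   CUDA_CHECK(cudaThreadSynchronize());   // catches execution failures
```

If device setup failed, the very first wrapped call will tell you so instead of silently returning stale memory.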

That’s the ticket. If there’s ever anything wrong in CUDA, /* */ everything till the problem goes away. Ahhh debugging in the 21st century B)

To add to the list: I've learned that integer right shift is treated differently in emu and on the GPU. (In emu, >>37 is the same as >>5.)

If you use the cutil.h macros from the SDK samples and run in [non-emu] Debug you get actual error messages when this occurs. It feels so decadent.

which depends entirely on your compiler, because you’re depending on undefined behavior…

yeah yeah. Next you’re gonna blame me for assuming CUDA uses two’s complement ints. Anyway, fact is fact. Emu and CUDA differ in this regard. No reason to have a discussion about it.

no, they differ if you’re using Visual Studio 2005. gcc and icc do what nvcc does in this regard. that you’re still harping about this “problem” is kind of ridiculous.


Where exactly do you get the idea this is just about Visual Studio?

For those interested, I have some good news (for me anyway).

I reduced the number of particles to 32 (1 warp), and the answers from the GPU matched those from the C/OpenMP version, so I am confident that the CUDA version is fundamentally sound, but I now recognise that there may be a thread synchronisation problem somewhere.

Good news for NVIDIA (and me) is that the GPU was much faster than the C/OpenMP version for this 32-particle problem.

So the next question is, how can I find out where the synchronisation problem is occurring? The fact that the GPU 32-particle solution equals that of the OpenMP version tells me that my synchronisation is not that far off, and that it could be something to do with my grid spec. I was declaring 1 grid of 1 block with 400 threads (= 400 particles).

What happens in that case, when I am using 1 C870?

How would declaring multiples of warps perform?

No, they differ between Visual Studio and gcc/icc already. Why are you so stubborn about not admitting that?

On NVIDIA hardware it is always doing the same thing, whether it be Linux, Mac or Windows. Or would you like NVIDIA to change it so that it works differently on Linux & Mac vs. Windows? Now there you will have people screaming!

When you only have 32 threads you do not need any synchronization…

So from what I can guess you are definitely missing a syncthreads somewhere.

Not necessarily. At the moment I have only one thread perform one function while the rest wait (I haven't yet thought about how that function can be parallelised). I thought the lack of syncthreads would have shown up in the emulation anyway.

__syncthreads synchronises threads in a block, not in a warp, is that correct? But threads are executed in warps, i.e. groups of 32?

So when working with 400 threads/particles in 1 block, if I place a __syncthreads call after a function call in a kernel, no thread should pass that __syncthreads until all other 399 have reached that particular __syncthreads? So with a maximum of 128 threads executing concurrently on a C870, i.e. 4 warps, with 400 threads not all threads are being computed concurrently.
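Yes, that is the block-wide semantics. A sketch of the staged pattern being described, for concreteness (names and stages are illustrative, not the poster's actual kernel):

```cpp
// Illustrative staged kernel for one block of 400 threads.
// Requires the CUDA toolkit to build; not the poster's code.
__global__ void stagedUpdate(const float *in, float *out, int n) // n = 400
{
    __shared__ float stage1[400];   // 1600 bytes, fits the C870's 16 KB
    int tid = threadIdx.x;

    // Stage 1: each thread produces its own particle's value.
    if (tid < n)
        stage1[tid] = in[tid] + 1.0f;

    // Block-wide barrier: no thread proceeds until ALL threads in the
    // block have reached it, regardless of which warp is resident.
    // Note it sits OUTSIDE the if: every thread must reach the same
    // __syncthreads, or the kernel deadlocks / misbehaves.
    __syncthreads();

    // Stage 2: now safe to read a neighbour's stage-1 result.
    if (tid < n)
        out[tid] = stage1[tid] + stage1[(tid + 1) % n];
}
```

If a __syncthreads like this is accidentally placed inside a divergent branch, the emulator's sequential block execution can hide the bug that the hardware exposes.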

I’ll try multiples of warps on both machines. I expect something interesting to happen with 5 warps.

This is just getting silly. Let’s review.

vc++: right-shift is not modulo

gcc/icc: right-shift is modulo 32

Emu mode: right-shift is modulo 32

NVIDIA hardware: right-shift is not modulo

Yes, vc++ happens to have the same behavior as NVIDIA hardware. (Probably because NVIDIA chip architects and Microsoft both realize something… but that's completely beside the point.)

NVIDIA hardware differs from Emu mode in this regard, which is exactly the topic of this thread. I just don’t understand why people came here to have this flame war over it. Are you trying to somehow back-justify this deficiency? It’s real. It’s gonna trip people up. Ironically, it’s going to trip gcc/icc programmers the most.

Btw, it's probably possible to fix Emu mode even if it continues to rely on gcc (or open64?) as its compilation engine. I think Emu could be fixed on both Windows and *nix, although it wouldn't be so bad to have it working on only one platform. Why would consistency matter at all, if that consistency is inconsistent with the thing that actually matters? Let me try to remind you: Emu mode is the only debugging environment in CUDA, and every bug better be reproducible in it! How do you not understand that any discrepancies are deadly?

Are you saying that emu mode does the same on Linux & Windows? As far as I understand, emu mode really does only what the underlying compiler on the platform does. So that also means that there is not much that can be done, as the underlying compiler is not under NVIDIA's control.

I know about debugging and the hell it currently is on CUDA, I bugged the NVIDIA people about it at NVISION with great pleasure.

Frankly, I don't know what Emu mode does. I'm using it on Vista, and it doesn't use the underlying compiler, it still uses open64. Bizarrely, it works one way on EmuDebug and another on EmuRelease. More bizarrely, the modes switched when I made a repro kernel.

In any case the issue actually has a simple fix using a C++ feature that should work whatever the compiler. You just have to overload the >> operator and feed it to a hardware-correct function whenever compiling Emu. This could be done on NVIDIA’s side.