I was thinking that maybe a race condition existed between blocks, but the code only uses one block of 400 threads. So maybe a race between warps? In any case, the code contains a few __syncthreads calls to synchronise all threads before execution of the next stage of the calculation.
Because you have a race condition. Because you’re running double precision on the CPU but not on the GPU. Because you’re not initializing the GPU properly. I could go on, but seriously it’s completely impossible to give you any meaningful suggestions without source code.
Check the return codes from all CUDA functions. If there is a problem, for example in a device-to-host memcpy, then you’ll see whatever junk was already in memory. A common issue is there is a problem with the CUDA device setup that prevents anything from running on the GPU. Then all the CUDA functions will return immediately with errors.
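A minimal error-checking sketch along those lines (the `CUDA_CHECK` macro name is my own; only the runtime calls themselves are real CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so a failure is reported instead of silently
// leaving junk in host buffers.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                    \
                    __FILE__, __LINE__, cudaGetErrorString(err));           \
            exit(EXIT_FAILURE);                                            \
        }                                                                   \
    } while (0)

// Usage sketch:
//   CUDA_CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));
//   myKernel<<<grid, block>>>(d_out);
//   CUDA_CHECK(cudaGetLastError());   // catches launch/setup failures too
```

Note that kernel launches themselves return nothing, which is why the `cudaGetLastError()` call after the launch matters: a bad device setup often shows up only there.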
For those interested, I have some good news (for me anyway).
I reduced the number of particles to 32 (1 warp), and the answers from the GPU matched those from the C/omp version, so I am confident that the CUDA version is fundamentally sound, but I now recognise that there may be a thread synchronisation problem somewhere.
The good news for NVIDIA (and me) is that the GPU was much faster than the C/omp version for this 32-particle problem.
So the next question is: how can I find out where the synchronisation problem is occurring? The fact that the 32-particle GPU solution equals that of the omp version tells me that my synchronisation is not wildly incorrect, and that it could be something to do with my grid spec. I was declaring 1 grid of 1 block with 400 threads (= 400 particles).
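For reference, a sketch of the launch configuration just described (the kernel name is hypothetical):

```cuda
__global__ void particleKernel(float *particles) { /* ... */ }

void launch(float *d_particles) {
    const int N = 400;                      // particle count from the post
    particleKernel<<<1, N>>>(d_particles);  // 1 grid, 1 block, 400 threads

    // 400 is not a multiple of the 32-thread warp size, so the block is
    // scheduled as 13 warps, with 16 inactive lanes in the last warp.
    // Splitting into several warp-aligned blocks would look like:
    //   const int threadsPerBlock = 128;                               // 4 warps
    //   const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock; // = 4
    // but note __syncthreads only synchronises within a block, so any
    // cross-particle dependency would then need separate kernel launches.
}
```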
What happens in that case, when I am using 1 C870?
No, they differ between Visual Studio & gcc/icc already. Why are you so stubborn as to not admit that?
On NVIDIA hardware it always does the same thing, whether it be Linux, Mac or Windows. Or would you like NVIDIA to change it so that it works differently on Linux & Mac vs. Windows? Now there you will have people screaming!
Not necessarily. At the moment I have only one thread perform one function while the rest wait (I haven't yet thought about how that function can be parallelised). I thought the lack of __syncthreads would show up in emulation anyway.
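A sketch of that "one thread works, the rest wait" pattern (the serial helper and all names here are hypothetical, not the poster's code):

```cuda
// The not-yet-parallelised part, run by a single thread.
__device__ float serialSum(const float *d, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += d[i];
    return s;
}

__global__ void normalise(const float *in, float *out, int n) {
    __shared__ float total;
    if (threadIdx.x == 0)
        total = serialSum(in, n);   // only thread 0 does the serial work
    __syncthreads();                // without this barrier, other threads could
                                    // read 'total' before thread 0 has set it
    if (threadIdx.x < n)
        out[threadIdx.x] = in[threadIdx.x] / total;
}
```

On hardware, omitting that barrier is a genuine race; in emulation, where threads are run one after another by the host, the bug can easily stay hidden.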
__syncthreads synchronises threads in a block, not in a warp — is that correct? But threads are executed in warps, i.e. groups of 32?
So when working with 400 threads/particles in 1 block, if I place a __syncthreads call after a function call in a kernel, no thread should pass that __syncthreads until all other 399 have reached it? And with a maximum of 128 threads executing concurrently on a C870, i.e. 4 warps, not all 400 threads are being computed concurrently.
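That is the barrier semantics in a nutshell. A sketch of why it matters when only some warps run at a time (buffer and values here are illustrative, not the poster's kernel):

```cuda
__global__ void twoStage(float *buf) {
    int i = threadIdx.x;                      // 0..399 for a 400-thread block

    buf[i] = 2.0f * i;                        // stage 1: each thread writes its slot
    __syncthreads();                          // no thread continues until all 400
                                              // have arrived, regardless of which
                                              // warps the scheduler ran first
    float right = buf[(i + 1) % blockDim.x];  // stage 2: reading another thread's
    buf[i] += right;                          // slot is only safe after the barrier
}
```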
I’ll try multiples of warps on both machines. I expect something interesting to happen with 5 warps.
Yes, vc++ happens to have the same behavior as NVIDIA hardware. (Probably because NVIDIA chip architects and Microsoft both realize something… but that's completely beside the point.)
NVIDIA hardware differs from Emu mode in this regard, which is exactly the topic of this thread. I just don’t understand why people came here to have this flame war over it. Are you trying to somehow back-justify this deficiency? It’s real. It’s gonna trip people up. Ironically, it’s going to trip gcc/icc programmers the most.
Btw, it's probably possible to fix Emu mode even if it continues to rely on gcc (or open64?) as its compilation engine. I think Emu could be fixed on both Windows and *nix, although it wouldn't be so bad to have it working on only one platform. Why would consistency matter at all, if that consistency is inconsistent with the thing that actually matters? Let me try to remind you: Emu mode is the only debugging environment in CUDA, and every bug had better be reproducible in it! How do you not understand that any discrepancies are deadly?
Are you saying that emu mode does the same on Linux & Windows? As far as I understand, emu mode really only does what the underlying compiler on the platform does. So that also means there is not much that can be done, as the underlying compiler is not under NVIDIA's control.
I know about debugging and the hell it currently is on CUDA, I bugged the NVIDIA people about it at NVISION with great pleasure.
Frankly, I don't know what Emu mode does. I'm using it on Vista, and it doesn't use the underlying compiler, it still uses open64. Bizarrely, it works one way in EmuDebug and another in EmuRelease. More bizarrely, the modes switched behaviour when I made a repro kernel.
In any case, the issue actually has a simple fix using a C++ feature that should work whatever the compiler: overload the >> operator and route it through a hardware-correct function whenever compiling for Emu. This could be done on NVIDIA's side.
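A sketch of what that overload could look like, assuming the discrepancy is `>>` with shift counts at or above the word width (undefined behaviour in C++, but a fixed result on the GPU). The wrapper type and the clamp-to-zero rule are my assumptions for illustration, not NVIDIA's documented semantics:

```cuda
#ifdef __DEVICE_EMULATION__
// Wrapper around unsigned int, used only when building for Emu mode.
struct emu_u32 {
    unsigned int v;
    emu_u32(unsigned int x) : v(x) {}
    operator unsigned int() const { return v; }
};

// Route '>>' through a function that applies the hardware rule explicitly,
// so emulation no longer depends on what the host compiler happens to emit.
inline emu_u32 operator>>(emu_u32 a, unsigned int s) {
    return emu_u32(s >= 32u ? 0u : a.v >> s);   // assumed clamping behaviour
}
#endif
```

Substituting `emu_u32` for `unsigned int` in device code under `__DEVICE_EMULATION__` would make the emulated result match hardware on gcc, icc and vc++ alike.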