We are using CUDA Fortran for CFD (SPH) and are having trouble at runtime on some machines but not on others.
The host code appears to wait forever between kernels (we use separate kernels as a natural synchronization barrier).
We get no CUDA errors from the kernels (as I said, on some machines the code executes flawlessly).
We are using only the default stream (stream 0, I believe). Adding a call to cudaDeviceSynchronize() causes the host code to wait forever at that point, and even without it the program ends up stopping somewhere.
Has anybody run into the same problem? Honestly, we are quite puzzled and cannot continue. As there are no compilation errors or runtime errors, it is impossible for us to fix this.
Same GPU architecture (Fermi) with compute capability 2.0.
Same PGI version, as we compile on one machine and then deploy and test on several machines. We are using the very latest, 12.5.
All machines are updated to the latest NVIDIA drivers. The only difference is that one is a laptop and the other two are desktops (drivers 301.27 and 301.32).
Yes, the program stops at the same point on both machines where it fails. The puzzling thing is that it stops and waits BETWEEN kernels, as if waiting for synchronization.
It only works on the laptop, where the only difference is that it has a less powerful GPU and it shares the memory.
Is it possible for us to get the binary or source? I think we’ll need to do some low-level digging. To my knowledge we haven’t seen this behavior before. If sending us either source or binary is possible, mail it to trs@pgroup.com.
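In the meantime, one way to narrow down where things go wrong is to check the status after every kernel launch and again after an explicit synchronization. Below is a minimal sketch of that pattern, not your code: the module `sph_kernels`, the kernel `scale_kernel`, and the program `check_launch` are hypothetical stand-ins for a real SPH kernel. Note that a kernel that never terminates will still block in cudaDeviceSynchronize(), but putting the check after each launch at least shows which kernel was the last one to complete.

```fortran
module sph_kernels
  use cudafor
  implicit none
contains
  ! Hypothetical stand-in for a real SPH kernel: doubles every element.
  attributes(global) subroutine scale_kernel(x, n)
    integer, value :: n
    real :: x(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    ! Guard against threads past the end of the array.
    if (i <= n) x(i) = 2.0 * x(i)
  end subroutine scale_kernel
end module sph_kernels

program check_launch
  use cudafor
  use sph_kernels
  implicit none
  integer, parameter :: n = 1024
  real :: x(n)
  real, device, allocatable :: x_d(:)
  integer :: istat

  x = 1.0
  allocate(x_d(n))
  x_d = x                       ! host-to-device copy

  call scale_kernel<<<(n + 255) / 256, 256>>>(x_d, n)

  ! Errors detected at launch time (bad configuration, etc.).
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) then
    print *, 'launch error: ', trim(cudaGetErrorString(istat))
    stop
  end if

  ! Errors raised while the kernel was running show up here.
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) then
    print *, 'execution error: ', trim(cudaGetErrorString(istat))
    stop
  end if

  x = x_d                       ! device-to-host copy
  print *, 'max value after kernel: ', maxval(x)
  deallocate(x_d)
end program check_launch
```

Assuming the PGI compiler, this builds with `pgfortran -Mcuda check_launch.cuf`. Repeating the two checks after each kernel in the time step and printing a marker when each one passes should identify the last kernel that finished before the hang.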
Great catch, Brent. We had a bug in the code, a sneaky one: it wasn't bad enough to crash the system, and some machines actually tolerated it.
Great catch, especially given that the problem would only affect a tiny fraction of the particles in the system.