CUDA Fortran host code waits eternally.

We are using CUDA Fortran for CFD (SPH) and we are having trouble at runtime (on some machines but not on others).
The host code seems to wait forever between kernels (we are using separate kernels as a natural synchronization barrier).
We get no CUDA errors in the kernels (as I said, on some machines the code executes flawlessly).
We are using only the default stream (stream “0”, I believe). Adding a call to cudaDeviceSynchronize() causes the host code to wait forever at that point, and even without it the program ends up stopping somewhere.

Has anybody suffered the same problem? Honestly, we are quite puzzled and cannot continue… As there are no compilation errors or runtime errors, it is impossible for us to fix this.

Best regards,

Can you give us any hints on the differences between the machines where it works and those where it doesn’t? In particular, anything related to:

  1. Type of NVIDIA hardware on the different systems
  2. Compute capabilities on the working and failing machines
  3. OS Versions - might be a long shot. All 64-bit OSes?
  4. PGI Versions. Which versions are you using? Same binary on failing and working machines?
  5. NVIDIA driver version
  6. Is the hang in the same kernel (or type of kernel) every time?
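
If it helps with item 6, one way to narrow down where the hang occurs is to check the status after every launch and print a marker around each synchronization. This is only a minimal sketch: the kernel `update_particles`, the array `x_d`, and the launch configuration are hypothetical placeholders, not taken from your code. It will not fix a hang, but the last marker printed tells you which kernel never finished.

```fortran
! Minimal sketch: check the CUDA status after each kernel launch.
! The kernel name, array, and sizes below are hypothetical.
module kernels_m
  use cudafor
  implicit none
contains
  attributes(global) subroutine update_particles(x, n)
    real, device :: x(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) x(i) = x(i) + 1.0   ! guard against out-of-range threads
  end subroutine update_particles
end module kernels_m

program check_launch
  use cudafor
  use kernels_m
  implicit none
  integer, parameter :: n = 1024
  real, device, allocatable :: x_d(:)
  integer :: istat

  allocate(x_d(n))
  x_d = 0.0

  call update_particles<<<(n + 255) / 256, 256>>>(x_d, n)

  ! Launch errors (for example a bad configuration) are reported here.
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) print *, 'launch error: ', trim(cudaGetErrorString(istat))

  ! Errors raised while the kernel runs are reported at the next synchronization.
  print *, 'waiting for update_particles ...'
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) print *, 'execution error: ', trim(cudaGetErrorString(istat))
  print *, 'update_particles finished'

  deallocate(x_d)
end program check_launch
```

Repeating the same pattern after each kernel in the time step should show whether the hang is always in the same one.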

Well, the differences are minimal:

  • Same OS in all cases (Windows 7 64-bit).
  • Same GPU architecture (Fermi) with compute capability 2.0.
  • Same PGI version, as we compile on one machine, then deploy and test on several machines. We are using the very latest, 12.5.
  • We have all machines updated to the latest NVIDIA drivers; the only difference is that one is a laptop and the other two are desktops (301.27 and 301.32).
  • Yes, the program stops at the same point on both machines where it hangs. The puzzling thing is that it stops and waits BETWEEN kernels, as if waiting for synchronization.
  • It only works on the laptop, where the only difference is that it has a less powerful GPU and it shares memory.

Is it possible for us to get the binary or source? I think we’ll need to do some low-level digging. To my knowledge we haven’t seen this behavior before. If sending us either source or binary is possible, mail it to

I just sent the source code with some test data.
I attach some instructions as well.

Thanks a lot.

I could send the binary as well, but I’m not in the office at the moment. In case you need it, we could send it tomorrow.

Thanks, I’m looking at it. I’ll send email if I have questions.

- Brent

Great catch, Brent. We had a bug in the code, a sneaky one, as it wasn’t bad enough to crash the system; it was actually tolerated on some machines.

Great catch, especially given that the problem would only affect a tiny fraction of the particles in the system.

Amazing job, thanks man :-)