Contents of loop failing to translate/compile/run?

Hello, me again. I’ve run into an issue that I think may be a bug. I’m trying to compile the usual suspect, but I’ve noticed that the contents of a loop are simply not running. Everything in the loop from lines 484 to 736 just fails to run. As with my previous compiling problems, it used to work with pgfortran, which makes me think it’s a compiler issue.

Since the full code is somewhat long, how would I go about checking whether the code in question is actually being compiled? Or should I just post the full code anyway, since I know which lines aren’t functioning and they’re all in a loop?

System info below

Tue  7 10:50:21 2023       
| NVIDIA-SMI 495.29.05    Driver Version: 470.82.01    CUDA Version: 11.5     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro RTX 6000     On   | 00000000:06:00.0 Off |                  N/A |
| N/A   30C    P8    13W / 250W |      0MiB / 24222MiB |      0%      Default |
|                               |                      |                  N/A |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

The machine is using CentOS Linux 8

I compiled it with the following command
Using NVHPC/20.9: nvfortran -cuda -v -gpu=cuda11.0 -o CRAFT CRAFT_CUDA.f90

There is a separate forum for questions pertaining to codes compiled with nvfortran.

In any case like this, it’s a good idea to create a minimal reproducer for the sake of discussion.

My mistake, I thought I was on that forum. Should I delete this and post it there?

I will move it. Note that I suggested creation of a minimal example.

Thank you. I am working on a minimal example

What I’d recommend first is updating your compiler to our latest version.

If it is a compiler issue, then it’s possible the problem has already been fixed. We also have support for CUDA 11.5 in the latest release. Granted, CUDA 11.0 binaries should run fine with a CUDA 11.5 driver, so that’s unlikely to be the issue, but it doesn’t hurt.

If it still fails, can you add print statements to your code to ensure that it’s getting to the CUDA Fortran section? I.e., is the problem with the CUDA code or something else?

If it is launching the CUDA kernel, are you capturing the error code via “cudaGetLastError” and if so, what’s it returning?

Ideally, you’d put together a minimal reproducing example, but if the problem only reproduces in the full code, that’s fine as well. If I can reproduce the issue, then I can usually find the cause.


It’s almost random what, when removed from the loop, will cause the error to go away (unfortunately, everything in there needs to be in there in the final version). Reducing it down proved difficult, as I can’t find any consistency in what can and can’t be removed, either within or outside the loop. Nevertheless, I managed to get several things out of the way, so I hope that is sufficient.

It’s going into the CUDA Fortran section and getting through everything in it up to a specific section of the loop just fine; it’s even doing everything after the loop just fine. cudaGetLastError seems to be saying everything is fine.

I’ve included a cut-down version. The loop is around lines 246 to 482, but it only starts ignoring things after line 251; everything before and after the loop seems to work just fine. I’ve also included the two files needed for it to run.

CRAFT_CUDA.f90 (32.3 KB)
Evolution.dat (624 Bytes)
Init.dat (176 Bytes)

Cluster access is closed for today, so I can’t test anything until tomorrow. I can try a few more things then if that would help, and maybe produce a more cut-down version, but again, it’s really inconsistent what causes the behavior to change.

Why are you setting the number of threads to 1 in the launch configuration?

It seems to be running, just taking a long time. Though if I change the launch config to use “threads3D”, it runs quickly and shows no errors.

 call launchToGPU<<<blocks3D,threads3D>>>(ageArray_d, RshArray_d, VshArray_d, N0Array_d, B0Array_d, T0, Pcr_guess, Pcr_max, xi_inj, X0, &
                                                   PcrOutputArray_d, T2OutputArray_d, B2OutputArray_d, FescOutputArray_d, RtotOutputArray_d, &
                                                   p_outMatrix_d, f0_outMatrix_d, fesc_outMatrix_d)
% nvfortran -cuda CRAFT_CUDA.f90 -fast
% a.out
 size:            5
 Copy time(sec)= 2.159E+00  minutes= 3.598E-02

 All arrays started
 Timesteps detected:            5
 N is           50
 NN is           50
            5           50
            5           50
 no error
 Made it to the end
 GPU runtime(sec)= 0.000E+00  minutes= 0.000E+00

 Run time(sec)= 1.000E-03  minutes= 1.667E-05

That’s strange, so it’s not printing anything from inside the main function for you? Let alone anything from inside the loop, like the print saying "are we here?"

I was using 1 thread just for testing since I’m using one timestep. Could you test with these initial conditions as well?

Evolution.dat (124 Bytes)
Init.dat (74 Bytes)

On my end I’m getting prints from inside that you don’t seem to be getting, even though I’m running the same CRAFT. I’m using the same initial conditions as you but with 1 instead of 5 timesteps and higher tolerance to make testing faster. I’m not sure why that would cause things to behave so differently.

Ah, yes, you’re correct. It looks like thread block size is too big:

threads3D = dim3(N,N,1)

N is 50, for a total of 2500 threads per block. The max block size is 1024 threads. The question is why the error handler didn’t catch this; I’m not sure.
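For reference, here’s a sketch of a launch configuration that stays under the limit (the 16x16 tile size is just an example, and it assumes the kernel computes its own global indices from blockIdx/threadIdx):

```fortran
! Sketch: tile the N x N domain into 16x16 thread blocks (256 threads each,
! well under the 1024-threads-per-block limit); round the grid up to cover all of N.
threads3D = dim3(16, 16, 1)
blocks3D  = dim3((N + 15) / 16, (N + 15) / 16, 1)

! Inside the kernel, each thread would then compute its indices and bounds-check,
! since the grid may overshoot N (CUDA Fortran thread/block indices are 1-based):
!   i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
!   j = (blockIdx%y - 1) * blockDim%y + threadIdx%y
!   if (i > N .or. j > N) return
```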

Could this be the cause of the original issue? It does match the behavior you describe.

When running with a single thread and the new data file, I see the following error, which is coming from the code. Not sure if this is expected or not:

 entering loop          101
 error rsrt:rsrtold   0.9999992370605468        0.9999984741210938
 ERROR: Try a different choice of RsRt_min
FORTRAN STOP: 0: Block (1,1,1), Thread (1,1,1) pgf90_stop08a: block: [0,0,0], thread: [0,0,0] Assertion `FORTRAN_STOP_STATEMENT` failed.
 no error
 Made it to the end
 GPU runtime(sec)= 6.952E+01  minutes= 1.159E+00


The stop is called by the code when it realizes that the numbers it’s producing don’t make sense and won’t converge properly. That’s the error catch that happens because the loop isn’t happening. Forgive me for not specifying.

If you look at the output for that run, you should see that it prints “in the loop” and then jumps back to “line 171” without any of the subsequent prints in that loop, which is what is causing the failure to converge.

As for the 2500, is it possible the compiler didn’t catch that it was too large because it’s based on the parameter “N” which is defined in a module and not in the main code?

The compiler wouldn’t be able to catch this, but I would expect a runtime error indicating a launch failure. And I just noticed the reason why it’s not catching it. Here’s your code:

  errorID = cudaGetLastError()
  write(*,*) cudaGetErrorString(cudaGetLastError())

The second time cudaGetLastError is called, there is no error. However, if you pass the errorID to cudaGetErrorString, then you’ll see the expected “invalid configuration argument” error.

  errorID = cudaGetLastError()
!  write(*,*) cudaGetErrorString(cudaGetLastError())
  write(*,*) cudaGetErrorString(errorID)
% nvfortran -cuda CRAFT_CUDA.f90 -fast ; a.out
 size:            1
 Copy time(sec)= 2.091E+00  minutes= 3.485E-02

 All arrays started
 Timesteps detected:            1
 N is           50
 NN is           50
            1           50
            1           50
 CALLING  launchToGPU             1           50
 invalid configuration argument
 Made it to the end
 GPU runtime(sec)= 0.000E+00  minutes= 0.000E+00

 Run time(sec)= 0.000E+00  minutes= 0.000E+00

I didn’t know cudaGetLastError resets the error, so calling it twice in a row doesn’t work; you learn something new every day.

Does it output anything on the version with the loop contents missing? I assume not, since it’s reaching the error catch, meaning there’s no CUDA error.

Yes, it gives the following:

FORTRAN STOP: 0: Block (1,1,1), Thread (1,1,1) pgf90_stop08a: block: [0,0,0], thread: [0,0,0] Assertion `FORTRAN_STOP_STATEMENT` failed.
 device-side assert triggered
 Made it to the end
 GPU runtime(sec)= 7.035E+01  minutes= 1.172E+00

0: copyout Memcpy (host=0x2cee9e0, dev=0x15485a800a00, size=8) FAILED: 710(device-side assert triggered)

When a Fortran STOP statement is reached in a kernel, a device-side assertion is triggered, which in turn kills the CUDA context. This is why there’s the memcpy error.

With OpenACC/OpenMP, our runtime can catch this and in turn abort the host program. But in CUDA Fortran, you’d need to catch the error and call STOP (or whatever error handling you need) from the host. Something like the following:

  errorID = cudaGetLastError()
  if (errorID .ne. 0) then
      write(*,*) "ERROR with launchToGPU ", cudaGetErrorString(errorID)
      STOP errorID
  endif
Forgive me, I wasn’t clear. I mean there’s no error explaining why the contents of the loop are being skipped.

Yes, my typo. Max block size is 1024. 2048 is the max threads that can run on a single SM.

Is it possible to fetch an error that may be produced before the end of the GPU code? I’m wondering if the loop is being skipped because of some error that only exists while it’s running, and once it reaches the stop, it’s overwritten by the STOP statement. I don’t remember reading about that in the error-handling section, but then again, I only planned for errors that would stop the code, not ones it would continue running after.

I don’t have any “exit” or “cycle” statements in that loop, but is it possible one is being put in there implicitly by something?

I’m not aware of a way to capture the error number from the STOP statement on the host. The assert just gets printed to stderr.

While it may be a bit more work, you can consider passing in a global error variable that gets set in the kernel and then read once the kernel exits. Calling STOP has the potential to corrupt the CUDA context, so I’m not sure how reliable this would be if you did both; rather, remove the STOP statements and just set the global error variable instead. Optionally, you can also call threadfence so the other threads see the update, and then add logic to the code so other threads exit when there’s an error. I’m not entirely sure how this would work for your case, but it might be something to investigate.
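A minimal sketch of what that might look like (names like "gpu_errcode" and "err_mod" are made up for illustration):

```fortran
module err_mod
  use cudafor
  integer, device :: gpu_errcode = 0   ! device-global error flag
end module err_mod

! In the kernel, instead of STOP:
!   gpu_errcode = 1         ! record the failure
!   call threadfence()      ! make the update visible to other threads
!   return                  ! bail out of this thread

! On the host, after the launch:
!   istat = cudaDeviceSynchronize()
!   hosterr = gpu_errcode   ! assignment copies the device module variable back
!   if (hosterr /= 0) stop "kernel reported an error"
```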

I don’t have any “exit” or “cycle” statements in that loop, but is it possible one is being put in there implicitly by something?

No. Though it doesn’t look like you’re initializing the “jf” and “Nprove” variables, so the if condition is using garbage values. If I initialize them just before the do while loop, then it runs to completion:

     Nprove = 0
     jf = 0
     print *, "entering loop", Nprove, jf

     do while((Nprove.le.100.and.jf.le.4))
        !print *, "Nprove=", Nprove

        print *, "in loop"

Now, I still see the “something funny is happening here” message, so I don’t know if the code is correct, just that it doesn’t hit the stop statement.

Ah, I’ve fallen victim to my own laziness. On my machine, when using gcc, I have it set up to automatically initialize variables to zero. I’m guessing, then, that nvfortran does not do that.

Figures it would be something like that; I feel a tad silly bringing this to you. For the future, is there a compiler flag to catch uninitialized variables like that? I looked at the documentation but didn’t see any that looked like they would.

I’m once again grateful for your time and patience.

We do have the “-Msave” flag, which has the side effect of initializing local variables to zero, but I don’t recommend using it. Its actual purpose is to apply the SAVE attribute to all local variables, which can cause issues in that the values of local variables carry over from the last call. Plus, I think it’s only available for host code.

Also, I don’t think initializing Nprove and jf upon entry to the subroutine would help here given they need to be initialized each time through the do while and do loops.

On the host, I’ll use Valgrind to detect uninitialized memory reads (UMRs), but the equivalent compute-sanitizer functionality, “--tool initcheck”, is only available to check UMRs in global memory.