Why does my ocean wave simulation package accelerated by OpenACC work well on every card except the A100?

I have an ocean-wave simulation package accelerated by OpenACC+OpenMPI. Curiously, the package can fully exploit the GPU resources on any card (e.g., V100) except the A100. On our recently deployed 8-card A100 server, the package only runs normally with a compilation environment of CUDA 11.2 + Driver 460. With CUDA 11.3 and above, the simulation is not fully accelerated, and nvidia-smi shows that the power draw stays constantly between 60 and 70 W even though GPU utilization unexpectedly reaches 100%.
Does anyone have a similar experience? Is there any possibility that the A100 has some conflict with OpenACC?

Hi YuanYe1,

Is there any possibility that the A100 has some conflict with OpenACC?

No, OpenACC works well on A100s, so that shouldn’t be the issue.

the package only runs normally with a compilation environment of CUDA 11.2 + Driver 460.

How long ago did you compile the code? Perhaps you built with a software stack that predates the A100 or CUDA 11.3?

Also, the A100 is a much bigger device. Maybe the workload is simply too small to take advantage of using 8 of them?

Have you run any profiling with Nsight Systems and Nsight Compute to get a better understanding of the code's performance characteristics?

-Mat

Hi Mat,
Glad to hear from you. This is an F90 code package for wave simulation (WAM 6) augmented with OpenACC, and the simulation case is global in scale, so the workload is not a problem. Another piece of information: all of my code packages (including other global ocean circulation models with OpenACC) only run normally on the A100 with CUDA 11.2 and Driver 460. The machine I deployed is an Inspur GPU server from a Chinese manufacturer. When compiled with CUDA 11.3 and above, I am surprised that the A100 power level stays between 60 and 70 W while GPU utilization constantly reaches 100%. However, this is not the case on any other GPU device. I usually profile code with nvprof and have not yet tried Nsight. I will follow your suggestion and profile the code with Nsight to see what happens.
Thanks,
regards.
Ye

Hi Mat,
Following your instruction, I used Nsight to profile the code package and compared the output under different compilation environments (11.2 and 11.6). Now the code package runs normally. FYI, I found that:

  1. In 11.6, write(*,*) cannot reside inside an ACC LOOP; if it does, the compiler may prevent the loop from being parallelized. In 11.2, the same code is parallelized normally (a minimal sketch of this pattern follows the list).
  2. In a two-level nested loop that calls a FUNCTION, for example,

       do j = 1, N
          do i = 1, M
             TMP(i,j) = Whatever_Function( VAR1(i), VAR2(j) )
          enddo
       enddo

     VAR1 and VAR2 must have the same shape as TMP (i.e., be 2-dimensional); otherwise the 11.6 compiler may prevent the parallelization and report "Loop carried dependence due to exposed use of VAR1(:) and VAR2(:)".
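
As a minimal sketch of the pattern in point 1 (the program, loop body, and names such as N and VAL are illustrative placeholders, not code from WAM):

  program write_in_acc_loop
    implicit none
    integer, parameter :: N = 1024
    integer :: i
    real :: VAL(N)

    VAL = 0.0

    !$acc kernels
    do i = 1, N
       VAL(i) = real(i)
       ! Under the newer compiler, the write statement below is what stops
       ! the kernels construct from auto-parallelizing this loop; under the
       ! 11.2-era compiler the loop was still parallelized.
       if (VAL(i) < 0.0) write(*,*) 'unexpected negative at ', i
    enddo
    !$acc end kernels

    print *, 'checksum = ', sum(VAL)
  end program write_in_acc_loop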

Hi Ye,

By 11.2 and 11.6, I’m assuming you mean two different versions of the NV HPC SDK, most likely 21.7 vs 22.5.

For #1, yes, there was a slight change in behavior between 21.7 and 21.9 with "write", which is now implemented as a subroutine call. Since Fortran passes arguments to subroutines by reference, this prevents the compiler from auto-parallelizing the loop, since it could have a dependency. The solution would be to add "loop independent" to the loop, or switch from the "kernels" construct to "parallel".
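
A minimal sketch of both workarounds, again with placeholder names rather than code from the actual package:

  program write_workarounds
    implicit none
    integer, parameter :: N = 1024
    integer :: i
    real :: VAL(N)

    VAL = 1.0

    ! Option 1: keep "kernels" and assert with "loop independent" that the
    ! iterations have no dependencies, so the runtime call generated for the
    ! write no longer blocks parallelization.
    !$acc kernels
    !$acc loop independent
    do i = 1, N
       VAL(i) = 2.0 * VAL(i)
       if (VAL(i) < 0.0) write(*,*) 'unexpected negative at ', i
    enddo
    !$acc end kernels

    ! Option 2: switch to "parallel loop", where the programmer, not the
    ! compiler, is responsible for declaring the loop safe to parallelize.
    !$acc parallel loop
    do i = 1, N
       VAL(i) = VAL(i) + 1.0
    enddo

    print *, 'checksum = ', sum(VAL)
  end program write_workarounds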

For #2, do you have a reproducing example? I doubt the shape of the arguments would affect this, so I assume something else is going on.

-Mat