nsys profile mpirun -np 1 ./MyOpenACC_App ./input.file raises a floating point error

Hi, my app uses OpenACC and works well without nsys profile. I'd like to use nsys profile to report the app's hotspots/bottlenecks, but I get a floating point exception. The full error output is:

nsys profile --trace=cuda --sample=cpu mpirun --allow-run-as-root -np 1 ./MyOpenACC_App ./input.file

MPI Enabled; Number of MPI Processes: 1
OpenMP Disabled

MPI version: 3.1
MPI library version: Open MPI v3.1.5, package: Open MPI qa@sky2 Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 15, 2019

Job TITLE : General purpose input file to test app timings
Job ID string : app_32a

[noh4tgwkhu:09412] *** Process received signal ***
[noh4tgwkhu:09412] Signal: Floating point exception (8)
[noh4tgwkhu:09412] Signal code: Invalid floating point operation (7)
[noh4tgwkhu:09412] Failing at address: 0x7f8ebdb38af4
[noh4tgwkhu:09412] [ 0] /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib/libopen-pal.so.40(+0xbe284)[0x7f8ecfd58284]
[noh4tgwkhu:09412] [ 1] /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/target-linux-x64/libToolsInjection64.so(+0x4df306)[0x7f8ed446a306]
[noh4tgwkhu:09412] [ 2] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f8ed0f97420]
[noh4tgwkhu:09412] [ 3] /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/target-linux-x64/libcupti.so.12.1(+0x285af4)[0x7f8ebdb38af4]
[noh4tgwkhu:09412] [ 4] /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/target-linux-x64/libcupti.so.12.1(+0x117ca1)[0x7f8ebd9caca1]
[noh4tgwkhu:09412] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/target-linux-x64/libcupti.so.12.1(+0x21d832)[0x7f8ebdad0832]
[noh4tgwkhu:09412] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/target-linux-x64/libcupti.so.12.1(+0x21fb3e)[0x7f8ebdad2b3e]
[noh4tgwkhu:09412] [ 7] /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/target-linux-x64/libcupti.so.12.1(+0x2a33f8)[0x7f8ebdb563f8]
[noh4tgwkhu:09412] [ 8] /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/target-linux-x64/libToolsInjection64.so(+0x4ec36d)[0x7f8ed447736d]
[noh4tgwkhu:09412] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f8ed0f8b609]
[noh4tgwkhu:09412] [10] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f8ed0827133]
[noh4tgwkhu:09412] *** End of error message ***

Can anyone help me out with this? Thanks.

Hello @honggangwang1979 and welcome to the NVIDIA developer forums!

We have dedicated forum categories for Nsight, so I took the liberty of moving your topic to the one where I think you are most likely to receive the help you need.

Thanks!

@rdietrich This may be related to the other issue I asked you to look into.

There are a few things we can try to narrow down the problem:

  • Try OpenMPI 4.1.5, which is also shipped with the HPC-SDK 23.7.
  • Try the most recent Nsight Systems, 2023.4.1.
  • Try without CUDA tracing (-t none or -t mpi,nvtx); see the example command below.

If the first two bullet points do not help, we need to know more about the execution environment. I assume that you are running Nsight Systems inside a container, since you are using --allow-run-as-root, right? If so, could you tell us which one, maybe an NGC container? What is the CUDA driver version? You are probably using CTK 12.2 from the HPC-SDK 23.7, right?
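For the third bullet, a reduced-tracing run would look roughly like this (adjust the trace list to what you need):

nsys profile --trace=mpi,nvtx --sample=cpu mpirun --allow-run-as-root -np 1 ./MyOpenACC_App ./input.file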

Hi rdietrich,

Thank you so much for your response.

  • Try OpenMPI 4.1.5, which is also shipped with the HPC-SDK 23.7.

    my version of HPC-SDK 23.7 was shipped with OpenMPI 3.1.5, as shown below:
    mpifort --showme
    nvfortran -I/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/openmpi/openmpi-3.1.5/include -I/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib -Wl,-rpath -Wl,$ORIGIN:$ORIGIN/…/…/lib:$ORIGIN/…/…/…/lib:$ORIGIN/…/…/…/compilers/lib:$ORIGIN/…/…/…/…/compilers/lib:$ORIGIN/…/…/…/…/…/compilers/lib -rpath /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib -L/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi

  • Try the most recent Nsight Systems, 2023.4.1.
    When I try to install the rpm package, I get:
    rpm -i *.rpm
    rpm: RPM should not be used directly install RPM packages, use Alien instead!
    rpm: However assuming you know what you are doing…
    warning: NsightSystems-linux-cli-public-2023.4.1.84-3351519.rpm: Header V3 RSA/SHA512 Signature, key ID 7fa2af80: NOKEY
    error: Failed dependencies:
    glibc is needed by nsight-systems-cli-2023.4.1-2023.4.1.84_234133515197v0-0.x86_64
    /bin/sh is needed by nsight-systems-cli-2023.4.1-2023.4.1.84_234133515197v0-0.x86_64

  • Try without CUDA tracing (-t none or -t mpi,nvtx).
    -t none, -t mpi, and -t nvtx all let the code run and generate report files, but when I use nsys stats *rep, I get something like:

SKIPPED: report1.sqlite does not contain OS Runtime trace data.
SKIPPED: report1.sqlite does not contain CUDA trace data.

However, the good news is that I can now get some useful information by setting export PGI_ACC_TIME=1 in the environment before running my code, like this:
4362: compute region reached 687 times
4364: kernel launched 687 times
grid: [48] block: [128]
elapsed time(us): total=47,638 max=1,466 min=0 avg=69
4385: kernel launched 687 times
grid: [48] block: [128]
elapsed time(us): total=291,603 max=3,254 min=65 avg=424
For now, this information is enough for me to check the application's bottlenecks, although it is not perfect.

Now I have another question:
When I run my code on a P5000, both mesh sizes of 32x32x32 and 64x64x64 work. However, if I run it on an A4000, only the smaller one works; the larger one fails with this error:

call to cuEventSynchronize returned error 700: Illegal address during kernel execution

When I debug it with cuda-gdb, I got this:

Thread 1 received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 2, grid 9, block (106,0,0), thread (104,0,0), device 0, sm 14, warp 0, lane 8]
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()’ failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.

----- Backtrace -----
0x00007fffd337e300 in MyApple_4212_gpu<<<(190,1,1),(128,1,1)>>> (0x62d087 ???
0x9e1344 ???
0x9e169c ???
0xb94ac1 ???
0x7a3036 ???
0x7a5967 ???
0x7a6469 ???
0x9f6bdb ???
0x9e56fa ???
0x71a479 ???
0x7303c2 ???
0x730a45 ???
0x78f849 ???
0x97071e ???
0x971239 ???
0x972b32 ???
0x973315 ???
0x7daacb ???
0x65ff9a ???
0x7e9813 ???
0x7dc283 ???
0x7e7030 ???
0xb9554c ???
0xb95736 ???
0x828fe4 ???
0x82a8a4 ???
0x5679e4 ???
0x7efe97b05082 __libc_start_main
…/csu/libc-start.c:308
0x56d8f4 ???
0xffffffffffffffff ???

cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()’ failed.
A problem internal to GDB has been detected,
This is a bug, please report it. For instructions, see:
https://www.gnu.org/software/gdb/bugs/.

(I haven’t reported it yet to gnu.org)

Thank you very much!

The Open MPI 4.1.5 binaries are located in /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/openmpi4/bin. Just use mpicc, mpifort, etc. and the correct MPI libraries will be linked.

If the rpm package does not work, you can always use the .run installer.

The last bullet point was just to check whether the issue is likely related to CUDA tracing. Does the error occur if you execute nsys profile --trace=osrt,nvtx --sample=cpu mpirun --allow-run-as-root -np 1 ./MyOpenACC_App ./input.file?

Concerning the "Illegal address during kernel execution" error, I can only guess. It could be an address that has not been mapped to the device, or an out-of-bounds access. What CC (compute capability) are you compiling for? Are you using managed memory? Maybe PGI_ACC_DEBUG=1 can give us a few more hints.
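For reference, explicitly targeting a compute capability and enabling managed memory with nvfortran would look something like this (the A4000 is an Ampere GPU, so cc86; the exact flags are only illustrative):

mpifort -acc -gpu=cc86,managed -Minfo=accel -o MyOpenACC_App <your source files>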

(1) Yes, I am using mpifort, as shown below:

which mpifort

/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/bin/mpifort

but when I use mpifort to compile, it looks like openmpi-3.1.5 is linked instead of 4.1.5, as can be seen in the output of my application:

Current Date : November 20, 2023 19:16:48
Revision : -dirty-master
Revision Date : Mon Nov 13 21:28:41 2023 +0000
Compiler : Gnu gfortran 9.4.0-1ubuntu1~20.04.2)
Compilation Date : Nov 20, 2023 16:57:04

MPI Enabled; Number of MPI Processes: 1
OpenMP Disabled

MPI version: 3.1
MPI library version: Open MPI v3.1.5, package: Open MPI qa@sky2 Distribution, ident: 3.1.5, repo rev: v3.1.5, Nov 15, 2019

So how can I avoid linking against 3.1.5 and use 4.1.5 instead?

(2) nsys profile --trace=osrt,nvtx --sample=cpu mpirun --allow-run-as-root -np 1 ./MyOpenACC_App ./input.file does work for me with no errors, but it only outputs the OS Runtime Summary (osrt_sum), like:
Time (%)  Total Time (ns)  Num Calls      Avg (ns)      Med (ns)   Min (ns)     Max (ns)   StdDev (ns)  Name
    37.4     100951615058        446   226348912.7   100132889.5          0  33588383476  1760055662.1  poll
    25.1      67557072263         19  3555635382.3  2001862734.0   64066833  33779072565  7332477206.5  select
    25.0      67407107911         73   923385039.9      101277.0        390  33583880626  5519549803.6  epoll_wait
    12.4      33505037920         67   500075192.8   500071686.0  500062409    500119798       11537.6  pthread_cond_timedwait
     0.0         74117751        469      158033.6        9514.0        684     24276887     1424818.4  ioctl

There is no GPU-related information. I did see a lot of SKIPPED messages like:

Processing [report3.sqlite] with [/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/host-linux-x64/reports/cuda_api_sum.py]…
SKIPPED: report3.sqlite does not contain CUDA trace data.

Processing [report3.sqlite] with [/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_kern_sum.py]…
SKIPPED: report3.sqlite does not contain CUDA kernel data.

Processing [report3.sqlite] with [/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_mem_time_sum.py]…
SKIPPED: report3.sqlite does not contain GPU memory data.

Processing [report3.sqlite] with [/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_mem_size_sum.py]…
SKIPPED: report3.sqlite does not contain GPU memory data.

Processing [report3.sqlite] with [/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/host-linux-x64/reports/openmp_sum.py]…
SKIPPED: report3.sqlite does not contain OpenMP event data.

Processing [report3.sqlite] with [/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/host-linux-x64/reports/opengl_khr_range_sum.py]…
SKIPPED: report3.sqlite does not contain KHR Extension (KHR_DEBUG) data.

Processing [report3.sqlite] with [/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/host-linux-x64/reports/opengl_khr_gpu_range_sum.py]…
SKIPPED: report3.sqlite does not contain GPU KHR Extension (KHR_DEBUG) data.

Processing [report3.sqlite] with [/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/host-linux-x64/reports/vulkan_marker_sum.py]…
SKIPPED: report3.sqlite does not contain Vulkan Debug Extension (Vulkan Debug Util) data.

Processing [report3.sqlite] with [/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/profilers/Nsight_Systems/host-linux-x64/reports/vulkan_gpu_marker_sum.py]…
SKIPPED: report3.sqlite does not contain GPU Vulkan Debug Extension (GPU Vulkan Debug markers) data.

(3) About the "Illegal address during kernel execution" error: I am using an A4000, so nvfortran/mpifort should be able to automatically target the correct compute capability, right?

mpifort --version

nvfortran 23.7-0 64-bit target on x86-64 Linux -tp icelake-server
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

I am using the OpenACC implementation shipped with HPC-SDK 23.7; I am not sure whether it uses managed memory (although I think so).

Thank you so much!

Best Regards.

To use Open MPI 4.1.5, you can use the absolute path /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/openmpi4/bin/mpifort or export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/openmpi4/bin:$PATH before calling mpifort.
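To double-check which Open MPI library an already-built binary will actually load, something like ldd ./MyOpenACC_App | grep -i mpi should show whether the openmpi-3.1.5 or the openmpi4 libraries get resolved.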

If the issue is related to CUPTI, you should try nsys profile --trace=cuda --cuda-graph-trace=node --sample=cpu mpirun --allow-run-as-root -np 1 ./MyOpenACC_App ./input.file.

I guess that the error is related to the OpenACC pragmas in the code, maybe some array indexing issue with C vs. Fortran style or data not being “present” on the GPU. If you search for “700: Illegal address during kernel execution” in the forum, you get a couple of hits. Maybe these threads already help? Otherwise we should look at your code, especially the OpenACC pragmas.

Thanks rdietrich.

(1) My code has three nested levels of loops, like this:

!$ACC kernels
!$ACC loop independent gang
DO I=1, II
   ! do something simple here
   !$ACC loop independent worker
   DO M=1, MM
      ! do something simple here
      !$ACC loop independent vector
      DO N=1, NN
         ! do something here
      ENDDO
   ENDDO

   ! do something simple here
   !$ACC loop independent worker
   DO J=1, JJ
      ! do something simple here
      IF (STAT) THEN
         !$ACC loop independent vector
         DO K=1, KK
            ! do something simple here
         ENDDO
      ELSE
         !$ACC loop independent vector
         DO K=1, KK
            ! do something simple here
         ENDDO
      ENDIF
   ENDDO
ENDDO
!$ACC end kernels

(2) After a series of tests, I found that the code works for both small and large mesh sizes on the P5000, RTX4000, and RTX5000, but fails on the A4000 with the 700 error. However, if I explicitly set the number of gangs for the first level of the loops (just under !$ACC kernels; see the sketch below), the code works for both the small mesh size of 32 and the larger size of 64 on the A4000, as long as the gang number is less than 64. At the critical value of 64, the small mesh size survives, but the larger mesh size fails with the 700 error. When it failed, I found that the number of gangs the runtime automatically assigned to the top-level loop was 94, which is greater than 64, so the code failed. Therefore, the error may somehow be related to device resource allocation at runtime?
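A minimal sketch of pinning the gang count on the outermost loop inside a kernels region (the value 48 is only illustrative, and this is not necessarily the exact form used above):

!$ACC kernels
!$ACC loop independent gang(48)
DO I=1, II
   ! loop body as in the code above
ENDDO
!$ACC end kernels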

Thanks.

Sincerely,

Honggang Wang.

Hi Honggang Wang,

Robert asked me to jump in since I support the NVHPC Compilers and OpenACC.

Given the details you describe, my best guess is that you're encountering a stack overflow. Try setting the environment variable "NV_ACC_CUDA_STACKSIZE" to a larger value, such as 32 or 64MB, to see if that helps. Note that the exact stack size varies by device and there is a hard limit, so you can't go too large and might need to lower the value.

If you are using automatic arrays or allocating an array on the device, then you could be getting a heap overflow, in which case set the environment variable “NV_ACC_CUDA_HEAPSIZE” to a larger value. There’s no hard limit on the heap size (just the limit of available device memory).

I've also seen issues like this (i.e., works at a smaller size, fails at a larger size) when the size of a single object is larger than 2GB, such as the size in bytes of an array or the aggregate size of a shared array. However, I'd expect the other devices to fail as well if this were the case here. Just in case, to fix this, add the compiler flag "-mcmodel=medium" or "-Mlarge_arrays" to allow for objects larger than 2GB.
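For example, a test run with these knobs could look like this (a sketch only; the values are illustrative and may need tuning for your device):

export NV_ACC_CUDA_STACKSIZE=64MB
export NV_ACC_CUDA_HEAPSIZE=1GB
mpirun --allow-run-as-root -np 1 ./MyOpenACC_App ./input.file

and, for the >2GB case, rebuild with -Mlarge_arrays (or -mcmodel=medium) before rerunning.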

If none of these options help, please provide a reproducing example. I don’t have access to an A4000 so may not be able to reproduce the error, but seeing the complete code might give some clues.

-Mat

Thank you Mat.

Since I have been using some cloud GPU resources, I cannot grab an A4000 at the moment, so I cannot try the environment variables you mentioned. However, when I ran my code with different input files (their mesh sizes do not differ much), one succeeded and one failed. The failing one crashed at this loop:

!$ACC kernels
!$ACC loop independent
WALL_LOOP1: DO IW=1,N_EXTERNAL_WALL_CELLS+N_INTERNAL_WALL_CELLS
WC => WALL(IW)
IF (WC%BOUNDARY_TYPE==NULL_BOUNDARY) CYCLE WALL_LOOP1
BC => BOUNDARY_COORD(WC%BC_INDEX)
BR => BOUNDARY_RADIA(WC%BR_INDEX)
B1 => BOUNDARY_PROP1(WC%B1_INDEX)
IOR = BC%IOR
IF (DLN(IOR,N) < 0._EB) CYCLE WALL_LOOP1
II = BC%II
JJ = BC%JJ
KK = BC%KK
IF (.NOT.TWO_D .OR. ABS(IOR)/=2) THEN
SELECT CASE (WC%BOUNDARY_TYPE)
CASE (OPEN_BOUNDARY)
VT => VENTS(WC%VENT_INDEX)
IF (VT%TMP_EXTERIOR>0._EB) THEN
TSI = T - T_BEGIN
TMP_EXTERIOR = TMP_0(KK)+EVALUATE_RAMP(TSI,VT%TMP_EXTERIOR_RAMP_INDEX)*(VT%TMP_EXTERIOR-TMP_0(KK))
IL(II,JJ,KK) = BBFA*RPI_SIGMA*TMP_EXTERIOR**4
ELSE
IL(II,JJ,KK) = BBFA*RPI_SIGMA*TMPA4
ENDIF
CASE (MIRROR_BOUNDARY)
BR%BAND(IBND)%ILW(N) = BR%BAND(IBND)%ILW(DLM(N,ABS(IOR)))
IL(II,JJ,KK) = BR%BAND(IBND)%ILW(N)
CASE (INTERPOLATED_BOUNDARY)
! IL_R holds the intensities from mesh NOM in the ghost cells of mesh NM.
! IL(II,JJ,KK) is the average of the intensities from the other mesh.
NOM = EXTERNAL_WALL(IW)%NOM
IL(II,JJ,KK) = 0._EB
DO LL=EXTERNAL_WALL(IW)%NIC_MIN,EXTERNAL_WALL(IW)%NIC_MAX
IL(II,JJ,KK) = IL(II,JJ,KK) + OMESH(NOM)%IL_R(LL,N,IBND)
ENDDO
IL(II,JJ,KK) = IL(II,JJ,KK)/REAL(EXTERNAL_WALL(IW)%NIC_MAX-EXTERNAL_WALL(IW)%NIC_MIN+1,EB)

CASE DEFAULT ! solid wall
BR%BAND(IBND)%ILW(N) = OUTRAD_W(IW) + RPI*(1._EB-B1%EMISSIVITY)*INRAD_W(IW)
IC = CELL_INDEX(BC%IIG,BC%JJG,BC%KKG)
CELL_ILW(IC,ABS(IOR)) = BR%BAND(IBND)%ILW(N)
END SELECT
ELSEIF (CYLINDRICAL) THEN
IF (WC%BOUNDARY_TYPE==OPEN_BOUNDARY) CYCLE WALL_LOOP1
IL(II,JJ,KK) = BR%BAND(IBND)%ILW(N)
ENDIF
ENDDO WALL_LOOP1
!$ACC end kernels

With cuda-gdb, the errors are:

CUDA Exception: Warp Illegal Address

Thread 1 “MyApp” received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 2, grid 5, block (0,0,0), thread (64,0,0), device 0, sm 0, warp 0, lane 0]
cuda-gdb/12/gdb/gdbtypes.c:5831: internal-error: copy_type: Assertion `type->is_objfile_owned ()’ failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x000001000037dc18 in radiation_fvm_4045_gpu<<<(10,1,1),(128,1,1)>>> (0x62d087 ???
0x9e1344 ???
0x9e169c ???
0xb94ac1 ???
0x7a3036 ???
0x7a5967 ???
0x7a6469 ???
0x9f6bdb ???
0x9e56fa ???
0x71a479 ???
0x7303c2 ???
0x730a45 ???

0x78f849 ???
0x97071e ???
0x971239 ???
0x972b32 ???
0x973315 ???
0x7daacb ???
0x65ff9a ???
0x7e9813 ???
0x7dc283 ???
0x7e7030 ???
0xb9554c ???
0xb95736 ???
0x828fe4 ???
0x82a8a4 ???
0x5679e4 ???
0x7f93fd3d7082 __libc_start_main
…/csu/libc-start.c:308
0x56d8f4 ???
0xffffffffffffffff ???

These tests were made on the P5000, just to make that clear.

The above loop is too big, so I removed the ACC directives from it; the code then passes this loop but fails in another, smaller loop:
4362 !$ACC Kernels
4363 !$ACC loop independent
4364 WALL_LOOP2: DO IW=1,N_EXTERNAL_WALL_CELLS+N_INTERNAL_WALL_CELLS
4365 WC => WALL(IW)
4366 IF (WC%BOUNDARY_TYPE==NULL_BOUNDARY) CYCLE WALL_LOOP2
4367 IF (WC%BOUNDARY_TYPE==OPEN_BOUNDARY) CYCLE WALL_LOOP2
4368 BC => BOUNDARY_COORD(WC%BC_INDEX)
4369 BR => BOUNDARY_RADIA(WC%BR_INDEX)
4370 IOR = BC%IOR
4371 IF (TWO_D .AND. .NOT.CYLINDRICAL .AND. ABS(IOR)==2) CYCLE WALL_LOOP2 ! 2-D non cylindrical
4372 IF (DLN(IOR,N)>=0._EB) CYCLE WALL_LOOP2 ! outgoing
4373 IIG = BC%IIG
4374 JJG = BC%JJG
4375 KKG = BC%KKG
4376 INRAD_W(IW) = INRAD_W(IW) + DLN(IOR,N) * BR%BAND(IBND)%ILW(N) ! update incoming rad, step 1
4377 BR%BAND(IBND)%ILW(N) = IL(IIG,JJG,KKG)
4378 INRAD_W(IW) = INRAD_W(IW) - DLN(IOR,N) * BR%BAND(IBND)%ILW(N) ! update incoming rad, step 2
4379 ENDDO WALL_LOOP2
4380 !$ACC end Kernels

The same error occurs.

The interesting thing is that when I use cuda-gdb, both of my input files fail at this loop. However, if I run them directly, one survives and the other core dumps with the same error.

I'm guessing this is due to host variables, such as N_EXTERNAL_WALL_CELLS and N_INTERNAL_WALL_CELLS, being used in the kernel without device memory being allocated?
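For what it's worth, scalars like N_EXTERNAL_WALL_CELLS and N_INTERNAL_WALL_CELLS are normally copied to the kernel automatically, so they are unlikely to be the problem by themselves; the derived-type pointers (WC, BR, etc.) are more likely candidates. A rough, purely illustrative sketch of an explicit data region for the plain arrays (derived types with pointer components would need deep copy or managed memory, which is discussed below):

!$ACC data copyin(DLN) copy(IL, INRAD_W)
!$ACC kernels
!$ACC loop independent
WALL_LOOP2: DO IW=1,N_EXTERNAL_WALL_CELLS+N_INTERNAL_WALL_CELLS
   ! loop body as above
ENDDO WALL_LOOP2
!$ACC end kernels
!$ACC end data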

Thanks.

Sincerely,

Honggang Wang.

Hi Mat,

Over this past weekend I have been testing my code, and I found that the behavior of pointer association ("=>") with the mpifort compiler shipped with HPC-SDK 23.7 is different from that of GNU mpifort, as shown below:
(1) debugging of the code compiled by GNU mpifort (GNU Fortran (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0):
592 WORK1=>M%WORK1
(gdb) p WORK1
$5 =
(gdb) n
593 WORK2=>M%WORK2
(gdb) p WORK1
$6 = (((0, <repeats 20 times>) (0, <repeats 20 times>) (0, <repeats 20 times>) (0, <repeats 20 times>) (0, <repeats 20 times>) (0, <repeats 20 times>) (0, <repeats 20 times>) (0, <repeats 20 times>) (0, <repeats 20 times>) (0, <repeats 20 times>) …) …)
(gdb) p size(WORK1)
$7 = 3920

As shown above, at line 592 the pointer WORK1 is about to be associated with M%WORK1. Before this statement it is not associated, and after it, it is associated correctly. This does not happen in the code compiled by the mpifort shipped with 23.7, as shown below:

592 WORK1=>M%WORK1
(cuda-gdb) p WORK1
$1 =
(cuda-gdb) n
593 WORK2=>M%WORK2
(cuda-gdb) p WORK1
$2 = (((0)))
(cuda-gdb) p size(WORK1)
$3 = 1
(cuda-gdb)

After the pointer association, the pointer points to an array with a single element whose value is 0.
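For reference, a minimal standalone program with the same pattern (made-up names and sizes, not the FDS data structures) that can be compiled with both gfortran and nvfortran to compare what the debuggers show:

program ptr_assoc_check
   implicit none
   type :: mesh_t
      real, allocatable :: work1(:,:,:)
   end type mesh_t
   type(mesh_t), target :: m
   real, pointer :: work1(:,:,:)
   allocate(m%work1(10,10,10))
   m%work1 = 0.0
   work1 => m%work1                            ! break here and print work1 / size(work1)
   print *, associated(work1), size(work1)     ! expected: T 1000
end program ptr_assoc_check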

This comparison can be reproduced as follows:
(1) clone https://github.com/honggangwang1979/fds.git
(2) Install hpc-sdk 23.7:
/content/gdrive/MyDrive/firemodels# cat *sh
curl https://developer.download.nvidia.com/hpc-sdk/ubuntu/DEB-GPG-KEY-NVIDIA-HPC-SDK | gpg --dearmor -o /usr/share/keyrings/nvidia-hpcsdk-archive-keyring.gpg
echo 'deb [signed-by=/usr/share/keyrings/nvidia-hpcsdk-archive-keyring.gpg] https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /' | tee /etc/apt/sources.list.d/nvhpc.list
apt-get update -y
apt-get install -y nvhpc-23-7

NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/23.7/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/23.7/compilers/bin:$PATH; export PATH
export PATH=$NVCOMPILERS/$NVARCH/23.7/comm_libs/openmpi4/bin:$PATH

(3) go to ~/firemodels/fds/Build/ompi_gnu_linux_db_nomp and run make.sh. This uses the mpifort shipped with HPC-SDK 23.7.

(4) copy the compiled fds executable to ~/firemodels/fds/fds_gpu_test, where there is an FDS input file simple_test.fds
(5) debug by using: mpirun -n 1 cuda-gdb ./fds* (here is another difference: with the GNU mpifort build, I can directly run "./fds simple_test.fds", which by default uses one process, but with the NVIDIA HPC-SDK mpifort build, I have to use mpirun even for one process)
(6) within cuda-gdb, set b …/…/Source/init.f90:283, then r ./simple_test.fds; when line 283 is reached, set b …/…/Source/mesh.f90:592, then type "c"; you will stop at line 592 and see the output I showed above in (2)
(7) repeat the process with the GNU mpifort and you will see the output I showed above in (1).

I have also observed different behavior of the move_alloc(FROM, TO) intrinsic between the two mpifort compilers: with GNU mpifort, TO is valid after the call, while with the NVIDIA mpifort, TO is not correct and may lead to memory corruption.
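A minimal standalone check of the move_alloc semantics (a sketch with made-up names, independent of the FDS code) that can be built with both compilers:

program move_alloc_check
   implicit none
   integer, allocatable :: from(:), to(:)
   allocate(from(5))
   from = 42
   call move_alloc(from, to)
   ! Per the Fortran standard, FROM must be deallocated and TO must hold the old contents.
   print *, allocated(from), allocated(to)   ! expected: F T
   print *, to                               ! expected: 42 42 42 42 42
end program move_alloc_check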

These two differences may be the cause of the errors I mentioned earlier in this post.

Have any consistency tests been run on the NVIDIA HPC-SDK mpifort to make sure its behavior matches that of GNU mpifort?

Thank you so much and let me know if you have any questions.

Honggang Wang.

I took a look at the WORK1 issue. I think this is more of a debugger issue than a problem in the code. If I print the value of WORK1, it seems to contain the initial values. Are you getting correct results with both compilers?

For the OpenACC code, I found WALL_LOOP1 and WALL_LOOP2 in "Source_ACC/radi.f90", though no OpenACC directives are used there. I'm assuming you added them to a local version.

In OpenACC, scalars are private by default, but arrays and pointers are shared. Hence, be sure to put pointers like "WC", "BC", "BR", etc. in a private clause, similar to what you have in the OpenMP directive. Otherwise you'll get race conditions.

Given the complexity of the data structures, it will be a challenge for you to manually manage the data movement. Hence, I suggest you try compiling with managed memory enabled (-gpu=managed). That way the CUDA driver will take care of the movement for you. While managed memory only works with allocated data and you still need to manually manage static objects, it looks like objects such as "WALL" are allocated, so you should be OK.
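As a sketch only (the private list has to match whatever pointers are actually assigned inside the loop), WALL_LOOP2 would then look something like:

!$ACC kernels
!$ACC loop independent private(WC, BC, BR)
WALL_LOOP2: DO IW=1,N_EXTERNAL_WALL_CELLS+N_INTERNAL_WALL_CELLS
   ! loop body unchanged
ENDDO WALL_LOOP2
!$ACC end kernels

built with something like mpifort -acc -gpu=managed -Minfo=accel so that the allocatable data is migrated automatically.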

Thank you very much Mat.

(1) As to the WORK1 issue, the GNU mpifort and the NVIDIA mpifort give different results when debugging: the GNU build prints the expected data, as I showed in the last section, while the NVIDIA HPC-SDK 23.7 build does not - in my case the pointer appears to point to an array with one element of zero. The code is the same except for the OpenACC statements (it is in the Source folder), since I want a benchmark to make sure the raw code behaves the same under both compilers. Because we rely heavily on the debuggers (cuda-gdb or gdb) to confirm that the code is working, if the debugger fails to print the expected results I have less confidence that the code will work correctly under a specific compiler. But yes, you are right: when I manually added some print statements to the code, they printed the expected data even when the debugger (cuda-gdb) failed to do so.

(2) As for the OpenACC code, yes, you are correct; I hadn't updated my local changes yesterday. I have now pushed the most recent code to the same repository.

(3) As to making the pointers and arrays private, I have only managed to get one ACC kernels region working with private pointers (if I add arrays to the private list, the compiler either complains or the code core dumps), as shown in ~/Source_ACC/radi.f90, lines 3599 to 3630. If I add private clauses to the other ACC kernels regions (they are all commented out except the major kernels at line 4213), the compiler either complains with "scalar last value needed" warnings or the code core dumps. The current code works well for both input files in ~/firemodels/fds/fds_gpu_test/ACC_P5000/4-rad, namely openmp_test64a.fds (it has nothing to do with OpenMP, it just has that name) and simple_test.fds. If I uncomment any ACC kernels region with pointers and/or arrays (whether or not I put them in the private list), the code core dumps for simple_test.fds (showing: cuEventSynchronize returned error 700: Illegal address during kernel execution) but survives with openmp_test64a.fds.

Let me know if there are any unclear points.

Thank you for your help!

Sincerely,

Honggang Wang.

Just one note: to compare the ACC code, go to ~/firemodels/fds/Build/ompi_gnu_linux_db_ACC and run make*.sh. Thanks.