Without debug flags: cuMemcpyHtoDAsync returned error 719 for declare create variable

Hi,

I’m looking for advice on how to tell whether I’m running into issues caused by the machine’s software environment and, if not, how to find the underlying cause.

On one specific machine, when I compile without debug flags, I hit an error at the point where the host allocates memory for a variable that appears in an !$acc declare create clause. The same code, built with the same makefile, runs in a Docker environment with matching compiler and CUDA versions (the drivers differ). The nvhpc version is 23.9 with CUDA 12.2 (as reported by nvidia-smi).

When I run the memcheck tool from compute-sanitizer, it provides the backtrace reported below. The offending line is:

allocate(zlak(1:nlevlak))

zlak is a global variable declared in the head of the same module via:

real(r8), allocatable :: zlak(:) 
...
!$acc declare create(zlak(:)      )

When I build a unit test on the same machine, I do not encounter any errors, which may rule out the software environment. From past experience, though, adding more modules/kernels can cause issues.

Thanks

Failing in Thread:1
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to “unspecified launch failure” on CUDA API call to cuCtxSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:cuCtxSynchronize [0x322f2c]
========= in /lib64/libcuda.so.1
========= Host Frame:…/…/src/cuda_error.c:101:__pgi_uacc_cuda_error_handler [0x11214]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_dataup1.c:177:__pgi_uacc_cuda_dataup1 [0xd454]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_static.c:220:__pgi_uacc_cuda_static [0x2a9ac]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_static.c:248:walk_cuda_static [0x2a9f8]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/rbtree.c:408:_rb_walk [0x60de0]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:412:_rb_walk [0x60e2c]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:424:__pgi_uacc_rb_walk [0x60ed8]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/cuda_static.c:295:__pgi_uacc_cuda_static_create [0x2ab2c]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_init.c:2079:__pgi_uacc_cuda_load_this_module [0x28534]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_init.c:2214:pgi_uacc_cuda_load_module [0x28d94]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/init.c:806:pgi_uacc_init_device [0x465f8]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/mirror_allocb.c:61:pgi_uacc_mirror_allocd [0x486b8]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:/autofs/nccs-svm1_home1/user/master-E3SM/E3SM/components/elm/src/main/elm_varcon.F90:285:elm_varcon_elm_varcon_init
[0x243494]
========= in /gpfs/alpine2/cli180/proj-shared/user/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/autofs/nccs-svm1_home1/user/master-E3SM/E3SM/components/elm/src/main/elm_initializeMod.F90:129:elm_initializemod_initialize1
[0x234350]
========= in /gpfs/alpine2/cli180/proj-shared/user/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/autofs/nccs-svm1_home1/user/master-E3SM/E3SM/components/elm/src/cpl/lnd_comp_mct.F90:259:lnd_comp_mct_lnd_init_mct
[0xffc7c]
========= in /gpfs/alpine2/cli180/proj-shared/user/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/ccs/home/user/master-E3SM/E3SM/driver-mct/main/component_mod.F90:257:component_mod_component_init_cc
[0x3cae4]
========= in /gpfs/alpine2/cli180/proj-shared/user/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/ccs/home/user/master-E3SM/E3SM/driver-mct/main/cime_comp_mod.F90:1431:cime_comp_mod_cime_init
[0x22bb8]
========= in /gpfs/alpine2/cli180/proj-shared/user/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/ccs/home/user/master-E3SM/E3SM/driver-mct/main/cime_driver.F90:122:MAIN
[0x3b784]
========= in /gpfs/alpine2/cli180/proj-shared/user/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:…/src-fio/f90main.c:81:main [0x1d364]
========= in /gpfs/alpine2/cli180/proj-shared/user/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe

Hi Peter,

Is the allocate statement the first time an OpenACC construct is encountered (the “declare” directive with an allocatable delays the device data creation until it’s allocated on the host)? The code you show isn’t unusual, so I doubt it’s the problem, but the stack trace seems to indicate that device initialization is taking place, which happens the first time OpenACC is encountered.

While I can’t be sure, the trace seems to indicate that some static data is being created on the device with the crash happening when copying the initial data to the device.

Do you have any other “declare” directives, in particular using static data? If so, try adding these to your unit test to see if it triggers the error.

Any additional information or code you can show may be helpful in tracking down the issue.
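As a rough illustration of the kind of unit test described above, here is a minimal sketch that puts both a fixed-size (static) module array and an allocatable in “declare create” directives. The module name, `dzlak_fixed`, and `init_levels` are hypothetical and only stand in for the real code; `zlak` and `nlevlak` follow the original post. It is meant to be built with something like `nvfortran -acc`, and it also runs as plain Fortran when the directives are ignored.

```fortran
module lake_levels_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)

  ! Fixed-size ("static") module data; hypothetical stand-in for other declares
  real(r8) :: dzlak_fixed(4)
  !$acc declare create(dzlak_fixed)

  ! Allocatable module data, as in the original post
  real(r8), allocatable :: zlak(:)
  !$acc declare create(zlak)

contains

  subroutine init_levels(nlevlak)
    integer, intent(in) :: nlevlak
    integer :: i

    dzlak_fixed = 0.5_r8
    !$acc update device(dzlak_fixed)

    ! With OpenACC enabled, the device copy of zlak is created at this allocate
    allocate(zlak(1:nlevlak))
    do i = 1, nlevlak
      zlak(i) = real(i, r8) * dzlak_fixed(1)
    end do
    !$acc update device(zlak)
  end subroutine init_levels

end module lake_levels_sketch

program declare_unit_test
  use lake_levels_sketch
  implicit none
  real(r8) :: total
  integer :: i

  call init_levels(10)

  ! Sum on the device (or on the host if OpenACC is disabled)
  total = 0.0_r8
  !$acc parallel loop reduction(+:total)
  do i = 1, size(zlak)
    total = total + zlak(i)
  end do

  ! 0.5 * (1 + 2 + ... + 10) = 27.5
  if (abs(total - 27.5_r8) > 1.0e-12_r8) error stop "unexpected sum"
  print *, "sum of zlak =", total
end program declare_unit_test
```

Adding the real declares to a test like this a few at a time may help narrow down which one triggers the error.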

-Mat

Yes, there are a lot of variables in declare create directives, including some large user derived type instances, and the largest “unit-test” I used utilizes most of them (about 400 variables). The large number is due to some subroutines being ported using the routine directive, which could be reduced significantly with some work/thought.

This error occurs in the second initialization routine. The first init routine only initializes scalar variables that do have declare directives, but since those are managed through update directives (later, after all initializations), I’m guessing that doesn’t count as encountering OpenACC directives for the first time.

For the tests, I did put in an explicit “acc_init” routine that isn’t present in the full code. I’ll try adding that to the main code (will need something similar to use multiple gpus anyways).

#ifdef _OPENACC
      call acc_init(acc_device_nvidia)
      ngpus = acc_get_num_devices(acc_device_nvidia)
      if(ngpus == 0) then
           stop "Error NO GPUs detected"
      end if
      print *, "There are ",ngpus," gpus present"
      mygpu = mod(iam,ngpus)
      print *, "I am ",iam," my GPU is ", mygpu, "of",ngpus
      call acc_set_device_num(mygpu,acc_device_nvidia)
#endif

Adding the subroutine from my previous post to the main code throws an error during the call to acc_init(acc_device_nvidia), with the same backtrace as far as I can tell. The backtrace is at the end of this post.

I am compiling with -cuda. There are also some declare copyin directives (do those execute immediately?).

GPU-related compiler flags:
string(APPEND FFLAGS " -gpu=deepcopy -Minfo=accel -acc -cuda ")
string(APPEND LDFLAGS " -Wl,--allow-multiple-definition -L/usr/local/hdf5/lib -acc -cuda")

========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to “unspecified launch failure” on CUDA API call to cuMemcpyHtoDAsync_v2.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:cuMemcpyHtoDAsync_v2 [0x38ed4c]
========= in /lib64/libcuda.so.1
========= Host Frame:…/…/src/cuda_dataup1.c:174:__pgi_uacc_cuda_dataup1 [0xd43c]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_static.c:220:__pgi_uacc_cuda_static [0x2a9ac]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_static.c:248:walk_cuda_static [0x2a9f8]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/rbtree.c:408:_rb_walk [0x60de0]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:412:_rb_walk [0x60e2c]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:424:__pgi_uacc_rb_walk [0x60ed8]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/cuda_static.c:295:__pgi_uacc_cuda_static_create [0x2ab2c]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_init.c:2079:pgi_uacc_cuda_load_this_module [0x28534]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_init.c:2214:pgi_uacc_cuda_load_module [0x28d94]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/init.c:806:pgi_uacc_init_device [0x465f8]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/acc_init.c:76:acc_init
[0x17c74]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:/autofs/nccs-svm1_home1/pschwar3/master-E3SM/E3SM/components/elm/src/cpl/lnd_comp_mct.F90:744:lnd_comp_mct_acc_initialization
[0x1038bc]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/autofs/nccs-svm1_home1/pschwar3/master-E3SM/E3SM/components/elm/src/cpl/lnd_comp_mct.F90:176:lnd_comp_mct_lnd_init_mct
[0xff264]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/ccs/home/pschwar3/master-E3SM/E3SM/driver-mct/main/component_mod.F90:257:component_mod_component_init_cc
[0x3cae4]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/ccs/home/pschwar3/master-E3SM/E3SM/driver-mct/main/cime_comp_mod.F90:1431:cime_comp_mod_cime_init
[0x22bb8]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/ccs/home/pschwar3/master-E3SM/E3SM/driver-mct/main/cime_driver.F90:122:MAIN
[0x3b784]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:…/src-fio/f90main.c:81:main [0x1d364]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:generic_start_main.isra.0 [0x29f5c]
========= in /lib64/glibc-hwcaps/power9/libc-2.28.so
========= Host Frame:__libc_start_main [0x2a0f4]
========= in /lib64/glibc-hwcaps/power9/libc-2.28.so
=========
Failing in Thread:1
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to “unspecified launch failure” on CUDA API call to cuCtxSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:cuCtxSynchronize [0x322f2c]
========= in /lib64/libcuda.so.1
========= Host Frame:…/…/src/cuda_error.c:101:__pgi_uacc_cuda_error_handler [0x11214]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_dataup1.c:177:__pgi_uacc_cuda_dataup1 [0xd454]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_static.c:220:__pgi_uacc_cuda_static [0x2a9ac]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_static.c:248:walk_cuda_static [0x2a9f8]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/rbtree.c:408:_rb_walk [0x60de0]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:412:_rb_walk [0x60e2c]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:404:_rb_walk [0x60da4]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/rbtree.c:424:__pgi_uacc_rb_walk [0x60ed8]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/cuda_static.c:295:__pgi_uacc_cuda_static_create [0x2ab2c]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_init.c:2079:pgi_uacc_cuda_load_this_module [0x28534]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/cuda_init.c:2214:pgi_uacc_cuda_load_module [0x28d94]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacccuda.so
========= Host Frame:…/…/src/init.c:806:pgi_uacc_init_device [0x465f8]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:…/…/src/acc_init.c:76:acc_init
[0x17c74]
========= in /autofs/nccs-svm1_sw/summit/nvhpc_sdk/Linux_ppc64le/23.9/compilers/lib/libacchost.so
========= Host Frame:/autofs/nccs-svm1_home1/pschwar3/master-E3SM/E3SM/components/elm/src/cpl/lnd_comp_mct.F90:744:lnd_comp_mct_acc_initialization
[0x1038bc]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/autofs/nccs-svm1_home1/pschwar3/master-E3SM/E3SM/components/elm/src/cpl/lnd_comp_mct.F90:176:lnd_comp_mct_lnd_init_mct
[0xff264]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/ccs/home/pschwar3/master-E3SM/E3SM/driver-mct/main/component_mod.F90:257:component_mod_component_init_cc
[0x3cae4]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/ccs/home/pschwar3/master-E3SM/E3SM/driver-mct/main/cime_comp_mod.F90:1431:cime_comp_mod_cime_init
[0x22bb8]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:/ccs/home/pschwar3/master-E3SM/E3SM/driver-mct/main/cime_driver.F90:122:MAIN
[0x3b784]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:…/src-fio/f90main.c:81:main [0x1d364]
========= in /gpfs/alpine2/cli180/proj-shared/pschwar3/e3sm_runs/gpu_MOF21x100/run/…/bld/e3sm.exe
========= Host Frame:generic_start_main.isra.0 [0x29f5c]
========= in /lib64/glibc-hwcaps/power9/libc-2.28.so
========= Host Frame:__libc_start_main [0x2a0f4]
========= in /lib64/glibc-hwcaps/power9/libc-2.28.so
=========
Accelerator Fatal Error: call to cuMemcpyHtoDAsync returned error 719: Launch failed (often invalid pointer dereference)

========= Target application returned an error
========= ERROR SUMMARY: 2 errors

Likely the problem is with the implicit deep copy (i.e. -gpu=deepcopy) of the static UDTs that are put in a “declare create” in the module. The creation of the device side static UDT gets triggered on initialization and I’m guessing that at this point that UDT has allocatables which haven’t been allocated yet. As the runtime does the “rb_walk” through the various levels of the UDT, there might be some garbage in the uninitialized UDT which then causes the error.

First, try compiling without -gpu=deepcopy and see if the error goes away. You’ll likely have other issues later, but this should confirm if my guess is correct.

If this does get past the error, next remove the UDTs from the “declare” directive and instead move them to an unstructured data region (i.e. “enter data copyin(myudt)”) someplace after the host side UDT has been initialized and members have been allocated.
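A minimal sketch of that second step, assuming a hypothetical derived type `column_type` with one allocatable member (none of these names come from the actual code). The UDT is not in a “declare” directive. It is copied to the device only after the host side has been allocated and initialized, and the member is copied explicitly since this sketch does not rely on -gpu=deepcopy.

```fortran
module column_data_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)

  type :: column_type
    integer :: n = 0
    real(r8), allocatable :: temp(:)
  end type column_type

  ! Hypothetical module-level UDT instance; note: no "declare" directive
  type(column_type), save :: col

contains

  subroutine col_init(n)
    integer, intent(in) :: n
    ! Host side first: allocate and initialize the members
    col%n = n
    allocate(col%temp(n))
    col%temp = 273.15_r8
    ! Then create the device copy: parent first, then the allocatable member
    !$acc enter data copyin(col)
    !$acc enter data copyin(col%temp)
  end subroutine col_init

  subroutine col_final()
    ! Remove the device copies in reverse order before freeing host memory
    !$acc exit data delete(col%temp)
    !$acc exit data delete(col)
    deallocate(col%temp)
  end subroutine col_final

end module column_data_sketch

program udt_enter_data_test
  use column_data_sketch
  implicit none
  integer :: i

  call col_init(8)

  !$acc parallel loop present(col)
  do i = 1, col%n
    col%temp(i) = col%temp(i) + 1.0_r8
  end do

  ! Bring the result back to the host and check it
  !$acc update self(col%temp)
  if (any(abs(col%temp - 274.15_r8) > 1.0e-9_r8)) error stop "unexpected value"
  print *, "col%temp(1) =", col%temp(1)

  call col_final()
end program udt_enter_data_test
```

Because the device copy is created after the host members exist, nothing uninitialized needs to be walked at device initialization.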

-Mat

Removing -gpu=deepcopy does not change the behavior. I don’t understand why an unallocated variable would be an issue for declare create, since I thought allocating memory was a core function of that directive. Is that enough to rule out deepcopy as being related at all? Would setting any NV_ACC_DEBUG variables potentially be helpful?

I suppose the deepcopy feature can be buggy, but the bugs I’ve encountered usually show up when I try to update or copyin the derived type: I get a runtime error saying “encountered unexpected name” (going from memory) and a dump of the present table.

I’ll experiment with removing some of the declare statements, as maybe one of the more complicated ones is causing issues.

Thanks for the ideas

It’s very possible that you’re encountering a bug, so if you are able to pull together a reproducer, I can report it to our team.

Well after lots of commenting out of things, I can get past the device init and run into an expected failure when transferring data. Hopefully I can trace it to a specific variable and/or kernel for simple re-factoring.

As far as bug reporting goes, this is on a machine (ppc64le) that is going away in the middle of November, and it’s an older compiler version (23.9), so I’m not going to invest a lot of time in refactoring/debugging, as it doesn’t appear to be an issue on other machines. We just wanted some more profiling data using a different dataset while we can.

Thanks for talking this through with me and giving me a concrete place to start.