Globals initialized with managed pointers breaking at runtime

I originally had a problem with a static library dependency that allocated memory in its headers and freed it in the main code, so I had to make sure I could build the full library using nvc++. Then I got strange errors with stack traces like this:

Cuda API error detected: cudaGetFuncBySymbol returned (0x62)
(cuda-gdb) where
#0  0x0000155550173f40 in cudbgReportDriverApiError () from /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/
#1  0x000015555017d691 in ?? () from /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/
#2  0x00001555500ceb66 in ?? () from /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/
#3  0x000015555012fa33 in ?? () from /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/
#4  0x0000155554521025 in cudaGetFuncBySymbol () from /software/sse/manual/nvhpc/21.5-bdist1/Linux_x86_64/21.5/cuda/11.3/lib64/
#5  0x0000155554986cff in __pgi_cuda_get_func_by_symbol (funcHandle=0x7fffffffa920, 
    funcSymbol=0x413060 <__nv__ZN15HaplotypePhaser24CalcSingleScaledBackwardEiPKfPf_F1L518_7()>) at ../../src/cuda_init.c:661
#6  0x00001555549894ea in __pgi_uacc_cuda_load_this_module (dindex=1, error=0, pgi_cuda_loc=0x81a440 <__PGI_CUDA_LOC>) at ../../src/cuda_init.c:1680
#7  0x0000155554989937 in __pgi_uacc_cuda_load_module (dindex=1, error=0) at ../../src/cuda_init.c:1811
#8  0x0000155554ee529c in __pgi_uacc_init_device (dindex=1) at ../../src/init.c:712
#9  0x000015555475ebf7 in do_managed_new(unsigned long, char const*) ()
   from /software/sse/manual/nvhpc/21.5-bdist1/Linux_x86_64/21.5/compilers/lib/
#10 0x000015555475e8c6 in __pgi_managed_array_new () from /software/sse/manual/nvhpc/21.5-bdist1/Linux_x86_64/21.5/compilers/lib/
#11 0x00000000004244a0 in __sti___8_main_cpp_x ()
#12 0x000000000046d9ad in __libc_csu_init ()
#13 0x00001555518b641e in __libc_start_main () from /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/

I realized that this was due to globals being initialized with a call to the managed allocator. In this specific trace, the global is no longer in the library but in the main source file.

I tried to reduce this to a smaller case, and I was somewhat successful:

#include <stdio.h>
// Uncommenting this global (allocated before main() runs) triggers the failure:
//int* z = new int[1048576];

int main()
{
    int* x = new int[1048576];
#pragma omp target teams
    for (int k = 0; k < 1048576; k++)
        x[k] *= 3;
    scanf("%d", x);
    return 0;
}

If this code is compiled with -gpu=cc80,managed,cuda11.0 -mp=gpu -cuda, it succeeds. If I uncomment the line that allocates the global z, it fails with the error message:

FATAL ERROR: Can't find a valid cuda module

If I remove the -cuda flag, it succeeds. I cannot do that for my main code, since it relies on Thrust for some specific code paths. This is not a direct repro of the original error, but I do think the two are related: code executed in global constructors/initializers seems to run before some required runtime initialization has happened. This happens even when all the code is in a single source file, as in my repro.

Thanks Carl. I’m not able to recreate the “cudaGetFuncBySymbol” error, but instead see a seg fault in the OpenMP runtime library under the same circumstances (i.e. OpenMP with managed memory and the -cuda flag). My assumption is that the segv is related, so I have filed a problem report, TPR #30982, and sent it to engineering for investigation.


That’s great. Note that I initially just got a segfault as well. The more detailed error was only visible when launching with cuda-gdb and tracking the CUDA API failures.