How to map a private dynamic array to the GPU with OpenMP and nvc?

I would like to create a dynamic float array, with the length computed at runtime for each thread (private), and use it inside a #pragma omp target teams distribute parallel for loop. Here is a code snippet:

#include <stdlib.h>
#include <string.h>

int main(int argn, char *argv[]) {
    int buflen =  atoi(argv[1]);
    float *buf = (float *)calloc(sizeof(float), buflen);

    #pragma omp target teams distribute parallel for map(tofrom: buf[0:buflen]) private(buf)
    for (int i = 0; i < 10; i++) {
         buf[0]++;    // buf is now a thread-private buffer, initialized from host
    }
    free(buf);
    return 0;
}

Building this with nvc++

nvc++  -g -Wall -Wextra -pedantic -O3 -mp=gpu -Minfo=mp,accel -Minline -gpu=mem:managed -static-nvidia -fopenmp  test.c -o test

"test.c", line 8: error: buf is used in multiple data sharing clauses

If I change the omp pragma to

    #pragma omp target teams distribute parallel for private(buf[0:buflen])

then, nvc complains that

"test.c", line 8: error: expected a ")"
      #pragma omp target teams distribute parallel for private(buf[0:buflen])

I also tried moving calloc/free inside the for-loop, but then nvc gives the following error:

main:
      5, #omp target teams distribute parallel for
          5, Generating "nvkernel_main_F1L5_2" GPU kernel
          7, Loop parallelized across teams and threads(128), schedule(static)
nvlink error   : Undefined reference to 'calloc' in '/tmp/nvc++dB4wnj9vADSdI.o'

Replacing calloc/free with new float[]{} and delete [], as below,

#include <stdlib.h>
#include <string.h>

int main(int argn, char *argv[]) {
    int buflen =  atoi(argv[1]);
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < 10; i++) {
        float *buf = new float[buflen] {};
        buf[0]++;    // buf is now a thread-private buffer, initialized from host
        delete [] buf;
    }
    return 0;
}

then nvc gives the following error:

      8, Accelerator restriction: call to '__cxa_throw_bad_array_new_length' with no omp declare target information

What is the correct way to map a dynamic private array to the GPU?

Hi FangQ,

It's my understanding that, unlike OpenACC, you can't define the range of an array as part of a private clause, only the pointer itself. Hence, you'd need to allocate the private array on the device using malloc. "new" shouldn't be used given that it creates a class object including, in this case, an unsupported exception handler. Example below.

Note that device-side allocation can be problematic given that mallocs get serialized and the default device heap size is quite small. Hence, if you start encountering illegal address errors, it's likely a heap overflow. You can increase the heap size either by calling cudaDeviceSetLimit or by setting the environment variable NV_ACC_CUDA_HEAPSIZE. There's no limit on the heap size other than what's available on the device.
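
For instance, a minimal host-side sketch of raising the heap limit before the first target region runs (this assumes the CUDA runtime API is available to the host code, e.g., via nvc++'s -cuda option; the 256 MB value is only illustrative):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstddef>

int main() {
    // Raise the device-side malloc heap to 256 MB before any kernels launch.
    size_t heap_bytes = (size_t)256 << 20;
    if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes) != cudaSuccess) {
        std::fprintf(stderr, "failed to set the device heap size\n");
        return 1;
    }
    // ... launch the target region that performs the device-side mallocs ...
    return 0;
}

The environment variable route needs no code change, e.g., NV_ACC_CUDA_HEAPSIZE=256MB ./a.out 100000.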

Example:

% cat test.cpp
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(int argn, char *argv[]) {
    int buflen =  atoi(argv[1]);
    float *buf;

    #pragma omp target teams distribute parallel for private(buf)
    for (int i = 0; i < 10; i++) {
         buf = (float*) malloc(sizeof(float)*buflen);
         for (int j=0; j<buflen; ++j) {
             buf[j] = j + (i*10);
         }
         printf("i=%d, buf[3]=%f\n",i,buf[3]);
         free(buf);
    }
    return 0;
}
% nvc++ -mp=gpu test.cpp; a.out 12
i=0, buf[3]=3.000000
i=1, buf[3]=13.000000
i=2, buf[3]=23.000000
i=3, buf[3]=33.000000
i=4, buf[3]=43.000000
i=5, buf[3]=53.000000
i=6, buf[3]=63.000000
i=7, buf[3]=73.000000
i=8, buf[3]=83.000000
i=9, buf[3]=93.000000

Note that if the algorithm allows, it's often better to use split-level parallelism so the private array is private to the team but shared by the threads. This reduces the amount of memory used and helps with performance since the allocation is only done once per team.

% cat test.cpp
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(int argn, char *argv[]) {
    int buflen =  atoi(argv[1]);
    float *buf;

    #pragma omp target teams distribute private(buf)
    for (int i = 0; i < 10; i++) {
         buf = (float*) malloc(sizeof(float)*buflen);
         #pragma omp parallel for
         for (int j=0; j<buflen; ++j) {
             buf[j] = j + (i*10);
         }
         printf("i=%d, buf[3]=%f\n",i,buf[3]);
         free(buf);
    }
    return 0;
}
% nvc++ -mp=gpu test.cpp ; a.out 12
i=4, buf[3]=43.000000
i=6, buf[3]=63.000000
i=7, buf[3]=73.000000
i=8, buf[3]=83.000000
i=0, buf[3]=3.000000
i=1, buf[3]=13.000000
i=3, buf[3]=33.000000
i=9, buf[3]=93.000000
i=2, buf[3]=23.000000
i=5, buf[3]=53.000000

-Mat


Thank you @MatColgrove again for your prompt and helpful reply.

It appears that malloc()+memset() does work; calloc() is not supported.
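
For reference, the zero-initialized variant I ended up with looks roughly like this (a minimal sketch modeled on your earlier example, with memset standing in for the calloc zero-fill):

#include <stdlib.h>
#include <string.h>

int main(int argn, char *argv[]) {
    int buflen = atoi(argv[1]);
    float *buf;

    #pragma omp target teams distribute parallel for private(buf)
    for (int i = 0; i < 10; i++) {
        buf = (float *)malloc(sizeof(float) * buflen);   // device-side allocation
        memset(buf, 0, sizeof(float) * buflen);          // zero-fill replaces calloc
        buf[0]++;                                        // thread-private, zero-initialized buffer
        free(buf);
    }
    return 0;
}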

I noticed that this dynamic allocation of a thread-private buffer does work with nvc, but it produced a significant performance hit in my program - slowing it down by 5x to 10x compared to a statically sized local array, say float buf[10] = {0.f};

You can see this significant speed difference by checking out the latest version of the code that I used in another thread.

Here is how I measured the performance hit.

If you download the latest code from GitHub - fangq/umcx: micro mcx and compile it using make nvc, it uses the static array. Then run the benchmark:

git clone https://github.com/fangq/umcx.git
cd umcx/src
make clean
make nvc  # use static private array
../bin/umcx --bench cube60 -n 1e7

You can see that the above simulation runs relatively fast (it is about 3x slower than my tests in an earlier benchmark because the templates are not working in nvc).

However, if you recompile it to use malloc/memset:

make clean
make nvc USERCXXFLAGS=-DUSE_MALLOC
../bin/umcx --bench cube60 -n 1e7

You can see that the simulation is now 5x to 10x slower.

I am wondering if you can suggest an alternative approach to create a thread-private buffer that does not have such a high overhead?

Also, as you can see in my code, I attempted to use templates to generate faster specialized kernels, but regardless of what flags I use, the nvc-compiled binary always uses the same number of registers and has the same speed. This is different from my CUDA version, where the simplified kernel does show a significant speed improvement. Using the -Xptxas -O3,-v nvcc flag I can actually see reduced register counts.

any comment on that would also be appreciated. thanks

Update: when I add ptxinfo to the -gpu flag, I see that ptxas info only prints a single kernel:

ptxas info    : 40 bytes gmem
ptxas info    : Compiling entry function 'nvkernel__Z18MCX_run_simulationPPci_F1L658_2' for 'sm_89'
ptxas info    : Function properties for nvkernel__Z18MCX_run_simulationPPci_F1L658_2
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 66 registers, used 1 barriers, 24 bytes smem, 472 bytes cmem[0], 32 bytes cmem[2]

Is this because the entire for-loop, including all my templated versions, is combined into a single kernel?

Not unexpected. Again, the mallocs get serialized, which can have a detrimental impact on performance. It's generally advised to avoid device-side allocation if possible.

There may be a better way to do this in OpenMP, but I'm not aware of one. You might consider switching to OpenACC, where you can add the range. In this case the compiler will implicitly allocate a block of memory (size of the array x number of private copies) prior to launching the kernel.

Note that "private" arrays are not initialized. Instead, consider using "firstprivate", which will initialize each private copy of the array using the host array's values. However, there is some extra overhead as the data needs to be copied to the device. It may be better to use "private" and then initialize the private arrays on the device.
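
A minimal sketch contrasting the two options (using the same buf/buflen setup as the earlier examples; the loop bodies are just placeholders):

#include <stdlib.h>

int main(int argn, char *argv[]) {
    int buflen = atoi(argv[1]);
    float *buf = (float *)calloc(sizeof(float), buflen);

    // Option 1: firstprivate copies the host contents into every private copy.
    #pragma acc parallel loop firstprivate(buf[0:buflen])
    for (int i = 0; i < 10; i++) {
        buf[0] += i;                       // starts from the host values
    }

    // Option 2: private gives uninitialized copies; initialize them on the device.
    #pragma acc parallel loop private(buf[0:buflen])
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < buflen; ++j) buf[j] = 0.0f;
        buf[0] += i;
    }

    free(buf);
    return 0;
}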

Register allocation is performed by ptxas, so it's not something directly under the control of the compiler. Register usage is often determined by the number of live local variables as well as the temps used to hold address calculations, so reducing the number of local variables or splitting large kernels into several smaller kernels can help. Subroutine calls can also drastically increase register usage, but all indications are that your subroutines are getting inlined, so that's unlikely to be a problem.

You can use the flag "-gpu=maxregcount:" to set the maximum number of registers ptxas is allowed to use, though this may result in "spilling" to local memory. I typically only use this if the count is right on the border, like 65, and reducing it to 64 increases the occupancy. Too much spilling can have a severe negative impact.
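
For example (the 64 here is only illustrative):

% nvc++ -mp=gpu -gpu=maxregcount:64,ptxinfo test.cpp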

Here's the OpenACC version of the earlier example:

% cat test.acc.cpp
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(int argn, char *argv[]) {
    int buflen =  atoi(argv[1]);
    float *buf = (float *)calloc(sizeof(float), buflen);

//    #pragma omp target teams distribute private(buf)
    #pragma acc parallel loop private(buf[:buflen])
    for (int i = 0; i < 10; i++) {
         for (int j=0; j<buflen; ++j) {
             buf[j] = j + (i*10);
         }
         printf("i=%d, buf[3]=%f\n",i,buf[3]);
    }
    free(buf);
    return 0;
}
% nvc++ -acc=gpu test.acc.cpp -o acc.exe; a.out  12
i=0, buf[3]=3.000000
i=1, buf[3]=13.000000
i=2, buf[3]=23.000000
i=3, buf[3]=33.000000
i=5, buf[3]=53.000000
i=9, buf[3]=93.000000
i=4, buf[3]=43.000000
i=6, buf[3]=63.000000
i=8, buf[3]=83.000000
i=7, buf[3]=73.000000

I should also mention that the flag "-gpu=ptxinfo" will have ptxas display a report of the number of registers used, as well as spills and fixed-size shared memory usage.


Thanks again for your detailed reply.

Regarding the template question I asked earlier, I realized that if I enclose all versions of the template calls to my simulation inside a single omp target teams distribute parallel for section, then nvc only produces a single GPU kernel, and it still compiles all the code execution paths without being able to simplify any of them.

I managed to solve this problem by creating a templated version of the host function, which contains the omp target section, like this:

template<const bool isreflect, const bool issavedet>
void MCX_kernel(...) {
    #pragma omp target teams distribute parallel for
    for (uint64_t i = 0; i < nphoton; i++) {
         p.run<isreflect, issavedet>(...);
    }
}

By doing this, I found that nvc is able to build one kernel for each templated variation and reduce the register count for the simpler kernels.
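
For context, the host-side dispatch is roughly the following (a sketch only; the argument list is elided just like above):

// each branch instantiates a distinct specialization, so nvc emits one GPU kernel per combination
if (isreflect) {
    if (issavedet) MCX_kernel<true, true>(...);
    else           MCX_kernel<true, false>(...);
} else {
    if (issavedet) MCX_kernel<false, true>(...);
    else           MCX_kernel<false, false>(...);
}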

A side effect, however, is that gcc started complaining:

umcx.cpp:625:31: error: ā€˜inputvolā€™ appears more than once in map clauses
  625 |     map(to: inputvol) map(to: inputvol.vol[0:inputvol.dimxyzt])

even though it was OK when compiling the code without the templated MCX_kernel.

This is helpful. I thought that OpenACC features had been "merged" into OpenMP 5.x, but it sounds like OpenACC is still more developed than OpenMP and is a few steps ahead.

I remember seeing code with mixed use of OpenMP/OpenACC directives; I am wondering if it is possible to add an OpenACC line just to handle this private array?

If not, I agree that switching to OpenACC would be the viable alternative (or using the static array and limiting the size).

By the way, I tried replacing the static version float detphotonbuffer[10] = {0.f} with

float detphotonbuffer[detdata.ppathlen] = {0.f}

nvc compiles it OK, but shows the message

641, Accelerator restriction: unsupported statement type: opcode=DEALLOC

When running the compiled binary, it crashed with the following error:

Accelerator Fatal Error: Failed to find device function 'nvkernel__Z10MCX_kernelILb1ELb1EEdRN8nlohmann16json_abi_v3_11_3...'! File was compiled with: -gpu=cc89
Rebuild this file with -gpu=cc89 to use NVIDIA Tesla GPU 0

Following your suggestion, I ventured into writing my first OpenACC code.

I made the following change so that it can support both OMP and ACC.

I have two questions:

  1. When building my acc version with nvc++, I got a lot of errors, most of which come from the fact that nvc is not happy with some of the complex C++ classes, such as the json/std::map stuff. In my OMP code, I intentionally isolated these complex classes from the GPU-offloaded code by only letting the GPU handle the omp declare target/end declare target sections; but with nvc+OpenACC, how do I mark these GPU code boundaries?

  2. Another minor issue is that I want to write a macro, such as

#ifndef _OPENACC
    #define _PRAGMA_OMPACC_(settings)   _Pragma("omp " #settings))
#else
    #define _PRAGMA_OMPACC_(settings)   _Pragma("acc " #settings))
#endif

so that I can use _PRAGMA_OMPACC_(atomic capture) to expand to _Pragma("omp atomic capture") or the acc version depending on the compilation flag; however, the above macro currently does not work. I tried removing the double quotes, but it still won't let me do what I wanted. I saw you had a previous comment on this. Can that be extended to attach omp/acc depending on defined(_OPENACC)?

It's likely a compiler error related to VLAs. I'm not sure what compiler version you're using, but we have fixed similar issues, so consider updating the compiler. However, it's highly recommended to avoid using VLAs in device code. They are implicitly allocated, so you run into the same kind of performance issues as you did with the explicit mallocs.

I'll look at the OpenACC version in a few.


For #1, I'm not seeing these errors, though that may be a difference in the compiler version or the GNU STL being used. The one error I am seeing is that the compiler has issues with the length of the firstprivate array being defined with a class member, which I was able to work around by replacing it with an int. Also, I'm not sure why you're using "deviceptr". This is for defining CUDA device pointers, which I don't see in your code, so I removed them:

#ifdef _OPENACC
    int ppathlen = detdata.ppathlen;   // define a local int for the path length
    float* detphotonbuffer = (float*)calloc(sizeof(float), detdata.ppathlen);
#endif
..
#pragma acc parallel loop gang num_gangs(gridsize) vector_length(blocksize) \
    reduction(+ : energyescape) firstprivate(ran, p) copyin(gcfg, inputvol, detdata) \
    copyin(prop[0:gcfg.mediumnum], detpos[0:gcfg.detnum], inputvol.vol[0:inputvol.dimxyzt]) \
    copy(outputvol, outputvol.vol[0:outputvol.dimxyzt]) \
    copy(detdata, detdata.detphotondata[0:totaldetphotondatalen]) \
    firstprivate(detphotonbuffer[0:ppathlen])

Also, I'm not clear on your description of the error. A declare target region defines which routines need to have device versions created. The host versions are still created as well, so it doesn't isolate them as device-only. The OpenACC way to do this is to decorate each routine with the "routine" directive; however, since the definitions are visible from the scoping unit they are called in, the compiler can implicitly generate the device versions. You can still decorate them, but it shouldn't be necessary.

Now I do see a second error where a device version of the runtime routine "__pgi_managed_delete" can't be found. When "managed" is used, the compiler replaces "new" and "delete" with calls to the managed versions. This means a delete is getting into a device routine. It looks to be coming from "std::map", which likely has a delete in it. Maybe this is what you're describing?

If I have time later (or you can investigate), I'll try turning off the auto-generation of device routines (-acc=noroutineseq) and then explicitly adding the "routine" directive as needed.
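
For reference, decorating a helper explicitly looks roughly like this (scale and run here are just made-up example routines):

#pragma acc routine seq
static float scale(float v, float s) { return v * s; }

void run(float *a, int n, float s) {
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] = scale(a[i], s);   // device version comes from the routine directive
    }
}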

For the macro question, here's how I'd use it:

% cat macro.cpp
#define PRAGMA(x) _Pragma(#x)
#ifndef _OPENACC
    #define _PRAGMA_OMPACC_(settings)   PRAGMA(omp settings)
#else
    #define _PRAGMA_OMPACC_(settings)   PRAGMA(acc settings)
#endif
_PRAGMA_OMPACC_(capture)
% nvc++ -P --no_preincludes macro.cpp ; cat macro.i
_Pragma("omp capture")
% nvc++ -P --no_preincludes macro.cpp -D_OPENACC ; cat macro.i
_Pragma("acc capture")

Note that I'm only excluding the pre-include files to make the output easier to read; it's not something you want to add to your compilation.


Thank you @MatColgrove again. Your comments are really helpful, as always!

The version of nvc/nvc++ is 24.11-0. I run my tests on an Ubuntu 22.04 Linux server (Threadripper 3990X + 2080S), as well as another server with a 3090 and a 4090. The CUDA bundled with nvc is CUDA 12.6.

I believe I have fixed the compilation errors I previously encountered (the full error log can be downloaded here). The problem was caused by a small pragma acc parallel loop section inside the MCX_userio class, which uses STL and json objects; this forces nvc to build many STL/json.hpp functions even though the section is meant to be host-only (in omp, I intentionally do not use target there to avoid GPU offloading, but in acc, I don't know how to force it not to offload). In comparison, with omp, nvc++ also analyzed MCX_userio::initdomain(), but somehow was not bothered by the STL and json.hpp (see full log).

After removing this acc parallel loop, together with other changes you suggested, nvc is finally able to build my code with

make nvc BACKEND=acc

When running the compiled binary, using this dynamic private buffer (detphotonbuffer) produces no noticeable runtime overhead, similar to the static array (which has a fixed size). I also verified that the data in the output files look correct.

The only issue I noticed is that the reduction variable, energyescape, produces incorrect results. When running the cube60 benchmark (../bin/umcx --bench cube60), the "absorbed ..." value at the end of the printed log should be ~17.7%, but now it is 98%; similarly, for the cube60b benchmark, the output does not match the omp output.

What else should I add to make this reduction work in acc?

Another follow-up question: I don't see copyin(pos, dir, seeds) in your acc directives; did nvc also find these automatically?

I don't know if this is a compiler bug: when I compiled this commit in the openacc branch using make nvc BACKEND=acc, which runs this command,

nvc++  -g -Wall -Wextra -pedantic -std=c++11 -O3 -Minfo=mp,accel -Minline -gpu=mem:managed,ptxinfo -static-nvidia -DGPU_OFFLOAD -acc=gpu  -c umcx.cpp -o umcx.o

it produced the following internal compiler error, pointing to this innocent-looking line (#639):

NVC++-F-0000-Internal compiler error. Unexpected operation: eok_ref_indirect       0  (umcx.cpp: 639)
NVC++/x86-64-Extractor Linux 24.11-0: compilation aborted
make: *** [Makefile:52: umcx.o] Error 2

PS: never mind, I see your comments on the rationale for defining int ppathlen. I believe the above error is a side effect of the firstprivate handling. After defining int ppathlen, the error is gone.

I don't see copyin(pos, dir, seeds) in your acc directives; did nvc also find these automatically?

I didn't intentionally leave them off; they were just included only in the deviceptr clause, which I deleted. But yes, the compiler would implicitly copy them, though I re-added them.

For the ICE, yes, that's what I was seeing, and it is fixed by the workaround.

For the wrong answer, it's not the reduction; it's the use of a variable for the vector_length. Setting this to a fixed "64" works around the error. I missed it before, but it makes sense: unlike the OpenMP "distribute", which is outlined and delays the scheduling until runtime, the OpenACC schedule is done at compile time, so the vector length needs to be known at compile time.

The following works for me. Note that I get better performance if I don't set the number of gangs. Also, the number of gangs is set at runtime, so if you do use num_gangs, it can be a variable.

//#pragma acc parallel loop gang vector num_gangs(gridsize) vector_length(64)
#pragma acc parallel loop gang vector vector_length(64)  \
    reduction(+ : energyescape) firstprivate(ran, p) copyin(gcfg, inputvol, detdata) \
    copyin(prop[0:gcfg.mediumnum], detpos[0:gcfg.detnum], inputvol.vol[0:inputvol.dimxyzt]) \
    copyin(outputvol), copy(outputvol.vol[0:outputvol.dimxyzt]) \
    copyin(detdata), copy(detdata.detphotondata[0:totaldetphotondatalen]) \
    firstprivate(detphotonbuffer[0:ppathlen]) copyin(pos, dir, seeds)

With your suggestions on the macros, I was able to combine both omp and acc directives in my main branch (now built with make nvc ACC=on). I am very pleased to see how interchangeable ACC and OMP directives are.

Can you say a few words about what goes wrong when someone sets vector_length with a variable? Does that affect all quantities in the loop or just the reduction variable? Using ncu, I see that both the gridsize and blocksize are correctly set according to my -t (total thread count) and -T (block size) flags at runtime.

When I remove num_gangs and force vector_length=64, I see that nvc++ launches #photons (1e6 to 1e8) threads, which is much larger than the default maximum thread count (100000) I previously used. On your high-end H100 GPU, this increased thread count may give better hardware utilization, which would explain the better speed. On my 2080, removing num_gangs is about 2x slower.

Now, I'm speculating a bit here and basing my conclusion on what I'm observing; I've not consulted engineering, so I may be slightly off. But what I'm seeing is that several of the iterations are getting skipped, which in turn causes the reduction to not sum all the values. This is likely because the variable vector length causes bad code generation.

It could be a compiler issue, and if I have time I'll try to dig further, but I mostly wanted to get you a quick workaround so you weren't blocked.

On my 2080, removing num_gangs is about 2x slower.

I always recommend using "num_gangs" only as a tuning parameter, which in this case seems like the best option for your target device. Just keep in mind that you may need to revisit it if you change targets.


Thank you again for sharing these insights.

Last night, I prepared some results comparing the speed of my new code (umcx) using nvc+omp and nvc+acc against the CUDA code (mcx).

I am attaching my chart at the end. The findings are similar to what I showed previously: the omp/acc performance is similar to CUDA on the newest GPU, but CUDA seems to work better on older GPUs (now that I have implemented comparable features). Still, all of them are way faster than the CPU.

I noticed an unexpected error. For each of the benchmarks, I used the -t flag to specify a total thread count matching the thread count computed in the CUDA version; with nvc+omp, everything worked out fine.

However, with the nvc+acc binary on the 4090, I ran ../bin/umcx --bench cube60 -n 5e8 -t 524288 to ask it to simulate using the 524288 threads that CUDA used, but the acc binary crashed with the following error:

Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
 File: .../umcx/src/umcx.cpp
 Function: _Z10MCX_kernelILb0ELb1EEdRN8nlohmann16json_abi_v3_11_310basic_jsonINS1_11ordered_mapESt6vectorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNS1_14adl_serializerES4_IhSaIhEEvEERK9MCX_paramR10MCX_volumeIiERSJ_IfEP6float4P10MCX_mediumR10MCX_detect:707
 Line: 755

I don't know if this is a compiler bug, or if the dynamic buffer somehow exhausts the system resources at this thread count. If I reduce it slightly, say to 520336, it runs OK. Again, this only happens with the ACC version, and I don't see this error on the other machines I tried.


My best guess is that the total combined size of the private arrays is getting above 2GB. The compiler allocates the memory as one large block and then partitions the block amongst the threads.
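
For a rough sense of scale: 524288 private copies x 1024 floats x 4 bytes is already 2 GiB (524288 x 1024 x 4 = 2,147,483,648 bytes), so a per-thread buffer of more than about a thousand floats would cross that threshold at this thread count.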

Try adding "-Mlarge_arrays", which allows 64-bit offsets. There's a slight overhead for this, which is why it's not on by default, but it shouldn't be much.

If you're still doing dynamic allocation on the device, then it could also be a heap overflow. You can increase the heap size via the env var "NV_ACC_CUDA_HEAPSIZE".


I tried setting NV_ACC_CUDA_HEAPSIZE=512MB and adding -Mlarge_arrays, separately and together, but the error remains.

Using compute-sanitizer, I saw the following curious error message (repeated many times):

========= Invalid __global__ write of size 1 bytes
=========     at double MCX_kernel_726_gpu<(bool)0, (bool)1>(nlohmann::json_abi_v3_11_3::basic_json<nlohmann::json_abi_v3_11_3::ordered_map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char>>, void> &, const MCX_param &, MCX_volume<int> &, MCX_volume<float>&, float4 *, MCX_medium *, MCX_detect &)+0x660 in /drives/taote1/users/fangq/git/Project/github/umcx/src/umcx.cpp:726
=========     by thread (3,0,0) in block (386,0,0)
=========     Address 0x743c03943a30 is out of bounds
=========     and is 226,865 bytes after the nearest allocation at 0x743c038fc000 of size 66,560 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x33255f]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cuda_launch3 in ../../src/cuda_launch.c:822 [0x98537]
=========                in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
=========     Host Frame:__pgi_uacc_cuda_launch3 in ../../src/cuda_launch.c:1015 [0x9a789]
=========                in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
=========     Host Frame:__pgi_uacc_cuda_launch in ../../src/cuda_launche.c:169 [0xbbed0]
=========                in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
=========     Host Frame:__pgi_uacc_launch in ../../src/launch.c:54 [0xa3994]
=========                in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
=========     Host Frame:double MCX_kernel<false, true>(nlohmann::json_abi_v3_11_3::basic_json<nlohmann::json_abi_v3_11_3::ordered_map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> >, void>&, MCX_param const&, MCX_volume<int>&, MCX_volume<float>&, float4*, MCX_medium*, MCX_detect&) in /drives/taote1/users/fangq/git/Project/github/umcx/src/umcx.cpp:726 [0xe2da]
=========                in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
=========     Host Frame:MCX_run_simulation(char**, int) in /drives/taote1/users/fangq/git/Project/github/umcx/src/umcx.cpp:801 [0x8dba]
=========                in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
=========     Host Frame:main in /drives/taote1/users/fangq/git/Project/github/umcx/src/umcx.cpp:829 [0x14054]
=========                in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
=========     Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x29d8f]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main in ../csu/libc-start.c:392 [0x29e3f]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x6ca4]
=========                in /drives/taote1/users/fangq/git/Project/github/umcx/src/../bin/umcx
========= 
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 719 (CUDA_ERROR_LAUNCH_FAILED): Launch failed (often invalid pointer dereference)
 File: /drives/taote1/users/fangq/git/Project/github/umcx/src/umcx.cpp
 Function: _Z10MCX_kernelILb0ELb1EEdRN8nlohmann16json_abi_v3_11_310basic_jsonINS1_11ordered_mapESt6vectorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNS1_14adl_serializerES4_IhSaIhEEvEERK9MCX_paramR10MCX_volumeIiERSJ_IfEP6float4P10MCX_mediumR10MCX_detect:707
 Line: 755

========= Target application returned an error
========= ERROR SUMMARY: 57282 errors
========= ERROR SUMMARY: 57182 errors were not printed. Use --print-limit option to adjust the number of printed errors

It points to a 1-byte global memory write error at line #726 of my code.

Interestingly, this line also came up in a previous post, where your fix was defining int ppathlen. I wonder if this thread-count issue has any connection to that earlier issue.

Anyway, it is a minor problem. I only bring it up here in case it is helpful for debugging the compiler.


I know I have asked a lot of questions (and learned a lot from you about nvc and acc). I have two minor questions and would like to get your opinions:

  1. What is the best way to produce a portable binary that can be executed across multiple generations of GPUs for deployment? With my CUDA code, I have been using -arch=sm_30 or whatever is the lowest arch supported by the CUDA version.
  2. If I also want the binary to be independent of the local GLIBC version, is there a flag to statically link libgcc and libstdc++? I've already used -static-nvidia. When I add -static to the nvc link line, it won't link with libcuda.

It looks like the error did change. Before it was an illegal address error and now it's an out-of-bounds error. Granted, you can get an illegal address error if the access is out-of-bounds, so it might be related.

It could really be an out-of-bounds access, so what I typically do is compile the code for the host and then run it through Valgrind's memcheck. It may or may not be helpful, but it might give some clues.
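
Something along these lines (the build command is only a placeholder for however you produce a host-only binary, i.e., without -acc=gpu or -mp=gpu):

% nvc++ -g -O0 umcx.cpp -o umcx_host
% valgrind ./umcx_host --bench cube60 -n 1e5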

You can give the "-mcmodel=medium" flag a try. This flag implies -Mlarge_arrays, which is for dynamically allocated memory, but also uses the medium memory model, allowing 64-bit offsets for data segments (i.e., large static data). I don't think you use anything like that, so I highly doubt it will help, but right now we're just poking at it.

Stack overflows can also cause illegal address errors. I highly doubt that's the case here, but try setting NV_ACC_CUDA_STACKSIZE. Unlike the heap, which is unlimited (up to the available device memory), the stack does have a hard limit, and that limit depends on the device and CUDA driver version, so I don't know what it is. It will sometimes complain with a runtime error (too many resources) if it's too big, or other times silently set the maximum.

  1. What is the best way to produce a portable binary that can be executed across multiple generations of GPUs for deployment? With my CUDA code, I have been using -arch=sm_30 or whatever is the lowest arch supported by the CUDA version.

By default we use RDC to create a target binary for each device, with the target defined by the "-gpu=ccXX" flag, where "XX" is the compute capability. This can be a unified binary with multiple targets, which you can define as a comma-delimited list, "-gpu=cc50,cc60,cc70,...", or all supported targets with "-gpu=ccall".

Alternately, you can disable RDC via "-gpu=nordc". In this case the PTX gets embedded in the binary and JIT-compiled at runtime (this is nvcc's default). You lose some features without RDC, like calling device subroutines across compilation units or accessing global device data directly from device subroutines.
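
For illustration (the source file name and the cc list are only placeholders):

% nvc++ -mp=gpu -gpu=cc70,cc80,cc89 umcx.cpp -o umcx   # one device image per listed target
% nvc++ -mp=gpu -gpu=ccall umcx.cpp -o umcx            # all targets supported by this compiler release
% nvc++ -mp=gpu -gpu=nordc umcx.cpp -o umcx            # embed PTX and JIT-compile at runtime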

  2. If I also want the binary to be independent of the local GLIBC version, is there a flag to statically link libgcc and libstdc++? I've already used -static-nvidia. When I add -static to the nvc link line, it won't link with libcuda.

While we can link in our static libraries, a completely static binary isn't really possible anymore. Several system libs only provide a dynamic version. We have the same problem with the compiler. Basically, you need to pick the lowest-level OS/glibc you want to support and build on that system.

For libgcc and libstdc++, no, we don't have a convenience flag for those. You'd likely need to adjust the link line directly to do that. If you add the verbose flag, "-v", or the "-dryrun" flag, then you can see the compilation phases, including the link line. Now, I don't know what issues, if any, you might have statically linking them.

Also, we use the "acclnk" driver to do device and host linking. If you were doing host only, then the normal "ld" linker is used.
