How to use OpenMP map directive to map dynamic array inside a struct/class to the GPU?

I am trying to port a CUDA code (GitHub - fangq/mcx: Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator) to OpenMP+GPU offloading. I am familiar with basic OpenMP but I am totally new to GPU offloading.

My code is in C++ and is currently quite short (~400 lines with comments); you can see the full source code at umcx/src/umcx.cpp at main · fangq/umcx · GitHub.

I was able to build the code on an Ubuntu 22.04 Linux server with nvc++ 24.11. The system has an RTX 2080 GPU running driver 535.146.02 and CUDA 12.6, with g++ 11/12/13 installed. However, when running the compiled binary on the GPU, I got the following error:

Accelerator Fatal Error: call to cuMemcpyDtoHAsync returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution

Here are the commands to replicate this error:

git clone https://github.com/fangq/umcx.git
cd umcx/src
make nvc
./umcx cube60  # running the benchmark
compute-sanitizer ./umcx cube60   # check memory error using compute-sanitizer

When I run compute-sanitizer, the memory error appears to originate from reading/writing a dynamic array inside a class, specifically: umcx/src/umcx.cpp at 2a9c6f7316ca059c5ade3b184590198dae370e1e · fangq/umcx · GitHub

The structure of the class that caused this memory error and the map settings I used are shown below:

template<class T>
class MCX_volume {
    dim4 size;
    T* vol = nullptr;

  public:
    MCX_volume(uint32_t Nx, uint32_t Ny, uint32_t Nz, uint32_t Nt = 1, T value = 0.0) {
        size = (dim4) {Nx, Ny, Nz, Nt};
        vol = new T[Nx * Ny * Nz * Nt];
        for (uint64_t i = 0; i < (uint64_t)Nx * Ny * Nz * Nt; i++) {
            vol[i] = value;    // fill the volume with the initial value
        }
    }
    ~MCX_volume () {
        size = (dim4) {0, 0, 0, 0};
        delete [] vol;
    }
    T& get(int64_t idx) const  { // must be inside the volume
        return vol[idx];
    }
};
int main() {
    ...
    MCX_volume<int> inputvol(io.cfg["Domain"]["Dim"][0], io.cfg["Domain"]["Dim"][1], io.cfg["Domain"]["Dim"][2], 1, 1);
    MCX_volume<float> outputvol(io.cfg["Domain"]["Dim"][0], io.cfg["Domain"]["Dim"][1], io.cfg["Domain"]["Dim"][2]);
    #pragma omp target teams distribute parallel for \
    map(to: inputvol) map(to: prop) map(tofrom: outputvol) map(to: pos) map(to: dir) map(to: seeds) reduction(+ : energyescape)
     for (uint64_t i = 0; i < nphoton; i++) {
        MCX_rand ran(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
        MCX_photon p(pos, dir);
        p.run(inputvol, outputvol, prop, ran);
    }
    ...
}

The inputvol object provides a read-only 3-D integer array for the simulation, and outputvol provides a float 3-D array for saving the output. I used map(to: inputvol) map(tofrom: outputvol), but it appears this is not enough to map the dynamic array inside each object to the GPU.

Reading the OpenMP 5.1 examples document, I found the declare mapper directive examples on Pages 181 and 183. It appears that adding the following declare mapper statements after the MCX_volume class definition, like this

typedef MCX_volume<int> MCX_inputvol;
typedef MCX_volume<float> MCX_outputvol;
#pragma omp declare mapper(input: MCX_inputvol v) map(v, v.vol[0:v.dimxyzt])
#pragma omp declare mapper(output: MCX_outputvol v) map(v, v.vol[0:v.dimxyzt])

and then using map(mapper(input), to: inputvol) map(mapper(output), tofrom: outputvol) in the target directive should map these dynamically allocated buffers to the GPU for reading and reading/writing, respectively.

However, when compiling this updated code with nvc++, it complains with "error: invalid text in pragma", as if the mapper directive is not supported:

nvc++  -g -Wall -Wextra -std=c++14 -O3  -mp=gpu -Minfo=mp,accel -Minline -DGPU_OFFLOAD -fopenmp -c umcx.cpp -o umcx.o
"umcx.cpp", line 102: error: invalid text in pragma
  #pragma omp declare mapper(input: MCX_inputvol v) map(v, v.vol[0:v.dimxyzt])
                            ^
"umcx.cpp", line 103: error: invalid text in pragma
  #pragma omp declare mapper(output: MCX_outputvol v) map(v, v.vol[0:v.dimxyzt])
                            ^
"umcx.cpp", line 404: error: invalid text in pragma
      map(mapper(input), to: inputvol) map(to: prop) map(tofrom: outputvol) map(to: pos) map(to: dir) map(to: seeds) reduction(+ : energyescape)
          ^
"umcx.cpp", line 404: error: invalid text in pragma
      map(mapper(input), to: inputvol) map(to: prop) map(tofrom: outputvol) map(to: pos) map(to: dir) map(to: seeds) reduction(+ : energyescape)
          ^
"umcx.cpp", line 404: error: extra text after expected end of preprocessing directive
      map(mapper(input), to: inputvol) map(to: prop) map(tofrom: outputvol) map(to: pos) map(to: dir) map(to: seeds) reduction(+ : energyescape)

I am wondering what the right way is to map such a dynamic buffer to the GPU in a form that nvc++ supports?

To see the expected output, one can compile using g++ and basic OpenMP with

make clean
make
./umcx cube60

This should run properly on a multi-core CPU with multiple threads (to build without OpenMP, simply use make single).

I am also having a lot of trouble building this code using g++-13 with make nvidia CXX=g++-13

It seems nvc does not yet support the mapper(mapid) modifier; see HPC Compilers User's Guide Version 24.11 for ARM, OpenPower, x86.

Regardless, my question still stands - what is the proper omp map to map a dynamic array member of a struct or class to the GPU?

The easiest thing to do is use CUDA Managed or Unified memory, i.e. “-gpu=mem:managed” or, on systems that support it, “-gpu=mem:unified”. “Managed” implicitly allocates all dynamic data in a unified memory space that’s accessible on both the host and device, while “unified” shares all memory but requires HMM support. This way you don’t need to worry about data movement as the CUDA driver manages it for you. Also, if you’re using the STL, like “vector”, UM is really the only way to manage data since your program doesn’t have access to the underlying data.

Without UM, the way to do this is via a deep copy. While I don’t have any tutorials off-hand for OpenMP, OpenMP and OpenACC share the same underlying concepts for data management, so you can take a look at this article. Just replace OpenACC’s “copy” clauses with OpenMP’s “map” clauses.

For both, a data create/copy/update is a shallow copy. For aggregate types, it copies the fixed-size data members. For dynamically allocated members, only the pointer is copied, which will be a host address, so you then need to copy the pointed-to data separately. Provided the parent (i.e. the “this” pointer) of the class is present on the device, the runtime then “attaches” the child to the parent (i.e. fills in the device address of the dynamic member in the parent’s structure).
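
For the class in question, a minimal sketch of this manual deep copy might look like the following (assuming the vol pointer and a dimxyzt element count are accessible at the call site, as they appear to be in the later pragmas in this thread; untested):

    // Map the parent object first (shallow copy of the fixed-size members and the
    // host value of the "vol" pointer), then map the dynamic buffer; because the
    // parent is already present, the runtime attaches the device copy of "vol" to it.
    #pragma omp target enter data map(to: inputvol)
    #pragma omp target enter data map(to: inputvol.vol[0:inputvol.dimxyzt])

    #pragma omp target teams distribute parallel for
    for (uint64_t i = 0; i < nphoton; i++) {
        // the kernel body can now dereference inputvol.vol on the device
    }

    // release in the reverse order when done
    #pragma omp target exit data map(release: inputvol.vol[0:inputvol.dimxyzt])
    #pragma omp target exit data map(release: inputvol)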

A while ago I wrote an example container class, “accList.h”, for the book “Parallel Programming with OpenACC”, which might help with the concept as well.


@MatColgrove, thank you so much for your comment. Really helpful.

I first tried -gpu=mem:unified; nvc++ compiled the code without a problem, but running it returned an error:

Accelerator Fatal Error: The application was compiled with -gpu=unified, but this platform does not support Unified Memory.

Googling unified memory, I noticed that it appears to be supported only on Hopper platforms, which I don’t have.

Then I tried -gpu=mem:managed; nvc++ compiles the code without issue, and the code also runs without any memory error (verified with compute-sanitizer). However, the output is incorrect. It seems that the buffers passed to the GPU are all zeros; they are allocated on the GPU, but somehow the buffer contents (especially inputvol.vol[], which is expected to be all 1s) were not copied from the host. Because the input buffer is all zero, every photon terminates immediately after launch.

So my follow-up question is: when using -gpu=mem:managed, is map(to: inputvol) sufficient to map the dynamically allocated buffer? I added map(to: inputvol.vol[0:inputvol.dimxyzt]) to my OpenMP directives; it compiles, but the simulation still returns zeros.

The keyword “deep-copy” you mentioned is really helpful.

I searched for deep copy in the OpenMP examples document (Pages 161 and 183); it seems the proper mechanism in OpenMP 5 to enable a deep copy is declare mapper(), but unfortunately nvc does not yet support it.

Reading your article on deep copy, I saw you mentioned that OpenACC 2.6 supports copying struct-embedded dynamic arrays with a struct-member-based length, something like

typedef class{
   float *x, *y, *z;
   float coefx, coefy, coefz;
   size_t n;
 } points1;

 points1 p1;
 ...
 #pragma acc enter data copyin(p1.x[0:p1.n], p1.y[0:p1.n], p1.z[0:p1.n])
 #pragma acc enter data copyin(p1) attach(p1.x, p1.y, p1.z)
 #pragma acc parallel loop default(present) ...
 for (i = 0; i < p1.n; ++i) {
   p1.x[i] = ....
 }

I am wondering if you can advise how to properly translate the above OpenACC directives to nvc+OpenMP 5?

In the OpenMP document, I do not see an “attach()” construct. The examples under Section 5.4 “Structure mapping” (Page 163) seem to use map() for such cases, but the length there is always a constant available at compile time, not a variable like p1.n in your ACC example.

Sorry for the repeated replies; I’ve made some progress following your deep-copy ACC examples.

By using a combination of -gpu=mem:managed with nvc++ and the following OpenMP pragma

    #pragma omp target teams distribute parallel for \
    map(alloc: inputvol.vol)  map(to: inputvol.vol[0:inputvol.dimxyzt]) map(alloc: outputvol.vol) map(tofrom: outputvol.vol[0:outputvol.dimxyzt]) \
    map(to: pos) map(to: dir) map(to: seeds) reduction(+ : energyescape)

I was able to build a binary and produce correct results! The speed I got using the offloaded code is 6189 photon/ms on the 2080S GPU, which is about 3x faster than the multi-threaded version (2382 photon/ms) running with 128 threads on the 64-core Threadripper 3990X CPU on the same server.

The updated code and changes to incorporate this can be found at

The initial speed of this GPU-offloaded C++ code is decent, but compared to my CUDA implementation of the same algorithm (with a much more sophisticated kernel), the OpenMP version is about 6x slower (6189 photon/ms vs 38654 photon/ms). I am sure there is a lot I can optimize - now that it is working!

I have a number of follow-up questions:

  1. How do I set the total number of launched GPU threads? My past experience tells me that I need to launch tens of thousands of threads to saturate the hardware and maximize occupancy. For my CUDA code, the heuristic I used launches 98304 threads with a block size of 64. How do I achieve a similar configuration in nvc+OpenMP? Does nvc have a flag to print thread/block numbers for diagnosis?
  2. How do I map some of the variables to constant memory?

thanks again!

No, HMM can be supported on x86 systems as well; it just needs a newer Linux kernel installed. It is, however, advantageous to be on a Grace-Hopper system since data transfers are significantly faster over NVLink. Using it over PCIe can be a bit slow, especially when the GPU is directly accessing host memory.

For Managed, only dynamically allocated memory is put in Unified Memory. So while dynamic class members would be managed, a static parent object is not, so it still needs to be added to a map clause. For example, you have:

map(alloc: inputvol.vol) map(to: inputvol.vol[0:inputvol.dimxyzt]) map(alloc: outputvol.vol) map(from: outputvol.vol[0:outputvol.dimxyzt])

Given “inputvol” and “outputvol” are static, they are needed in a map clause, but “vol” is dynamic so it is not. It doesn’t hurt to keep the vol maps there, but at runtime that data will already be “present”, so the allocs become no-ops.

Granted I haven’t tested your code, but you should be able to reduce this to:

map(alloc: inputvol, outputvol)
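
In other words, with -gpu=mem:managed handling the dynamic “vol” buffers, the whole directive could in principle shrink to something like this (again, untested on your code):

    #pragma omp target teams distribute parallel for \
        map(alloc: inputvol, outputvol) \
        map(to: pos, dir, seeds) reduction(+ : energyescape)
    for (uint64_t i = 0; i < nphoton; i++) {
        // the managed "vol" buffers need no explicit map clauses here
    }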

As for “mapper”, I’m not sure what our plans for it are. It’s basically the same as OpenACC’s shape/policy pragmas, an experimental construct we proposed a while ago, though they were never adopted in the standard. There’s merit to using pragmas such as these to define the shape and size of a data structure, but most folks found explicit deep copy easier and felt it gave more direct control. Hence the effort was abandoned.

In the OpenMP document, I do not see an “attach()” construct.

Correct, OpenMP didn’t adopt the stand-alone attach, but it uses the concept when implicitly attaching dynamic data members. The stand-alone attach isn’t used that much, but it is useful when a program already has device memory allocated that later needs to be associated. With implicit attach, the parent must be present before creating the child on the device.

How do I set the total number of launched GPU threads?

By default, the compiler runtime will determine the launch configuration based on the loop trip count for the parallel loop or loops if collapsed.

You can override this by setting “num_teams” (CUDA blocks) and “num_threads” (CUDA threads per block), but in general it’s best to start with the compiler default and only use these settings as tuning parameters. The compiler doesn’t always choose the optimal schedule, but I find it’s not often that I can beat it, so it’s best to just let it choose.
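
For example (illustrative values only, not a recommendation for this particular code):

    // 1024 teams (CUDA blocks), 64 threads per team; num_threads on the parallel
    // construct or thread_limit on teams can both be used to cap the block size
    #pragma omp target teams distribute parallel for num_teams(1024) thread_limit(64)
    for (uint64_t i = 0; i < nphoton; i++) {
        // ...
    }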

Does nvc have a flag to print thread/block numbers for diagnosis?

Are you looking for just the launch configuration or wanting to get the team/thread id in the code?

For the launch config, run the program under Nsight-Systems and it will show this in the profile. I haven’t tried it myself, but calling “omp_get_num_teams” and “omp_get_num_threads” might work. Though be sure to have only one thread print these, as it would be a lot of output otherwise.

For the IDs, use “omp_get_team_num” and “omp_get_thread_num”. Though given the volume of teams and threads, I wouldn’t advise printing all of these out.
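
Something along these lines (untested) should print the configuration exactly once without flooding the output:

    #include <omp.h>
    #include <cstdio>
    // ...
    #pragma omp target teams distribute parallel for
    for (uint64_t i = 0; i < nphoton; i++) {
        if (i == 0) {   // only one iteration prints
            printf("teams=%d, threads per team=%d\n",
                   omp_get_num_teams(), omp_get_num_threads());
        }
        // ... rest of the loop body ...
    }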

  2. How do I map some of the variables to constant memory?

As far as I’m aware, the directive-based approaches don’t have ways to manage constant memory. I think the OpenACC technical committee was thinking of ways to support this generically, but I’m not sure about OpenMP. In part, constant memory is an NVIDIA feature, and the standards tend to steer away from vendor-specific items.

The compiler will implicitly use constant memory, primarily for global static read-only data like Fortran parameters, but overall it’s tricky to manage. Given that constant memory is a fixed size and silently fails if you exceed it, we really needed to limit its use to objects whose size is known at compile time, to make sure everything fits.

One thought, and I should really try it before suggesting it (I haven’t), is that since nvc++ does have limited support for CUDA C++, you might be able to use the “constant” attribute (though add the “-cuda” flag to enable CUDA). The caveat is that you’ve now introduced CUDA to the code, thus reducing portability, which is likely why you’ve moved to OpenMP.


I forgot to ask if you could post the input file? I wanted to take a look to see if I could offer any suggested performance improvements, but couldn’t run the binary due to the missing input.

I forgot to ask if you could post the input file? I wanted to take a look to see if I could offer any suggested performance improvements, but couldn’t run the binary due to the missing input.

I was just writing an update on the thread numbers, but let me first answer your question above - the code itself has a built-in benchmark, you can just type

./umcx cube60

It will create a JSON construct internally; see the code at umcx/src/umcx.cpp at 692d8ca83f4ed7e3a8e2d3c833c09b873deb1c9b · fangq/umcx · GitHub

I was able to see the thread/block configuration by running nsight-compute, via ncu ./umcx cube60

It appears that the nvc-compiled binary always launches one thread per for-loop iteration. In this case, the thread count is nphoton=1e7:

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                    64
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                156,250
    Registers Per Thread             register/thread              62
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block        byte/block               0
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               8
    # SMs                                         SM              48
    Threads                                   thread      10,000,000
    Uses Green Context                                             0
    Waves Per SM                                              203.45
    -------------------------------- --------------- ---------------

This is a big issue for this particular code, because each simulated photon follows a random path using a Monte Carlo method; many of the photons traverse very short distances, while others traverse very long paths. Launching one thread per photon therefore results in poor load balancing.

I tried to alter the total thread count by adding num_threads(10000) to the OMP pragma, or setting OMP_NUM_THREADS=10000 in the environment when running the simulation; however, neither of these settings seems to be effective - nvc++ always seems to use the for-loop trip count as the total thread count.

I was able to manually alter this behavior by using a double loop - conceptually similar to what I did in my CUDA code (each outer-loop iteration corresponds to one thread of the kernel):

    #pragma omp target teams distribute parallel for thread_limit(64) ...
    for (uint64_t i = 0; i < nphoton/100; i++) {
      ran.reseed(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
      for (int j = 0; j < 100; j++) {
        p.launch(pos, dir);
        p.run(inputvol, outputvol, prop, ran);
        energyescape += p.pos.w;
      }
    }

The above change resulted in a 3x speed increase, bumping the speed from 6800 photon/ms to 19,000 photon/ms, as expected. It is still about 2x slower than the full-featured CUDA implementation, but a big improvement nevertheless.

Alternatively, I was also able to achieve a similar speed-up using num_teams() + thread_limit() as you suggested, without needing a double loop:

    #pragma omp target teams distribute parallel for num_teams(200000/64) thread_limit(64) ...
    for (uint64_t i = 0; i < nphoton; i++) {
        ran.reseed(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
        p.launch(pos, dir);
        p.run(inputvol, outputvol, prop, ran);
        energyescape += p.pos.w;
    }

I thought num_threads() sets the total thread count in a regular OMP pragma, but in nvc it behaves like thread_limit() on teams and sets the block size, as you described.

Now, my next question: in my CUDA code, I used a heuristic to decide the total number of threads,

using a formula: totalthread = blocksize * SM_count * max_block_per_SM * (CC > 75 ? 2 : 1)

I am wondering if there is a way I can obtain the SM count and CC version in OpenMP? Of course, my ultimate goal is to make this num_teams()/thread_limit() setting also scale across CPUs and other GPU devices, like I did in my OpenCL implementation.

Not that I’m aware of. You’ll likely need to call cudaGetDeviceProperties.
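
For example, a sketch along these lines (untested; it needs the CUDA runtime linked in, e.g. building with -cuda) would give you the SM count and compute capability:

    #include <cuda_runtime.h>

    // query SM count and compute capability of device 0 via the CUDA runtime API
    static void query_gpu(int& sm_count, int& cc) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
            sm_count = prop.multiProcessorCount;
            cc = prop.major * 10 + prop.minor;   // e.g. 75 for a 2080, 89 for a 4090
        }
    }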


Happy New Year @MatColgrove.

I made a lot of progress over the last few days, especially following your comment on StackOverflow that finally made g++ nvptx offloading compile.

My main interest at this moment is still to find ways to optimize the nvc++-compiled code to match the performance of the CUDA code. Currently there is a 2x speed difference (the CUDA code can be downloaded at GitHub - fangq/mcx: Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator; compile and run with cd src && make && ../bin/mcx --bench cube60).

Yesterday, I tried using templates to specialize my simulation functions, in the hope of reducing the register count. However, it did not seem to make any difference for some reason.

Currently I have two built-in benchmarks: umcx cube60, which ignores boundary reflection, and umcx cube60b, which handles reflection calculations (involving 3 compute-heavy functions).

Using a const template flag, I was hoping that the compiler could strip out those reflect/transmit/reflectcoeff functions and thus cut down the register count. When I compile this templated code, I see nvc++ produce PTX for run<false|true>() and sprint<false|true>(), but when running either of my benchmarks, ncu still reports 68 registers regardless of whether reflection is needed or not.

I am wondering if there is any optimization flag I can pass to nvc++ so that it uses the const template parameter to further optimize the code?

I was only able to see the register count drop from 68 to 60 when I manually deleted the reflection-related code from the class.
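
For reference, the pattern I am using is roughly the following (heavily simplified, not the actual umcx functions):

    // the reflection branch is guarded by a bool template parameter, so for
    // isreflect == false the reflection-related code should be dead and strippable
    template<bool isreflect>
    float run_photon(/* volume, optical properties, RNG, ... */) {
        float weight = 1.0f;
        // ... free-flight and scattering steps ...
        if (isreflect) {
            // boundary reflection/transmission handling would only be needed here
        }
        return weight;
    }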

Hi Feng, happy New Year to you as well.

I built your CUDA version and compared its “--set=full” Nsight-Compute profile against the OpenMP version. In all but two key areas, the OpenMP profile actually looks better than the CUDA version. Note that I set the blocks in OpenMP to 7296 to match the CUDA version’s launch configuration, and I’m running on an H100.

OpenMP’s register usage was lower (66 vs 80) and it had a higher achieved occupancy (39% vs 30%), so trying to lower the register usage isn’t going to help much.

The SOL was better for OpenMP as well, with compute at 75% vs 59% and memory at 19% vs 7%.

Both versions have good cache hit rates, at just under 100% for the L2 and 72% (OpenMP) vs 40% (CUDA) for the L1. However, the volume of data is significantly higher with OpenMP: about ~18GB transferred in L2 and ~48GB in L1, vs ~2.5GB and ~5GB with CUDA. Both had low global memory usage (in the MBs).

The main performance issue is the sheer number of instructions executed: OpenMP has about 74 billion vs CUDA at 17 billion. FFMA has the largest instruction count, at 12 billion vs 800 million. The OpenMP version is doing significantly more work.

I’m not sure why this is, but likely either the workload or the algorithm is not equivalent.

-Mat

Thanks again for your prompt and helpful reply.

I apologize for the confusion - yes, the two benchmarks are currently not equivalent in terms of total simulation and memory demand. The current built-in benchmark of umcx/OpenMP simulates 1e7 photons while that of mcx/CUDA simulates 1e6 photons, so 10x more computation is expected; also, umcx/OpenMP cube60 saves results in 5 time gates, while mcx/CUDA cube60 saves only a single time gate, so the OpenMP version is expected to use about 5x more global memory for output. (My focus is on the simulation speed (photon/ms), which is somewhat independent of the photon number difference.)

If you have a high-end GPU, 1e6 to 1e7 photons is usually too small a workload to use. To make these two simulations equivalent, you need to pull umcx’s code again with this time-gate fix,

then compile umcx using make nvc, and run both jobs with

umcx --bench cube60 -n 1e8
mcx --bench cube60 -n 1e8

The above settings are expected to produce similar compute and memory traffic.

Better metrics from umcx/OpenMP compared to mcx/CUDA are expected, because the CUDA kernel is quite complex and feature-rich (~2200 lines, compared to ~270 lines of omp-target code in umcx). However, if you compare the simulation speed (printed at the end of either simulation), you should notice that the CUDA code still comes out on top, roughly 2x faster on my GPUs.

Can you run the above two equivalent workload commands and let me know what you see in terms of speed (photon/ms)? If the simulation ends too fast, you can also change -n 1e8 to -n 5e8 or 1e9 to further increase the photon count.

You can also change the benchmark from cube60 to cube60b; you should see more registers used in mcx/CUDA, but the OpenMP register count stays the same.

Update: I compared the OpenMP and CUDA implementations on 3 of my servers with 3 different GPUs. It appears that the speed difference is largest on the 2080S GPU (~2.1x), but the gap narrows on the 4070Ti Super; on the 4090, the OpenMP code is actually 20% faster. All tests ran -n 5e8 photons.

Maybe the driver versions have an impact on the OpenMP performance?

| Benchmark | Code | RTX 2080 S (535.146.02/T# 98304) | RTX 4070Ti S (535.183.01/T# 270336) | RTX 4090 (535.183.06/T# 524288) |
|---|---|---|---|---|
| cube60 | umcx/OpenMP | 33322.15 p/ms | 107160.76 p/ms | 170453.31 p/ms |
| cube60 | mcx/CUDA | 86043.71 p/ms | 129769.01 p/ms | 144550.45 p/ms |
| cube60b | umcx/OpenMP | 18990.37 p/ms | 64127.28 p/ms | 109554.96 p/ms |
| cube60b | mcx/CUDA | 40990.33 p/ms | 82372.32 p/ms | 96842.92 p/ms |

The numbers in parentheses after each GPU indicate the installed driver version, and T# is the CUDA-derived thread count, which is hard-coded in umcx.cpp to match the CUDA version's total thread count.

I also want to add that this is not strictly a fair comparison, because umcx/OpenMP is missing more than half of the advanced functionality in the CUDA version. So, it is expected to outperform the CUDA version since it has a much simpler kernel.
