I am trying to port a CUDA code (GitHub - fangq/mcx: Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator) to OpenMP+GPU offloading. I am familiar with basic OpenMP but I am totally new to GPU offloading.
My code is in C++ and is currently quite short (~400 lines with comments), you can see the full source code at umcx/src/umcx.cpp at main · fangq/umcx · GitHub.
I was able to build the code on Ubuntu 22.04 Linux server with nvc++ 24-11. The system has a RTX 2080 GPU running on driver 535.146.02, cuda 12.6, with g++ 11/12/13 installed. However, when running the compiled binary on the GPU, I got the following error
Accelerator Fatal Error: call to cuMemcpyDtoHAsync returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
here are the commands to replicate this error
git clone https://github.com/fangq/umcx.git
cd umcx/src
make nvc
./umcx cube60 # running the benchmark
compute-sanitizer ./umcx cube60 # check memory error using compute-sanitizer
when I run compute-sanitizer
, the memory error appears to originate from reading/writing a dynamic array inside a class, specifically: umcx/src/umcx.cpp at 2a9c6f7316ca059c5ade3b184590198dae370e1e · fangq/umcx · GitHub
The structure of the related class that caused this memory error and the used map settings are shown below
template<class T>
class MCX_volume {
dim4 size;
T* vol = nullptr;
public:
MCX_volume(uint32_t Nx, uint32_t Ny, uint32_t Nz, uint32_t Nt = 1, T value = 0.0) {
size = (dim4) {Nx, Ny, Nz, Nt};
vol = new T[Nx*Ny*Nz*Nt]();
}
~MCX_volume () {
size = (dim4) {0, 0, 0, 0};
delete [] vol;
}
T& get(int64_t idx) const { // must be inside the volume
return vol[idx];
}
}
void main() {
...
MCX_volume<int> inputvol(io.cfg["Domain"]["Dim"][0], io.cfg["Domain"]["Dim"][1], io.cfg["Domain"]["Dim"][2], 1, 1);
MCX_volume<float> outputvol(io.cfg["Domain"]["Dim"][0], io.cfg["Domain"]["Dim"][1], io.cfg["Domain"]["Dim"][2]);
#pragma omp target teams distribute parallel for \
map(to: inputvol) map(to: prop) map(tofrom: outputvol) map(to: pos) map(to: dir) map(to: seeds) reduction(+ : energyescape)
for (uint64_t i = 0; i < nphoton; i++) {
MCX_rand ran(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
MCX_photon p(pos, dir);
p.run(inputvol, outputvol, prop, ran);
}
...
}
the inputvol
object provides a read-only 3-D integer array for the simulation, and outputvol
provides a float 3-D array for saving output. I called map(to: inputvol) map(tofrom: outputvol)
, but it appears it is not enough to map the dynamic array in each object to the GPU.
By reading the OpenMP 5.1 examples document, I saw the declare mapper
directive example on Page 181, and another one on 183, it appears that using the following declare mapper
statement after the class MCX_volume
definition, like this
typedef MCX_volume<int> MCX_inputvol;
typedef MCX_volume<float> MCX_outputvol;
#pragma omp declare mapper(input: MCX_inputvol v) map(v, v.vol[0:v.dimxyzt])
#pragma omp declare mapper(output: MCX_outputvol v) map(v, v.vol[0:v.dimxyzt])
then adding map(mapper(input), to: inputvol) map(mapper(output), tofrom: outputvol)
should allow mapping these dynamically allocated buffers to the GPU for reading and reading/writing.
However, when compiling this updated code using nvc++, it complains that " error: invalid text in pragma", as if mapper
directive is not supported
nvc++ -g -Wall -Wextra -std=c++14 -O3 -mp=gpu -Minfo=mp,accel -Minline -DGPU_OFFLOAD -fopenmp -c umcx.cpp -o umcx.o
"umcx.cpp", line 102: error: invalid text in pragma
#pragma omp declare mapper(input: MCX_inputvol v) map(v, v.vol[0:v.dimxyzt])
^
"umcx.cpp", line 103: error: invalid text in pragma
#pragma omp declare mapper(output: MCX_outputvol v) map(v, v.vol[0:v.dimxyzt])
^
"umcx.cpp", line 404: error: invalid text in pragma
map(mapper(input), to: inputvol) map(to: prop) map(tofrom: outputvol) map(to: pos) map(to: dir) map(to: seeds) reduction(+ : energyescape)
^
"umcx.cpp", line 404: error: invalid text in pragma
map(mapper(input), to: inputvol) map(to: prop) map(tofrom: outputvol) map(to: pos) map(to: dir) map(to: seeds) reduction(+ : energyescape)
^
"umcx.cpp", line 404: error: extra text after expected end of preprocessing directive
map(mapper(input), to: inputvol) map(to: prop) map(tofrom: outputvol) map(to: pos) map(to: dir) map(to: seeds) reduction(+ : energyescape)
I am wondering what is the right way to map such dynamic buffer to the GPU that is supported by nvc++?