How to map private dynamic array to the GPU with OpenMP and nvc?

FangQ · January 10, 2025, 11:36pm

Thank you again, @MatColgrove, for your prompt and helpful reply.

It appears that malloc() + memset() does work, whereas calloc() is not supported.

I noticed that dynamically allocating the thread-private buffer this way does work with nvc, but it comes with a significant performance hit: my program runs 5x to 10x slower than with a statically sized local array, say float buf[10] = {0.f};

You can see this significant speed difference by checking out the latest version of the code that I used in another thread.

Here is how I compared the two variants:

https://github.com/fangq/umcx/blob/10dbca76e8c6f03ad6470c4ce38f636641c7c831/src/umcx.cpp#L659-L664

          #ifdef USE_MALLOC
                  float* detphotonbuffer = (float*)malloc(sizeof(float) * detdata.ppathlen);
                  memset(detphotonbuffer, 0, sizeof(float) * detdata.ppathlen);
          #else
                  float detphotonbuffer[10] = {0.f};   // TODO: if changing 10 to detdata.ppathlen, speed of nvc++ built binary drops by 5x to 10x
          #endif
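
For anyone who wants to try the comparison without cloning the repository, the two variants can be reproduced in isolation roughly as follows. This is only a minimal sketch and not code from umcx: use_buffer, run_malloc, run_static, and MAXLEN are hypothetical names, and the loop body is a trivial stand-in for the real per-photon work.

```cpp
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for the per-photon work: fill a private scratch
// buffer and reduce it to a single number.
#pragma omp declare target
static float use_buffer(float* buf, int len, int i) {
    float sum = 0.f;
    for (int k = 0; k < len; k++) {
        buf[k] += (float)((i + k) % 7);
        sum += buf[k];
    }
    return sum;
}
#pragma omp end declare target

// Variant A: every iteration allocates its private buffer with
// malloc() + memset() inside the target region (the slow case in my tests).
double run_malloc(int n, int len) {
    double total = 0.0;
    #pragma omp target teams distribute parallel for reduction(+:total) map(tofrom: total)
    for (int i = 0; i < n; i++) {
        float* buf = (float*)malloc(sizeof(float) * len);
        if (!buf) continue;  // device heap exhausted: skip this iteration
        memset(buf, 0, sizeof(float) * len);
        total += use_buffer(buf, len, i);
        free(buf);
    }
    return total;
}

// Variant B: fixed-size local array (the fast case); MAXLEN is an assumed
// compile-time upper bound on the buffer length.
constexpr int MAXLEN = 10;

double run_static(int n, int len) {
    assert(len <= MAXLEN);  // host-side check before launching the loop
    double total = 0.0;
    #pragma omp target teams distribute parallel for reduction(+:total) map(tofrom: total)
    for (int i = 0; i < n; i++) {
        float buf[MAXLEN] = {0.f};
        total += use_buffer(buf, len, i);
    }
    return total;
}
```

Both functions should return the same total for the same n and len, so timing a call to each (built with something like nvc++ -mp=gpu) isolates the cost of the allocation itself. Without OpenMP enabled the pragmas are ignored and both loops simply run on the host.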

If you download the latest code from https://github.com/fangq/umcx and compile it with make nvc, it uses the static array. Then run the benchmark:

git clone https://github.com/fangq/umcx.git
cd umcx/src
make clean
make nvc  # use static private array
../bin/umcx --bench cube60 -n 1e7

You can see that the above simulation runs relatively fast (it is about 3x slower than my tests in an earlier benchmark, because the template is not working in nvc).

However, if you recompile it using malloc/memset:

make clean
make nvc USERCXXFLAGS=-DUSE_MALLOC
../bin/umcx --bench cube60 -n 1e7

you can see that the simulation now runs 5x to 10x slower.

I am wondering if you can suggest an alternative approach to create a thread-private buffer that does not have such a high overhead?
