I am porting an application from CUDA to OpenMP offload. The application works as expected, but nvprof/nvvp reports a high number of H2D and D2H calls. I presume this is because the original code uses cudaMallocHost, whereas the OpenMP offload code uses plain malloc. Is it possible to pin host memory for zero-copy access with OpenMP offload?
This application needs to stay portable, so I’d like to avoid solutions that mix CUDA code with OpenMP offload (AMD and Intel support is also needed). With Clang, the following worked:
omp_alloc(size, llvm_omp_target_host_mem_alloc);
Is there something similar for NVHPC?
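For reference, here is a minimal C sketch of how that allocator call can fit into offload code. It makes a few assumptions. USE_LLVM_HOST_MEM_ALLOC is a hypothetical build macro that you would define only when compiling with Clang and LLVM's libomp, because llvm_omp_target_host_mem_alloc is an LLVM extension, not standard OpenMP. Other OpenMP builds fall back to the standard omp_default_mem_alloc, and non-OpenMP builds fall back to malloc. The helper names are illustrative. The sketch still maps the buffer explicitly, so it relies on pinning to speed up the copies rather than on zero-copy access.

```c
#include <stddef.h>
#include <stdlib.h>
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* USE_LLVM_HOST_MEM_ALLOC is a hypothetical macro: define it only when
 * building with Clang + LLVM libomp, where llvm_omp_target_host_mem_alloc
 * (an LLVM extension) returns pinned host memory. */
#if defined(_OPENMP) && defined(USE_LLVM_HOST_MEM_ALLOC)
#define HOST_ALLOCATOR llvm_omp_target_host_mem_alloc
#elif defined(_OPENMP)
#define HOST_ALLOCATOR omp_default_mem_alloc /* standard, pageable memory */
#endif

/* Allocate n doubles on the host through the selected allocator. */
static double *alloc_host_buffer(size_t n)
{
#ifdef _OPENMP
    return (double *)omp_alloc(n * sizeof(double), HOST_ALLOCATOR);
#else
    return (double *)malloc(n * sizeof(double)); /* no OpenMP: plain malloc */
#endif
}

/* Release a buffer from alloc_host_buffer; omp_alloc pairs with omp_free. */
static void free_host_buffer(double *p)
{
#ifdef _OPENMP
    omp_free(p, HOST_ALLOCATOR);
#else
    free(p);
#endif
}

/* The buffer is still mapped explicitly; pinning only affects copy speed. */
static void scale_on_device(double *a, size_t n, double factor)
{
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] *= factor;
}
```

Memory obtained from omp_alloc must be released with omp_free rather than free, which is why the allocation and release are wrapped together behind the same allocator handle.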
Sorry, no, we don’t have anything similar to Clang’s “llvm_omp_target_host_mem_alloc” allocator. However, I added an RFE (TPR#34218) and sent it to our engineers for consideration when/if they look at adding custom allocator support.
Also, it’s my understanding that cudaHostAlloc with the cudaHostAllocMapped flag is what enables zero-copy access. cudaMallocHost just allocates the host array in pinned memory.
For the cudaMallocHost case, we do have the “-gpu=pinned” flag, which allocates host memory as pinned memory. It can improve data transfer performance at the cost of a higher allocation time, so it only really benefits code that has few allocations but many data transfers. Otherwise, our runtime’s default pinned double-buffering method works better.
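Here is a minimal C sketch of the access pattern that is said to benefit from -gpu=pinned: one allocation followed by many host/device transfers. The build line in the comment is an assumption that combines NVHPC's -mp=gpu with the -gpu=pinned flag from the reply. Per the reply, the flag controls how the host memory is allocated, so the source itself does not change. It is worth checking the NVHPC documentation for your version to see exactly which allocations the flag covers. run_steps is a hypothetical helper name. Without OpenMP, the pragmas are ignored and the loop simply runs on the host.

```c
/* Assumed NVHPC build line: nvc -mp=gpu -gpu=pinned pinned_demo.c */
#include <stdlib.h>
#include <assert.h>

/* Hypothetical helper: the array is allocated once by the caller, but each
 * step copies it to the device and back, so transfer cost dominates. */
static void run_steps(double *a, size_t n, int steps)
{
    /* Create the device copy once, outside the loop. */
    #pragma omp target enter data map(alloc: a[0:n])
    for (int s = 0; s < steps; ++s) {
        #pragma omp target update to(a[0:n])      /* H2D transfer */

        /* a[] is already present on the device, so this map copies nothing. */
        #pragma omp target teams distribute parallel for map(alloc: a[0:n])
        for (size_t i = 0; i < n; ++i)
            a[i] += 1.0;

        #pragma omp target update from(a[0:n])    /* D2H transfer */
        /* Host-side work on a[] would go here. */
    }
    #pragma omp target exit data map(delete: a[0:n])
}
```

Because the allocation happens once and the transfers repeat every step, any extra allocation cost from pinning is paid once, while any transfer speedup applies on every iteration.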