Hello everyone!
I was testing the interoperability between OpenACC and OpenMP; more specifically, I was trying to allocate and copy memory to the device using OpenACC and then use it in an OpenMP target region. Following some examples shown in an OpenMP workshop I participated in, I created the following example (compiled with NVHPC 21.5: nvc++ -mp=gpu -acc -cuda test.cpp):
#include <iostream>
#include <omp.h>
#include <openacc.h>

int main() {
    int N = 100000;
    int *x = new int[N];

    std::cout << "Number of devices : " << omp_get_num_devices() << std::endl;

    // Initialize on the host
    for (int i = 0; i < N; ++i)
        x[i] = 1;

    // Copy x to the device with OpenACC ...
    #pragma acc data copy(x[0:N])
    {
        // ... and update it from an OpenMP target region
        #pragma omp target loop
        for (int i = 0; i < N; ++i)
            x[i] = 2;
    }

    for (int i = 0; i < N; ++i)
        std::cout << x[i] << " ";
    std::cout << std::endl;

    delete[] x;
    return 0;
}
and observed the following behavior (output of nvprof --print-gpu-trace ./a.out):
==23910== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Unified Memory Virtual Address Name
513.74ms 36.128us - - - - - 390.63KB 10.311GB/s Pinned Device Tesla V100-SXM2 1 7 - - [CUDA memcpy HtoD]
534.08ms - - - - - - - - - - - - - PC 0xeb76216f 0x7fff32000000 [Unified Memory CPU page faults]
534.41ms 1.3760us - - - - - 8B 5.5446MB/s Pageable Device Tesla V100-SXM2 1 7 - - [CUDA memcpy HtoD]
535.64ms 2.6240us (782 1 1) (128 1 1) 16 0B 0B - - - - Tesla V100-SXM2 1 17 - - nvkernel_main_F1L14_1 [515]
535.69ms 31.936us - - - - - 390.63KB 11.665GB/s Device Pinned Tesla V100-SXM2 1 7 - - [CUDA memcpy DtoH]
I have two questions regarding the above:
- Is there really only one copy done from the host to the device? (See the sketch after this list for how I tried to convince myself.)
- Where is the Unified Memory CPU page fault coming from?
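To double-check the first point, I was thinking of querying both runtimes. This is only a sketch of my understanding, assuming that under NVHPC acc_is_present and omp_target_is_present consult the same shared present table (which is exactly the assumption I'm asking about):

#include <iostream>
#include <omp.h>
#include <openacc.h>

int main() {
    int N = 100000;
    int *x = new int[N];

    #pragma acc data copy(x[0:N])
    {
        // If the two runtimes share one present table, both queries should
        // report x as already resident, so no second HtoD copy is needed.
        std::cout << "acc_is_present: "
                  << acc_is_present(x, N * sizeof(int)) << std::endl;
        std::cout << "omp_target_is_present: "
                  << omp_target_is_present(x, omp_get_default_device())
                  << std::endl;
    }

    delete[] x;
    return 0;
}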
I’ve also created a slightly more complicated example, where I again copy an array x to the GPU, this time using acc_copyin, and then also associate another pointer with part of the copied array.
#include <iostream>
#include <omp.h>
#include <openacc.h>

int main() {
    int N = 10;
    int *x = new int[N];

    std::cout << "Number of devices : " << omp_get_num_devices() << std::endl;

    for (int i = 0; i < N; i++) {
        x[i] = i;
    }

    // Copy x to the device with OpenACC and get the device pointer back
    int *x_dev = (int *)acc_copyin(x, N * sizeof(int));

    // Associate y with 5 ints of the device copy of x,
    // starting at a device offset of N-5 (= 5) ints
    int *y;
    omp_target_associate_ptr(y, x_dev, 5 * sizeof(int), (N - 5) * sizeof(int), 0);

    #pragma omp target loop map(x[0:N], y[0:5])
    for (int i = 0; i < N; ++i) {
        x[i] = 1;
        if (i < 5) {
            y[i] += 1; // y[i] aliases x_dev[5 + i] through the association
        }
    }

    // Copy the result back and release the OpenACC mapping
    #pragma acc exit data copyout(x[0:N])

    for (int i = 0; i < N; ++i)
        std::cout << x[i] << " ";
    std::cout << std::endl;

    delete[] x;
    return 0;
}
The output of nvprof is:
srun -n 1 nvprof --print-gpu-trace ./a.out
==171271== NVPROF is profiling process 171271, command: ./a.out
Number of devices : 1
1 1 1 1 1 2 2 2 2 2
==171271== Profiling application: ./a.out
==171271== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Unified Memory Virtual Address Name
501.89ms 1.7280us - - - - - 40B 22.076MB/s Pinned Device Tesla V100-SXM2 1 7 - - [CUDA memcpy HtoD]
522.24ms - - - - - - - - - - - - - PC 0xac26816f 0x2aaaf0000000 [Unified Memory CPU page faults]
522.57ms 1.3760us - - - - - 8B 5.5446MB/s Pageable Device Tesla V100-SXM2 1 7 - - [CUDA memcpy HtoD]
523.80ms 1.7280us (1 1 1) (32 1 1) 16 0B 0B - - - - Tesla V100-SXM2 1 14 - - nvkernel_main_F1L20_1 [157]
523.87ms 2.4320us - - - - - 40B 15.685MB/s Device Pinned Tesla V100-SXM2 1 7 - - [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
My questions for this example are:
- Is there again really only one HtoD copy of the x array?
- Does OpenMP automatically figure out that the pointer x is associated with the x_dev pointer and is already present in GPU memory, via the OpenACC present table?
- Have I understood correctly that this is the proper usage and benefit of omp_target_associate_ptr, i.e. that it is used to associate another pointer with the same data already existing on the GPU? It also seems to me that this is not needed for the x array pointer. Am I right? (See the signature sketch after this list.)
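For reference, this is the signature of omp_target_associate_ptr as documented in the OpenMP specification; the trailing comment restates what I believe my call above does, using the names from my example:

// From the OpenMP spec (since 4.5): associates host_ptr with
// device_ptr + device_offset for 'size' bytes on device 'device_num';
// returns zero on success.
int omp_target_associate_ptr(const void *host_ptr,
                             const void *device_ptr,
                             size_t size,
                             size_t device_offset,
                             int device_num);

// In my example, omp_target_associate_ptr(y, x_dev, 5*sizeof(int),
// (N-5)*sizeof(int), 0) should therefore make y refer to 5 ints of the
// device copy of x, starting at an offset of N-5 (= 5) ints.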
Thank you very much in advance for your input. It would be very helpful for me to understand how the data transfers are taken care of by OpenACC and OpenMP before applying changes to existing code.