Hi,
I noticed a strange behavior while debugging a piece of code that uses OpenACC directives, so I reproduced it with a simple program that loads a buffer onto the GPU, increments each cell of the array in a kernel, and then copies the buffer back.
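For reference, here is a minimal sketch along the lines of my reproducer (the real incrmat.c is slightly different; the element type, line numbers and exact sizes here are just illustrative):

#include <stdio.h>
#include <stdlib.h>

/* built along the lines of: pgcc -acc -Minfo=accel -o incrmat incrmat.c */

#define NBYTES 67108864   /* total buffer size in bytes */

void compute(double *M, int n)
{
    int i;
    /* increment every cell of the buffer already present on the device */
    #pragma acc parallel loop present(M[0:n])
    for (i = 0; i < n; i++)
        M[i] += 1.0;
}

int main(void)
{
    int i;
    int n = NBYTES / sizeof(double);
    double *M = (double *) malloc(NBYTES);

    for (i = 0; i < n; i++)
        M[i] = 0.0;

    /* copyin at region entry, copyout at region exit
       (lines 75 and 84 in the real incrmat.c)        */
    #pragma acc data copy(M[0:n])
    {
        compute(M, n);
    }

    printf("M[0] = %f\n", M[0]);
    free(M);
    return 0;
}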
In particular, after enabling the verbose runtime output by setting the following environment variables:
export PGI_ACC_TIME=1
export PGI_ACC_NOTIFY=3
I noticed that for “large” buffers, multiple H2D and D2H data transfers are reported, each as large as the whole buffer, as if the whole buffer were copied in and out multiple times. Interestingly, the number of reported copies seems to be proportional to the buffer size.
For example, for a 67108864-byte buffer (which should be copied in once and copied out once), I get the following output:
upload CUDA data file=incrmat.c function=main line=75 device=0 variable=M bytes=67108864
upload CUDA data file=incrmat.c function=main line=75 device=0 variable=M bytes=67108864
upload CUDA data file=incrmat.c function=main line=75 device=0 variable=M bytes=67108864
upload CUDA data file=incrmat.c function=main line=75 device=0 variable=M bytes=67108864
launch CUDA kernel file=incrmat.c function=compute line=27 device=0 num_gangs=65536 num_workers=1 vector_length=128 grid=8x8192 block=128
download CUDA data file=incrmat.c function=main line=84 device=0 variable=M bytes=67108864
download CUDA data file=incrmat.c function=main line=84 device=0 variable=M bytes=67108864
download CUDA data file=incrmat.c function=main line=84 device=0 variable=M bytes=67108864
download CUDA data file=incrmat.c function=main line=84 device=0 variable=M bytes=67108864
download CUDA data file=incrmat.c function=main line=84 device=0 variable=M bytes=67108864
The multiple copy operations also seem to be confirmed by the PGI_ACC_TIME summary, which shows that the copyin and copyout were reached multiple times:
75: data region reached 1 time
75: data copyin reached 4 times
device time(us): total=11,263 max=2,829 min=2,809 avg=2,815
84: data copyout reached 5 times
device time(us): total=10,092 max=2,527 min=20 avg=2,018
On the other hand, after further investigation with the NVIDIA nvprof tool on the same executable, the multiple copy operations are confirmed, but this time the size of each operation is consistent with the buffer being split into chunks:
==72903== Profiling application: ./incrmat
==72903== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
322.74ms 2.7964ms - - - - - 16.777MB 5.9996GB/s Tesla K20s (0) 1 2 [CUDA memcpy HtoD]
330.12ms 2.7971ms - - - - - 16.777MB 5.9980GB/s Tesla K20s (0) 1 2 [CUDA memcpy HtoD]
337.47ms 2.7986ms - - - - - 16.777MB 5.9950GB/s Tesla K20s (0) 1 2 [CUDA memcpy HtoD]
344.80ms 2.8016ms - - - - - 16.777MB 5.9884GB/s Tesla K20s (0) 1 2 [CUDA memcpy HtoD]
348.31ms 879.84us ... Kernel ...
349.25ms 3.0400us - - - - - 256B 84.211MB/s Tesla K20s (0) 1 2 [CUDA memcpy DtoH]
349.29ms 2.5008ms - - - - - 16.777MB 6.7088GB/s Tesla K20s (0) 1 2 [CUDA memcpy DtoH]
356.35ms 2.5007ms - - - - - 16.777MB 6.7090GB/s Tesla K20s (0) 1 2 [CUDA memcpy DtoH]
363.42ms 2.5000ms - - - - - 16.777MB 6.7110GB/s Tesla K20s (0) 1 2 [CUDA memcpy DtoH]
370.52ms 2.5005ms - - - - - 16.777MB 6.7096GB/s Tesla K20s (0) 1 2 [CUDA memcpy DtoH]
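Indeed, the chunk sizes add up: on the upload side 4 × 16.777 MB = 67.108 MB, i.e. exactly the 67108864 bytes of the buffer, and the downloads show the same 16.777 MB chunks plus the small 256 B transfer.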
In conclusion, I am fairly confident that the buffer is not actually transferred multiple times but is simply split into multiple chunks; nevertheless, I still have some questions:
- Is the transfer size reported by the PGI verbose output wrong, or am I misinterpreting its meaning?
- Aren’t the “data copyin reached 4 times” and “data copyout reached 5 times” messages misleading? It was quite hard for me to understand what was actually happening.
- What is the 256 B D2H transfer that happens just before the buffer copyout?
Thanks in advance,
Enrico