Hi,
I noticed a strange behavior while debugging a piece of code that uses OpenACC directives, so I reproduced it with a simple program that loads a buffer onto the GPU, increments each cell of the array in a kernel, and then copies the buffer back.
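For reference, here is a minimal sketch along the lines of my reproducer (the real incrmat.c is slightly different; the element type, line numbers and exact sizes here are just illustrative):

#include <stdio.h>
#include <stdlib.h>

/* built along the lines of: pgcc -acc -Minfo=accel -o incrmat incrmat.c */

#define NBYTES 67108864   /* total buffer size in bytes */

void compute(double *M, int n)
{
    int i;
    /* increment every cell of the buffer already present on the device */
    #pragma acc parallel loop present(M[0:n])
    for (i = 0; i < n; i++)
        M[i] += 1.0;
}

int main(void)
{
    int i;
    int n = NBYTES / sizeof(double);
    double *M = (double *) malloc(NBYTES);

    for (i = 0; i < n; i++)
        M[i] = 0.0;

    /* copyin at region entry, copyout at region exit
       (lines 75 and 84 in the real incrmat.c)        */
    #pragma acc data copy(M[0:n])
    {
        compute(M, n);
    }

    printf("M[0] = %f\n", M[0]);
    free(M);
    return 0;
}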
In particular, after enabling the verbose runtime output by setting the following environment variables:
export PGI_ACC_TIME=1
export PGI_ACC_NOTIFY=3
I noticed that for “large” buffers, multiple H2D and D2H data transfers are reported, each as large as the whole buffer, as if the whole buffer were copied in and out multiple times. Interestingly, the number of reported copies seems to be proportional to the buffer size.
For example, for a 67108864-byte buffer (which should be copied in once and copied out once), I get the following output:
upload CUDA data file=incrmat.c function=main line=75 device=0 variable=M bytes=67108864
upload CUDA data file=incrmat.c function=main line=75 device=0 variable=M bytes=67108864
upload CUDA data file=incrmat.c function=main line=75 device=0 variable=M bytes=67108864
upload CUDA data file=incrmat.c function=main line=75 device=0 variable=M bytes=67108864
launch CUDA kernel file=incrmat.c function=compute line=27 device=0 num_gangs=65536 num_workers=1 vector_length=128 grid=8x8192 block=128
download CUDA data file=incrmat.c function=main line=84 device=0 variable=M bytes=67108864
download CUDA data file=incrmat.c function=main line=84 device=0 variable=M bytes=67108864
download CUDA data file=incrmat.c function=main line=84 device=0 variable=M bytes=67108864
download CUDA data file=incrmat.c function=main line=84 device=0 variable=M bytes=67108864
download CUDA data file=incrmat.c function=main line=84 device=0 variable=M bytes=67108864
The multiple copy operations also seem to be confirmed by the PGI_ACC_TIME summary, which shows that the copyin and copyout were reached multiple times:
75: data region reached 1 time
75: data copyin reached 4 times
device time(us): total=11,263 max=2,829 min=2,809 avg=2,815
84: data copyout reached 5 times
device time(us): total=10,092 max=2,527 min=20 avg=2,018
On the other hand, after further investigation with the NVIDIA nvprof tool on the same executable, the multiple copy operations are confirmed, but this time the size of each operation is consistent with the buffer being split into chunks:
==72903== Profiling application: ./incrmat
==72903== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
322.74ms 2.7964ms - - - - - 16.777MB 5.9996GB/s Tesla K20s (0) 1 2 [CUDA memcpy HtoD]
330.12ms 2.7971ms - - - - - 16.777MB 5.9980GB/s Tesla K20s (0) 1 2 [CUDA memcpy HtoD]
337.47ms 2.7986ms - - - - - 16.777MB 5.9950GB/s Tesla K20s (0) 1 2 [CUDA memcpy HtoD]
344.80ms 2.8016ms - - - - - 16.777MB 5.9884GB/s Tesla K20s (0) 1 2 [CUDA memcpy HtoD]
348.31ms 879.84us ... Kernel ...
349.25ms 3.0400us - - - - - 256B 84.211MB/s Tesla K20s (0) 1 2 [CUDA memcpy DtoH]
349.29ms 2.5008ms - - - - - 16.777MB 6.7088GB/s Tesla K20s (0) 1 2 [CUDA memcpy DtoH]
356.35ms 2.5007ms - - - - - 16.777MB 6.7090GB/s Tesla K20s (0) 1 2 [CUDA memcpy DtoH]
363.42ms 2.5000ms - - - - - 16.777MB 6.7110GB/s Tesla K20s (0) 1 2 [CUDA memcpy DtoH]
370.52ms 2.5005ms - - - - - 16.777MB 6.7096GB/s Tesla K20s (0) 1 2 [CUDA memcpy DtoH]
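Indeed, the chunk sizes add up: on the upload side 4 × 16.777 MB = 67.108 MB, i.e. exactly the 67108864 bytes of the buffer, and the downloads show the same 16.777 MB chunks plus the small 256 B transfer.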
In conclusion, I am fairly confident that the buffer is not actually transferred multiple times but is simply split into multiple chunks; nevertheless, I still have some questions:
- Is the transfer size reported by the PGI verbose output wrong, or am I misinterpreting its meaning?
- Aren’t the “data copyin reached 4 times” and “data copyout reached 5 times” messages misleading? It was quite hard for me to understand what was actually happening.
- What is the 256 B D2H transfer that happens just before the buffer copyout?
Thanks in advance,
Enrico