Efficiency of data copyin transfers

Hi all,
With this OpenACC code

#define NUM_T 73728
#define NUM_H 64

    char *in = calloc(NUM_T * NUM_H * 144, sizeof(char));

    #pragma acc enter data copyin(in[0:NUM_T * NUM_H * 144])
    #pragma acc parallel loop independent vector_length(NUM_H) present(in)
    for (unsigned int t = 0; t < NUM_T; t++)
    {
        #pragma acc loop independent
        for (unsigned int h = 0; h < NUM_H; h++)
        {
            // Do some stuff ...
        }
    }

I got this profiler output:

Accelerator Kernel Timing data
(unknown)
(unknown) NVIDIA devicenum=0
time(us): 18
0: upload reached 1 time
0: data copyin transfers: 1
device time(us): total=18 max=18 min=18 avg=18
D:\Developpement\OpenACC\TestACC\main.c
main NVIDIA devicenum=0
time(us): 489,931
76: data region reached 1 time
76: data copyin transfers: 41
device time(us): total=489,931 max=12,114 min=6,060 avg=11,949
77: compute region reached 1 time
77: kernel launched 1 time
grid: [65535] block: [64]
elapsed time(us): total=23,000 max=23,000 min=23,000 avg=23,000
77: data region reached 2 times

Why 41 data copyin transfers, and not just one?
Thanks for any help with this.

Hi LeMoussel,

DMA transfers (which is what's used to transfer data between the device and the host) must go through the CPU's pinned physical memory. To make this more efficient, the compiler uses a double-buffering scheme: data is copied from the CPU's virtual memory into a pinned buffer, and that buffer is transferred asynchronously while the second buffer is being filled from virtual memory.
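
To make the pattern concrete, here is a rough, hand-written sketch of that kind of staged upload using the CUDA runtime API. This is only an illustration of the idea, not the actual code the PGI runtime uses; the real implementation's buffer count, sizes, and synchronization details will differ.

#include <cuda_runtime.h>
#include <string.h>

/* Upload 'total' bytes from pageable host memory to device memory by
   staging through two pinned buffers of 'bufsize' bytes each. */
void staged_upload(void *dev, const char *host, size_t total, size_t bufsize)
{
    char *pin[2];
    cudaEvent_t done[2];
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    for (int i = 0; i < 2; i++) {
        cudaHostAlloc((void **)&pin[i], bufsize, cudaHostAllocDefault); /* pinned */
        cudaEventCreate(&done[i]);
    }

    int b = 0;
    for (size_t off = 0; off < total; off += bufsize, b ^= 1) {
        size_t n = (total - off < bufsize) ? total - off : bufsize;
        cudaEventSynchronize(done[b]);   /* wait until this buffer's last DMA finished */
        memcpy(pin[b], host + off, n);   /* stage pageable -> pinned on the CPU        */
        cudaMemcpyAsync((char *)dev + off, pin[b], n,
                        cudaMemcpyHostToDevice, stream); /* DMA overlaps the next memcpy */
        cudaEventRecord(done[b], stream);
    }
    cudaStreamSynchronize(stream);

    for (int i = 0; i < 2; i++) {
        cudaFreeHost(pin[i]);
        cudaEventDestroy(done[i]);
    }
    cudaStreamDestroy(stream);
}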

The 41 copyin transfers you see correspond to these buffer transfers, not to 41 separate copies of the array.
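
To put numbers on it: the array is 73728 * 64 * 144 = 679,477,248 bytes (648 MiB). If the pinned staging buffer is 16 MiB (an assumption on my part; I haven't checked the default on your installation), then 679,477,248 / 16,777,216 = 40.5, which rounds up to 41 buffered transfers, matching your profile.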

You can change the buffer size via the environment variable “PGI_ACC_BUFFERSIZE”. See: OpenACC Getting Started :: PGI version 18.7 Documentation for x86 and NVIDIA Processors

Alternatively, you can add the “-ta=tesla:pinned” flag to have the compiler attempt to allocate the array in pinned memory, thus eliminating the need for the staging buffers.
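
For example, assuming you invoke pgcc directly (adjust to however your project is actually built):

    pgcc -acc -ta=tesla:pinned -Minfo=accel main.c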

Hope this helps,
Mat