Efficiency of data copyin transfers

Hi all,
With this OpenACC code

#define NUM_T 73728
#define NUM_H 64

    char *in = calloc(NUM_T * NUM_H * 144, sizeof(char));

    #pragma acc enter data copyin(in[0:NUM_T * NUM_H * 144])
    #pragma acc parallel loop independent vector_length(NUM_H) present(in)
    for (unsigned int t = 0; t < NUM_T; t++)
    {
        #pragma acc loop independent
        for (unsigned int h = 0; h < NUM_H; h++)
        {
            // Do some stuff ...
        }
    }

I got this profiler output:

Accelerator Kernel Timing data
(unknown)
(unknown) NVIDIA devicenum=0
time(us): 18
0: upload reached 1 time
0: data copyin transfers: 1
device time(us): total=18 max=18 min=18 avg=18
D:\Developpement\OpenACC\TestACC\main.c
main NVIDIA devicenum=0
time(us): 489,931
76: data region reached 1 time
76: data copyin transfers: 41
device time(us): total=489,931 max=12,114 min=6,060 avg=11,949
77: compute region reached 1 time
77: kernel launched 1 time
grid: [65535] block: [64]
elapsed time(us): total=23,000 max=23,000 min=23,000 avg=23,000
77: data region reached 2 times

Why 41 data copyin transfers, and not just one?
Thanks for any help with this.

Hi LeMoussel,

DMA transfers (which is what's used to transfer data between the device and the host) must go through the CPU's pinned physical memory. To make this more efficient, the compiler uses a double-buffering scheme: data is copied from the CPU's virtual memory into a pinned buffer, and that buffer is transferred asynchronously while the second buffer is being filled from virtual memory.
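
To make the pattern concrete, here is a rough, hand-written sketch of that kind of staged upload using the CUDA runtime API. This is only an illustration of the idea, not the actual code the PGI runtime uses; the real implementation's buffer count, sizes, and synchronization details will differ.

#include <cuda_runtime.h>
#include <string.h>

/* Upload 'total' bytes from pageable host memory to device memory by
   staging through two pinned buffers of 'bufsize' bytes each. */
void staged_upload(void *dev, const char *host, size_t total, size_t bufsize)
{
    char *pin[2];
    cudaEvent_t done[2];
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    for (int i = 0; i < 2; i++) {
        cudaHostAlloc((void **)&pin[i], bufsize, cudaHostAllocDefault); /* pinned */
        cudaEventCreate(&done[i]);
    }

    int b = 0;
    for (size_t off = 0; off < total; off += bufsize, b ^= 1) {
        size_t n = (total - off < bufsize) ? total - off : bufsize;
        cudaEventSynchronize(done[b]);   /* wait until this buffer's last DMA finished */
        memcpy(pin[b], host + off, n);   /* stage pageable -> pinned on the CPU        */
        cudaMemcpyAsync((char *)dev + off, pin[b], n,
                        cudaMemcpyHostToDevice, stream); /* DMA overlaps the next memcpy */
        cudaEventRecord(done[b], stream);
    }
    cudaStreamSynchronize(stream);

    for (int i = 0; i < 2; i++) {
        cudaFreeHost(pin[i]);
        cudaEventDestroy(done[i]);
    }
    cudaStreamDestroy(stream);
}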

The 41 copyin transfers you see correspond to these buffer transfers, not to 41 separate copies of the array.
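
To put numbers on it: the array is 73728 * 64 * 144 = 679,477,248 bytes (648 MiB). If the pinned staging buffer is 16 MiB (an assumption on my part; I haven't checked the default on your installation), then 679,477,248 / 16,777,216 = 40.5, which rounds up to 41 buffered transfers, matching your profile.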

You can change the buffer size via the environment variable “PGI_ACC_BUFFERSIZE”. See: OpenACC Getting Started :: PGI version 18.7 Documentation for x86 and NVIDIA Processors

Alternatively, you can add the “-ta=tesla:pinned” flag to have the compiler attempt to allocate the array in pinned memory, thus eliminating the need for the staging buffers.
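
For example, assuming you invoke pgcc directly (adjust to however your project is actually built):

    pgcc -acc -ta=tesla:pinned -Minfo=accel main.c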

Hope this helps,
Mat