Wrong OpenACC data copyin count reported by pgcc v14.3

Hi, All:

I am currently testing the PGI compiler on a very simple OpenACC program, and the profiler reports one strange number when using pgcc version 14.3. The profiler seems to print wrong information about the data regions: v14.3 gives a strange transfer count of "data copyin reached 8 times", while v13.7 gives the expected "data copyin reached 1 time". I wonder if anyone can help. Thanks a lot!

Feng

The code is from CUDACasts episode 3 (ep3-first-openacc-program).
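For reference, here is the core of laplace2d.c as a minimal self-contained sketch reconstructed from the episode (the constants, boundary setup, and exact formatting are my approximation, not the verbatim file; the line numbers in the profiles below refer to the original source, not to this sketch):

#include <math.h>
#include <stdio.h>
#include <string.h>

#define NN 4096
#define NM 4096

float A[NN][NM];
float Anew[NN][NM];

int main(void)
{
    const int   n = NN, m = NM;
    const int   iter_max = 1000;
    const float tol = 1.0e-5f;
    float error = 1.0f;
    int iter = 0;

    memset(A, 0, NN * NM * sizeof(float));
    for (int j = 0; j < n; j++)
        A[j][0] = 1.0f;                      /* boundary held at 1.0 */

    /* line ~50 in the profiles: one structured data region around the
       whole solver, so A should be copied in once and out once, with
       Anew created on the device only */
#pragma acc data copy(A), create(Anew)
    while (error > tol && iter < iter_max)
    {
        error = 0.0f;

        /* first compute region (~line 56): stencil plus max reduction */
#pragma acc kernels
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++)
            {
                Anew[j][i] = 0.25f * (A[j][i + 1] + A[j][i - 1]
                                    + A[j - 1][i] + A[j + 1][i]);
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }

        /* second compute region (~line 68): copy Anew back into A */
#pragma acc kernels
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++)
                A[j][i] = Anew[j][i];

        if (iter % 100 == 0)
            printf("%5d, %0.6f\n", iter, error);
        iter++;
    }

    return 0;
}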

Below is the runtime information using pgcc v13.7 and pgcc v14.3; the differences are in the data copyin/copyout counts reported at lines 50 and 82:

[fchen14@shelob001 ep3-first-openacc-program]$ pgcc --version

pgcc 13.7-0 64-bit target on x86-64 Linux -tp sandybridge
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2013, STMicroelectronics, Inc. All Rights Reserved.

[fchen14@shelob001 ep3-first-openacc-program]$ ./a.out
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 5.039815 s

Accelerator Kernel Timing data
/home/fchen14/cudacasts/ep3-first-openacc-program/laplace2d.c
main NVIDIA devicenum=0
time(us): 4,699,770
50: data region reached 1 time
50: data copyin reached 1 time
device time(us): total=22,360 max=22,360 min=22,360 avg=22,360
82: data copyout reached 1 time
device time(us): total=20,009 max=20,009 min=20,009 avg=20,009
56: compute region reached 1000 times
59: kernel launched 1000 times
grid: [32x4094] block: [128]
device time(us): total=2,919,339 max=3,012 min=2,916 avg=2,919
elapsed time(us): total=2,928,986 max=3,025 min=2,926 avg=2,928
59: reduction kernel launched 1000 times
grid: [1] block: [256]
device time(us): total=257,069 max=310 min=255 avg=257
elapsed time(us): total=267,105 max=319 min=264 avg=267
68: compute region reached 1000 times
71: kernel launched 1000 times
grid: [32x4094] block: [128]
device time(us): total=1,480,993 max=1,538 min=1,475 avg=1,480
elapsed time(us): total=1,490,797 max=1,549 min=1,485 avg=1,490

[fchen14@shelob001 ep3-first-openacc-program]$ pgcc --version

pgcc 14.3-0 64-bit target on x86-64 Linux -tp sandybridge
The Portland Group - PGI Compilers and Tools
Copyright (c) 2014, NVIDIA CORPORATION. All rights reserved.
[fchen14@shelob001 ep3-first-openacc-program]$ pgcc laplace2d.c -acc -Minfo=accel -ta=nvidia,time
main:
50, Generating copy(A[:][:])
Generating create(Anew[:][:])
56, Generating NVIDIA code
57, Loop is parallelizable
59, Loop is parallelizable
Accelerator kernel generated
57, #pragma acc loop gang /* blockIdx.y */
59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
63, Max reduction generated for error
68, Generating NVIDIA code
69, Loop is parallelizable
71, Loop is parallelizable
Accelerator kernel generated
69, #pragma acc loop gang /* blockIdx.y */
71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
[fchen14@shelob001 ep3-first-openacc-program]$ ./a.out
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 4.961397 s

Accelerator Kernel Timing data
/home/fchen14/cudacasts/ep3-first-openacc-program/laplace2d.c
main NVIDIA devicenum=0
time(us): 4,545,254
50: data region reached 1 time
50: data copyin reached 8 times
device time(us): total=22,318 max=2,802 min=2,781 avg=2,789
82: data copyout reached 9 times
device time(us): total=20,200 max=2,527 min=13 avg=2,244
56: compute region reached 1000 times
59: kernel launched 1000 times
grid: [32x4094] block: [128]
device time(us): total=2,758,733 max=2,871 min=2,754 avg=2,758
elapsed time(us): total=2,768,433 max=2,885 min=2,764 avg=2,768
59: reduction kernel launched 1000 times
grid: [1] block: [256]
device time(us): total=264,321 max=318 min=262 avg=264
elapsed time(us): total=274,609 max=328 min=272 avg=274
68: compute region reached 1000 times
71: kernel launched 1000 times
grid: [32x4094] block: [128]
device time(us): total=1,479,682 max=1,531 min=1,474 avg=1,479
elapsed time(us): total=1,489,888 max=1,542 min=1,484 avg=1,489

Node hardware information:

[fchen14@shelob001 ~]$ nvidia-smi
Fri May 16 10:13:35 2014
+------------------------------------------------------+
| NVIDIA-SMI 5.319.72   Driver Version: 319.72         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         Off  | 0000:2A:00.0     Off |                    0 |
| N/A   26C    P8    30W / 235W |      14MB /  5759MB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         Off  | 0000:90:00.0     Off |                    0 |
| N/A   26C    P8    28W / 235W |      12MB /  5759MB  |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

Hi Feng Chen,

It's because the data is being broken up into multiple transfers in order to better utilize the pinned memory buffer. You can increase the buffer size via the environment variable "PGI_ACC_BUFFERSIZE" if you wish to decrease the number of transfers.
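For example, if A is the 4096 x 4096 float array from the episode, it is 64 MB, so eight copyin transfers suggest a default buffer of roughly 8 MB. Setting the buffer at least as large as the array should collapse the copyin back to a single transfer. The value below is only illustrative, and I am assuming the variable takes a plain byte count:

export PGI_ACC_BUFFERSIZE=67108864   # 64 MB (hypothetical value, assumed to be in bytes)
./a.out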

Hope this explains it,
Mat

Thank you, Mat. I came across another post with the same issue ("Number data copyin and copyout unexpected") and searched the PGI Getting Started guide:

http://www.pgroup.com/doc/openACC_gs.pdf

There is only one line about this variable, on page 23: "PGI_ACC_BUFFERSIZE: For NVIDIA CUDA devices, this defines the size of the pinned buffer used to transfer data between host and device."

So what is the rule of thumb for setting this value? For example, if I make the buffer larger so that the copyin/copyout count drops to one, does that actually increase or decrease data transfer performance? Is there an optimal value for this variable based on my hardware parameters, e.g. GPU memory?

Thank you,

Feng


Hi Feng,

I haven't done any performance studies on the optimal buffer size. Even if I had, the result would most likely be specific to a particular hardware and application combination, so it may not be of use to you. You're welcome to try various settings and see what impact they have, though. One thing to keep in mind is that there are actually two buffers: while one is being transferred to the device (or host), the second is being filled from (or drained to) virtual memory. Hence, the ideal size balances the transfer time against the virtual-memory copy time.
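Conceptually, the copyin path is the classic double-buffered staging loop, something like the sketch below (an illustration only, not the actual PGI runtime code; the function name, buffer count, and chunking are made up for the example):

#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative double-buffered host-to-device copy: while one chunk is
   in flight to the device from one pinned buffer, the CPU stages the
   next chunk into the other pinned buffer. */
static void staged_copyin(void *dev, const char *host, size_t bytes, size_t bufsize)
{
    void *pin[2];
    cudaStream_t stream;
    cudaHostAlloc(&pin[0], bufsize, cudaHostAllocDefault);
    cudaHostAlloc(&pin[1], bufsize, cudaHostAllocDefault);
    cudaStreamCreate(&stream);

    size_t off = 0;
    int buf = 0;
    while (off < bytes) {
        size_t chunk = (bytes - off < bufsize) ? bytes - off : bufsize;
        /* CPU copy into the idle pinned buffer; this overlaps with the
           transfer of the previous chunk still running on the stream */
        memcpy(pin[buf], host + off, chunk);
        /* wait for the previous transfer before issuing the next one */
        cudaStreamSynchronize(stream);
        cudaMemcpyAsync((char *)dev + off, pin[buf], chunk,
                        cudaMemcpyHostToDevice, stream);
        off += chunk;
        buf ^= 1;                          /* swap pinned buffers */
    }
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(pin[0]);
    cudaFreeHost(pin[1]);
}

int main(void)
{
    size_t n = 64UL << 20;                 /* 64 MB pageable host array */
    char *host = (char *)malloc(n);
    void *dev;
    memset(host, 1, n);
    cudaMalloc(&dev, n);
    staged_copyin(dev, host, n, 8UL << 20); /* two 8 MB pinned buffers */
    cudaFree(dev);
    free(host);
    return 0;
}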

Mat

Thank you Mat, this solved my problem.