Wrong OpenACC data copyin count reported by pgcc v14.3

Hi, All:

I am currently testing the PGI compiler on a very simple OpenACC program, and the profiler reports one strange number when using pgcc version 14.3. The profiler seems to print wrong information about the data regions: v14.3 gives a strange transfer count of "data copyin reached 8 times", while v13.7 gives the expected "data copyin reached 1 time". I wonder if anyone can help. Thanks a lot!

Feng

The code is from CUDACasts episode 3 (ep3-first-openacc-program).
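For reference, here is the core of laplace2d.c as a minimal self-contained sketch reconstructed from the episode (the constants, boundary setup, and exact formatting are my approximation, not the verbatim file; the line numbers in the profiles below refer to the original source, not to this sketch):

#include <math.h>
#include <stdio.h>
#include <string.h>

#define NN 4096
#define NM 4096

float A[NN][NM];
float Anew[NN][NM];

int main(void)
{
    const int   n = NN, m = NM;
    const int   iter_max = 1000;
    const float tol = 1.0e-5f;
    float error = 1.0f;
    int iter = 0;

    memset(A, 0, NN * NM * sizeof(float));
    for (int j = 0; j < n; j++)
        A[j][0] = 1.0f;                      /* boundary held at 1.0 */

    /* line ~50 in the profiles: one structured data region around the
       whole solver, so A should be copied in once and out once, with
       Anew created on the device only */
#pragma acc data copy(A), create(Anew)
    while (error > tol && iter < iter_max)
    {
        error = 0.0f;

        /* first compute region (~line 56): stencil plus max reduction */
#pragma acc kernels
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++)
            {
                Anew[j][i] = 0.25f * (A[j][i + 1] + A[j][i - 1]
                                    + A[j - 1][i] + A[j + 1][i]);
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }

        /* second compute region (~line 68): copy Anew back into A */
#pragma acc kernels
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++)
                A[j][i] = Anew[j][i];

        if (iter % 100 == 0)
            printf("%5d, %0.6f\n", iter, error);
        iter++;
    }

    return 0;
}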

Below is the runtime information using pgcc v13.7 and pgcc v14.3; the differences are in the data copyin/copyout counts reported at lines 50 and 82:

[fchen14@shelob001 ep3-first-openacc-program]$ pgcc --version

pgcc 13.7-0 64-bit target on x86-64 Linux -tp sandybridge
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2013, STMicroelectronics, Inc. All Rights Reserved.

[fchen14@shelob001 ep3-first-openacc-program]$ ./a.out
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 5.039815 s

Accelerator Kernel Timing data
/home/fchen14/cudacasts/ep3-first-openacc-program/laplace2d.c
main NVIDIA devicenum=0
time(us): 4,699,770
50: data region reached 1 time
50: data copyin reached 1 time
device time(us): total=22,360 max=22,360 min=22,360 avg=22,360
82: data copyout reached 1 time
device time(us): total=20,009 max=20,009 min=20,009 avg=20,009
56: compute region reached 1000 times
59: kernel launched 1000 times
grid: [32x4094] block: [128]
device time(us): total=2,919,339 max=3,012 min=2,916 avg=2,919
elapsed time(us): total=2,928,986 max=3,025 min=2,926 avg=2,928
59: reduction kernel launched 1000 times
grid: [1] block: [256]
device time(us): total=257,069 max=310 min=255 avg=257
elapsed time(us): total=267,105 max=319 min=264 avg=267
68: compute region reached 1000 times
71: kernel launched 1000 times
grid: [32x4094] block: [128]
device time(us): total=1,480,993 max=1,538 min=1,475 avg=1,480
elapsed time(us): total=1,490,797 max=1,549 min=1,485 avg=1,490

[fchen14@shelob001 ep3-first-openacc-program]$ pgcc --version

pgcc 14.3-0 64-bit target on x86-64 Linux -tp sandybridge
The Portland Group - PGI Compilers and Tools
Copyright (c) 2014, NVIDIA CORPORATION. All rights reserved.
[fchen14@shelob001 ep3-first-openacc-program]$ pgcc laplace2d.c -acc -Minfo=accel -ta=nvidia,time
main:
50, Generating copy(A[:][:])
Generating create(Anew[:][:])
56, Generating NVIDIA code
57, Loop is parallelizable
59, Loop is parallelizable
Accelerator kernel generated
57, #pragma acc loop gang /* blockIdx.y */
59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
63, Max reduction generated for error
68, Generating NVIDIA code
69, Loop is parallelizable
71, Loop is parallelizable
Accelerator kernel generated
69, #pragma acc loop gang /* blockIdx.y */
71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
[fchen14@shelob001 ep3-first-openacc-program]$ ./a.out
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 4.961397 s

Accelerator Kernel Timing data
/home/fchen14/cudacasts/ep3-first-openacc-program/laplace2d.c
main NVIDIA devicenum=0
time(us): 4,545,254
50: data region reached 1 time
50: data copyin reached 8 times
device time(us): total=22,318 max=2,802 min=2,781 avg=2,789
82: data copyout reached 9 times
device time(us): total=20,200 max=2,527 min=13 avg=2,244
56: compute region reached 1000 times
59: kernel launched 1000 times
grid: [32x4094] block: [128]
device time(us): total=2,758,733 max=2,871 min=2,754 avg=2,758
elapsed time(us): total=2,768,433 max=2,885 min=2,764 avg=2,768
59: reduction kernel launched 1000 times
grid: [1] block: [256]
device time(us): total=264,321 max=318 min=262 avg=264
elapsed time(us): total=274,609 max=328 min=272 avg=274
68: compute region reached 1000 times
71: kernel launched 1000 times
grid: [32x4094] block: [128]
device time(us): total=1,479,682 max=1,531 min=1,474 avg=1,479
elapsed time(us): total=1,489,888 max=1,542 min=1,484 avg=1,489

Node hardware information:

[fchen14@shelob001 ~]$ nvidia-smi
Fri May 16 10:13:35 2014
+------------------------------------------------------+
| NVIDIA-SMI 5.319.72   Driver Version: 319.72         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         Off  | 0000:2A:00.0     Off |                    0 |
| N/A   26C    P8    30W / 235W |      14MB /  5759MB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         Off  | 0000:90:00.0     Off |                    0 |
| N/A   26C    P8    28W / 235W |      12MB /  5759MB  |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

Hi Feng Chen,

It's because the data is being broken up into multiple transfers in order to better utilize the pinned memory buffer. You can increase the buffer size via the environment variable "PGI_ACC_BUFFERSIZE" if you wish to decrease the number of transfers.
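For example, if A is the 4096 x 4096 float array from the episode, it is 64 MB, so eight copyin transfers suggest a default buffer of roughly 8 MB. Setting the buffer at least as large as the array should collapse the copyin back to a single transfer. The value below is only illustrative, and I am assuming the variable takes a plain byte count:

export PGI_ACC_BUFFERSIZE=67108864   # 64 MB (hypothetical value, assumed to be in bytes)
./a.out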

Hope this explains it,
Mat

Thank you, Mat. I came across another post with the same issue ("Number data copyin and copyout unexpected") and searched the PGI Getting Started guide:

http://www.pgroup.com/doc/openACC_gs.pdf

There is only one line about this variable, on page 23: "PGI_ACC_BUFFERSIZE: For NVIDIA CUDA devices, this defines the size of the pinned buffer used to transfer data between host and device."

So what is the rule of thumb for setting this value? For example, if I make the buffer larger so that the copyin/copyout count drops to one, does that actually increase or decrease data transfer performance? Is there an optimal value for this variable based on my hardware parameters, e.g. GPU memory?

Thank you,

Feng


Hi Feng,

I haven't done any performance studies on the optimal buffer size. Even if I had, the result would most likely be specific to a particular hardware and application combination, so it may not be of use to you. You're welcome to try various settings and see what impact they have, though. One thing to keep in mind is that there are actually two buffers: while one is being transferred to the device (or host), the second is being filled from (or drained to) virtual memory. Hence, the ideal size balances the transfer time against the virtual-memory copy time.
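Conceptually, the copyin path is the classic double-buffered staging loop, something like the sketch below (an illustration only, not the actual PGI runtime code; the function name, buffer count, and chunking are made up for the example):

#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative double-buffered host-to-device copy: while one chunk is
   in flight to the device from one pinned buffer, the CPU stages the
   next chunk into the other pinned buffer. */
static void staged_copyin(void *dev, const char *host, size_t bytes, size_t bufsize)
{
    void *pin[2];
    cudaStream_t stream;
    cudaHostAlloc(&pin[0], bufsize, cudaHostAllocDefault);
    cudaHostAlloc(&pin[1], bufsize, cudaHostAllocDefault);
    cudaStreamCreate(&stream);

    size_t off = 0;
    int buf = 0;
    while (off < bytes) {
        size_t chunk = (bytes - off < bufsize) ? bytes - off : bufsize;
        /* CPU copy into the idle pinned buffer; this overlaps with the
           transfer of the previous chunk still running on the stream */
        memcpy(pin[buf], host + off, chunk);
        /* wait for the previous transfer before issuing the next one */
        cudaStreamSynchronize(stream);
        cudaMemcpyAsync((char *)dev + off, pin[buf], chunk,
                        cudaMemcpyHostToDevice, stream);
        off += chunk;
        buf ^= 1;                          /* swap pinned buffers */
    }
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(pin[0]);
    cudaFreeHost(pin[1]);
}

int main(void)
{
    size_t n = 64UL << 20;                 /* 64 MB pageable host array */
    char *host = (char *)malloc(n);
    void *dev;
    memset(host, 1, n);
    cudaMalloc(&dev, n);
    staged_copyin(dev, host, n, 8UL << 20); /* two 8 MB pinned buffers */
    cudaFree(dev);
    free(host);
    return 0;
}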

Mat

Thank you Mat, this solved my problem.