I have a strange phenomenon: a process that uses only asynchronous data transfers and kernel launches runs much faster when the environment variable PGI_ACC_TIME is set to 1. Since this gathers timing information on every kernel call, I'd assume there would be some overhead and it would run a bit slower; however, it consistently runs about 40% faster. So my question is: what changes when this variable is set that could possibly lead to this performance gain?
That is odd and counter-intuitive. Our runtime profiler does introduce a bit of overhead, so I'd expect a very slight slowdown, but it shouldn't speed things up.
The only thing I can think of is that ACC_TIME may be forcing synchronization. Can you try setting the environment variable NVCOMPILER_ACC_SYNCHRONOUS=1 (without ACC_TIME) so that "async" is disabled? (Note that the "PGI" prefix still works but is deprecated; we also accept the abbreviated "NV" prefix.)
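For example, the two configurations can be compared like this (the binary name ./myapp is hypothetical):

```shell
# Run with async queues disabled (forces synchronous execution, no profiler)
export NVCOMPILER_ACC_SYNCHRONOUS=1
./myapp

# For comparison: async enabled, runtime profiler on
unset NVCOMPILER_ACC_SYNCHRONOUS
export PGI_ACC_TIME=1   # deprecated "PGI" prefix; NVCOMPILER_ACC_TIME also works
./myapp
```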
If that's not it, I'll need to ask engineering for ideas, though it would help if you could do some profiling to understand where the speed-up is coming from (all kernels? a particular kernel? data transfer? something else?).
If it is async, then perhaps you're using too many CUDA streams (OpenACC async queues)? Creating streams has a high overhead, so it can cause slowdowns if too many are used (best to use no more than 4 and re-use them).
When I set NVCOMPILER_ACC_SYNCHRONOUS to 1, it runs a bit slower with PGI_ACC_TIME set to 1 than without. In this program I am using 8 streams, so this could be the reason it is faster with the profiler turned on.
I will try to reduce the number of streams. My follow-up question: is this overhead of using CUDA streams due to API cost or due to hardware limitations? Meaning, will this still be an issue when using a GPU with NVLink?
I don't know for sure, but I'd speculate that using NVLink may help with the transfer time and thus offset more of the overhead, though it wouldn't affect the amount of start-up overhead.
I have profiled both cases with Nsight Systems (nsys), and it looks like when PGI_ACC_TIME is set to 1, many kernels execute in parallel (up to 5), while when it is set to 0 there is almost no concurrent kernel execution on the GPU. Do you have any idea why this is happening? For me, the concurrent execution would be preferable.
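For reference, I collected the two profiles along these lines (the binary name ./myapp is a placeholder, and the stats report key may differ between nsys versions):

```shell
# Profile each configuration into a separate report file
PGI_ACC_TIME=0 nsys profile -o run_noacctime ./myapp
PGI_ACC_TIME=1 nsys profile -o run_acctime   ./myapp

# Summarize GPU kernel activity; compare overlap between the two runs
nsys stats --report cuda_gpu_kern_sum run_noacctime.nsys-rep
nsys stats --report cuda_gpu_kern_sum run_acctime.nsys-rep
```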
I talked with Michael Wolfe and he's just as puzzled as I am. At this point we'd need a reproducing example, or maybe you could send me the profiles in case something jumps out?