I'm seeing a strange phenomenon: a process that uses only asynchronous data transfers and kernel launches runs much faster when the environment variable PGI_ACC_TIME is set to 1. Since this gathers timing information on all kernel calls, I'd assume there would be some overhead and it would run a bit slower; however, it consistently runs faster, by about 40%. So my question is: what changes when this variable is set that could possibly lead to this performance gain?
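For reference, the code follows roughly this pattern (a minimal sketch in C with OpenACC, not my actual code; the array name, sizes, and constants are placeholders):

```c
#include <stdlib.h>

#define N       (1 << 22)   /* placeholder problem size */
#define NQUEUES 8           /* several async queues in flight */
#define CHUNK   (N / NQUEUES)

int main(void) {
    float *a = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) a[i] = 1.0f;

    /* each chunk gets its own async queue so its transfer and kernel
       can overlap with the others; nothing blocks until the wait */
    for (int q = 0; q < NQUEUES; q++) {
        int lo = q * CHUNK;
        #pragma acc parallel loop async(q) copy(a[lo:CHUNK])
        for (int i = lo; i < lo + CHUNK; i++)
            a[i] = 2.0f * a[i] + 1.0f;
    }
    #pragma acc wait   /* drain all async queues */

    free(a);
    return 0;
}
```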
That is odd and counter-intuitive. Our runtime profiler does introduce a bit of overhead, so I'd expect a very slight slowdown, but it shouldn't speed things up.
The only thing I can think of is that ACC_TIME may be forcing synchronization. Can you try using the environment variable NVCOMPILER_ACC_SYNCHRONOUS=1 (without ACC_TIME) so "async" is disabled? (Note the "PGI" prefix still works but is deprecated; we also accept the abbreviated "NV" prefix.)
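For example, assuming your executable is named ./myapp, running it as NVCOMPILER_ACC_SYNCHRONOUS=1 ./myapp from the shell should make the runtime treat every "async" clause as synchronous, with no recompile needed.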
If that's not it, I'll need to ask engineering for ideas. In the meantime, it would help if you could do some profiling to understand where the speed-up is coming from (all kernels? a particular kernel? data transfers? something else?).
If it is async, then perhaps you're using too many CUDA streams (OpenACC async queues)? Creating streams has a high overhead, so it can cause slowdowns if too many are used; it's best to use no more than 4 and reuse them, as in the sketch below.
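Here's a rough sketch of what I mean by reusing a small pool of queues (in C with OpenACC; all names and sizes are made up, not your code):

```c
#include <stdlib.h>

#define N       (1 << 22)
#define NBLOCKS 64          /* many logical pieces of work */
#define NQUEUES 4           /* small, fixed pool of async queues */
#define CHUNK   (N / NBLOCKS)

int main(void) {
    float *a = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) a[i] = 1.0f;

    #pragma acc data copy(a[0:N])
    {
        for (int b = 0; b < NBLOCKS; b++) {
            int q  = b % NQUEUES;   /* round-robin onto 4 queues,   */
            int lo = b * CHUNK;     /* not one new queue per block  */
            #pragma acc parallel loop async(q) present(a)
            for (int i = lo; i < lo + CHUNK; i++)
                a[i] += 1.0f;
        }
        #pragma acc wait
    }
    free(a);
    return 0;
}
```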
When I set NVCOMPILER_ACC_SYNCHRONOUS to 1, it runs a bit slower with PGI_ACC_TIME set to 1 than without. In this program I am using 8 streams, so this could be the reason it is faster with the profiler turned on.
I will try to reduce the number of streams. My follow-up question: is this overhead of using CUDA streams due to API cost or due to hardware limitations? Meaning, will this still be an issue when using a GPU with NVLink?
Thanks for the help and kind regards,
It's the start-up cost of creating a CUDA stream.

"Meaning, will this still be an issue when using a GPU with NVLink?"

I don't know for sure, but I'd speculate that using NVLink may help with the transfer time and thus offset more of the overhead; it wouldn't affect the amount of start-up overhead itself.
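If you want to see that cost in isolation, a micro-benchmark along these lines (a sketch only; the stream count and timing method are arbitrary, compile with nvcc) illustrates that it's a one-time, host-side API cost:

```c
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

#define NSTREAMS 8   /* arbitrary count for the measurement */

int main(void) {
    cudaStream_t s[NSTREAMS];
    struct timespec t0, t1;

    cudaFree(0);   /* warm-up: force CUDA context creation up front */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamCreate(&s[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("created %d streams in %.3f ms\n", NSTREAMS, ms);

    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamDestroy(s[i]);
    return 0;
}
```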
I have profiled both cases with nsys, and it looks like when PGI_ACC_TIME is set to 1, many kernels execute in parallel (up to 5 concurrently), while when it is set to 0 there is almost no concurrent kernel execution on the GPU. Do you have any idea why this is happening? For my use case, the concurrent execution would be preferable.
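For reference, I captured the profiles with an invocation along the lines of nsys profile --trace=cuda,openacc -o report ./myapp (the binary name is a placeholder) and compared the CUDA streams timelines of the two runs.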
No, sorry, I've never seen nor heard of anything like this, and it's counter-intuitive. Let me ask engineering for ideas.
I talked with Michael Wolfe and he's just as puzzled as I am. At this point we'd need a reproducing example, or maybe you could send me the profiles in case something jumps out?
I have switched from the NV compiler suite 20.7 to 21.9; with this newer compiler, the issue seems to be fixed.