I have a strange phenomenon: a process that uses only asynchronous data transfers and kernel launches runs much faster when the environment variable PGI_ACC_TIME is set to 1. Since this gathers timing information on every kernel call, I'd assume there would be some overhead and it would run a bit slower; however, it consistently runs about 40% faster. So my question is: what changes when this variable is set that could possibly lead to this performance gain?
That is odd and counter-intuitive. Our runtime profiler does introduce a bit of overhead, so I'd expect a very slight slowdown, but it shouldn't speed things up.
The only thing I can think of is that ACC_TIME may be forcing synchronization. Can you try setting the environment variable NVCOMPILER_ACC_SYNCHRONOUS=1 (without ACC_TIME) so that "async" is disabled? (Note that the "PGI" prefix still works but is deprecated; we also accept the abbreviated "NV" prefix.)
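For example, the two configurations can be compared like this (the binary name ./myapp is hypothetical):

```shell
# Run with async queues disabled (forces synchronous execution, no profiler)
export NVCOMPILER_ACC_SYNCHRONOUS=1
./myapp

# For comparison: async enabled, runtime profiler on
unset NVCOMPILER_ACC_SYNCHRONOUS
export PGI_ACC_TIME=1   # deprecated "PGI" prefix; NVCOMPILER_ACC_TIME also works
./myapp
```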
If that's not it, I'll need to ask engineering for ideas, though it would help if you could do some profiling to understand where the speed-up is coming from (all kernels? a particular kernel? data transfer? something else?).
If it is async, then perhaps you're using too many CUDA streams (OpenACC async queues)? Creating streams has a high overhead, so it can cause slowdowns if too many are used (best to use no more than 4 and re-use them).
When I set NVCOMPILER_ACC_SYNCHRONOUS to 1, it runs a bit slower with PGI_ACC_TIME set to 1 than without. In this program I am using 8 streams, so this could be the reason it is faster with the profiler turned on.
I will try to reduce the number of streams. My follow-up question: is this overhead of using CUDA streams due to API cost or due to hardware limitations? Meaning, will this still be an issue when using a GPU with NVLink?
I don't know for sure, but I'd speculate that using NVLink may help with the transfer time and thus offset more of the overhead, though it wouldn't affect the amount of start-up overhead.
I have profiled both cases with Nsight Systems (nsys), and it looks like when PGI_ACC_TIME is set to 1, many kernels execute in parallel (up to 5), while when it is set to 0 there is almost no concurrent kernel execution on the GPU. Do you have any idea why this is happening? For me, the concurrent execution would be preferable.
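For reference, I collected the two profiles along these lines (the binary name ./myapp is a placeholder, and the stats report key may differ between nsys versions):

```shell
# Profile each configuration into a separate report file
PGI_ACC_TIME=0 nsys profile -o run_noacctime ./myapp
PGI_ACC_TIME=1 nsys profile -o run_acctime   ./myapp

# Summarize GPU kernel activity; compare overlap between the two runs
nsys stats --report cuda_gpu_kern_sum run_noacctime.nsys-rep
nsys stats --report cuda_gpu_kern_sum run_acctime.nsys-rep
```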
I talked with Michael Wolfe and he's just as puzzled as I am. At this point we'd need a reproducing example, or maybe you could send me the profiles in case something jumps out?