We have implemented an inference model in plain CUDA using the usual set of features: cuBLAS, custom kernels, and streams. The entire workflow is set up asynchronously and runs in about 130 µs (including copies to and from the device) on a Titan X (Pascal). We use CMake 3.18 to generate build files. I'm on Windows 10 1909 with VS 2019 16.7, running in TCC mode.
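For context, the pipeline looks roughly like this (a simplified sketch with hypothetical kernel and buffer names, not our actual code):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Placeholder for our custom post-processing kernels.
__global__ void activation(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);
}

void infer(cublasHandle_t handle, cudaStream_t stream,
           const float* h_in, float* h_out,
           float* d_in, float* d_W, float* d_out,
           int m, int n) {
    const float alpha = 1.0f, beta = 0.0f;

    // Everything is enqueued on one stream; the host does not block
    // until the final synchronize.
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cublasSetStream(handle, stream);
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha,
                d_W, m, d_in, 1, &beta, d_out, 1);
    activation<<<(m + 255) / 256, 256, 0, stream>>>(d_out, m);
    cudaMemcpyAsync(h_out, d_out, m * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
}
```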
I recently tried to upgrade to CUDA 11, but we experienced a performance degradation of about 10–20 µs.
Downgrading to CUDA 10.1 removes the issue, even when running the same driver (452.06).
Has anyone else experienced this? Are there compile flags I should change that are known to cause issues in CUDA 11? Or is this expected given my older card?
Thanks in advance.
Performance fluctuations of this magnitude between CUDA versions are fairly common, and typically result from a mix of compiler code generation, library, and driver changes. In my observation, code generation for mature platforms usually changes little or not at all, and this should certainly apply to the Pascal architecture at this point. So the issue is probably not with the compiler.
As a rule of thumb, at application level, a 2% performance difference is considered measurement noise, while a 5% regression is the lowest bound for an actionable but low-priority enhancement request or performance bug. As your performance regression is above the cut-off limit, it might be worthwhile to file a bug with NVIDIA. To prepare for that, you would want to do some profiling and try some code simplifications to narrow down the specific source of the slowdown and come up with a reproducer that is as small as possible.
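As a first step in narrowing it down, you could compare per-kernel and memcpy timings between the two toolkits with Nsight Systems; something along these lines (binary name is a placeholder):

```shell
# Build once with CUDA 10.1 and once with CUDA 11, then compare
# the per-kernel and memcpy timing summaries side by side.
nsys profile --stats=true -o cuda10_run ./inference_app
nsys profile --stats=true -o cuda11_run ./inference_app
```

If one specific kernel or the launch overhead dominates the delta, that makes for a much smaller reproducer.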
Historical precedent indicates that low-priority performance issues might take a long time to get addressed (think a year or so; possibly never).
Thanks for the reply njuffa.
Recreating the exact cause of the slowdown is out of scope for us, especially if it's known that we might not even get a fix. I would like to hear whether other projects have experienced the same slowdown.
As to the cause: since the driver didn't affect the performance, and the only library we use is cuBLAS (which is under heavy scrutiny for performance), I would guess the bug is in the compiler-generated code.
Could you try adding -extra-device-vectorization to the NVCC command with CUDA 11?
Thanks for reaching out mnicely.
I already compile with -O3, but I just tried explicitly adding the -extra-device-vectorization flag and ran our benchmarks again. The result is about 2–6 µs faster, but still a performance regression of about 8–10%.
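For reference, we pass the flag through CMake along these lines (simplified; the target name is made up):

```cmake
# CMake 3.18+: apply the flags to CUDA sources only.
target_compile_options(inference_model PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:-O3 -extra-device-vectorization>)
```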
In the meantime I have also tried out CUDA 11.1 with the same results.
One thing that helped was using CUDA graphs for our model. With graphs, the difference between CUDA 11.1 and CUDA 10.2 executions is much smaller (~1 µs, so probably noise).
This leads me to believe it's a performance degradation in the CPU-side code generation, but I'm not completely familiar with the internals of CUDA graphs.
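For anyone curious, capturing the existing stream work into a graph is roughly this (simplified; enqueue_inference is a hypothetical stand-in for our memcpys, cuBLAS call, and kernel launches):

```cuda
// Capture the already-working stream sequence into a graph once...
cudaGraph_t graph;
cudaGraphExec_t graphExec;

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
enqueue_inference(stream);  // hypothetical: the existing async launch sequence
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// ...then replay it per request with a single launch, which removes
// most of the per-launch CPU overhead.
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```

Since a graph replaces many individual launch calls with one cudaGraphLaunch, the fact that it hides the regression is what points me at host-side launch overhead rather than the kernels themselves.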