We have implemented an inference model in plain CUDA using the usual set of features: cuBLAS, custom kernels, and streams. The entire workflow is set up asynchronously and runs in about 130 microseconds (including copies to and from the device) on a Titan X (Pascal). We use CMake 3.18 to generate the build files. I'm on Windows 10 1909 with VS 2019 16.7, and the GPU is running in TCC mode.
I recently tried to upgrade to CUDA 11, but we see a performance degradation of about 10-20 microseconds per run (roughly 8-15% of the total).
Downgrading to CUDA 10.1 removes the issue, even when running the same drivers (452.06).
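For anyone who wants to compare the two toolkits on their own hardware, here is a minimal sketch of how an end-to-end time of this kind (copy in, compute, copy out) can be measured with CUDA events. It is not our actual code: the `scale` kernel, the buffer size, and the iteration count are hypothetical placeholders, and it assumes a single stream with pinned host memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",          \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical placeholder for the real inference kernels.
__global__ void scale(const float* in, float* out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

int main() {
    const int n = 1 << 16;              // placeholder problem size
    const size_t bytes = n * sizeof(float);
    const int iters = 1000;             // average over many runs

    float *h_in = nullptr, *h_out = nullptr, *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaMallocHost(&h_in, bytes));   // pinned memory, needed for async copies
    CHECK(cudaMallocHost(&h_out, bytes));
    CHECK(cudaMalloc(&d_in, bytes));
    CHECK(cudaMalloc(&d_out, bytes));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    cudaStream_t stream;
    cudaEvent_t start, stop;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // One untimed warm-up pass so one-time initialization is excluded.
    CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream));
    scale<<<blocks, threads, 0, stream>>>(d_in, d_out, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    // Timed region: everything is enqueued on the same stream.
    CHECK(cudaEventRecord(start, stream));
    for (int it = 0; it < iters; ++it) {
        CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream));
        scale<<<blocks, threads, 0, stream>>>(d_in, d_out, n, 2.0f);
        CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream));
    }
    CHECK(cudaEventRecord(stop, stream));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));  // elapsed time in milliseconds
    std::printf("average per iteration: %.2f microseconds\n", ms * 1000.0f / iters);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```

Building the same file with both toolkits (for example `nvcc timing.cu -o timing`) and comparing the printed averages should show whether a difference of this size appears outside our code base as well.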
Has anyone else experienced this? Are there compile flags I should change because they are known to cause issues in CUDA 11? Or is this expected because of my older card?
Thanks in advance.