Much worse performance after updating CUDA Toolkit from 10.2 to 11.4

I recently updated CUDA from 10.2 to 11.4 and noticed that the same application now runs more than 10x slower. What could be the reason?


The most common reason for this kind of observation: one is a release build and the other is a debug build.

I compiled both in Release mode.

It could be a code generation problem in the compiler (nvcc).
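One way to check whether the slowdown really comes from the generated device code is to time a single kernel in isolation under both toolkits. The following is only a minimal sketch, not the original poster's code: `saxpy` is a hypothetical stand-in for whichever kernel is under suspicion, and `time_kernel_ms`, the problem size, and the iteration count are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in kernel; replace it with the kernel under suspicion.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Returns the average time per kernel launch in milliseconds.
float time_kernel_ms(int n, int iters) {
    float *x = nullptr, *y = nullptr;
    CUDA_CHECK(cudaMalloc(&x, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&y, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(x, 0, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(y, 0, n * sizeof(float)));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // Warm-up launch: keeps one-time costs (context creation, module
    // loading) out of the measured interval.
    saxpy<<<grid, block>>>(n, 2.0f, x, y);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Events are recorded on the GPU timeline, so the elapsed time covers
    // kernel execution rather than host-side overhead.
    CUDA_CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i) {
        saxpy<<<grid, block>>>(n, 2.0f, x, y);
    }
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    CUDA_CHECK(cudaGetLastError());

    float total_ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&total_ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(x));
    CUDA_CHECK(cudaFree(y));
    return total_ms / iters;
}

int main() {
    const float ms = time_kernel_ms(1 << 20, 100);
    std::printf("average kernel time: %.4f ms\n", ms);
    return 0;
}
```

Building this with identical flags under each toolkit (for example `nvcc -O3`) and comparing the reported times shows whether the regression is in the kernel itself or somewhere else in the application. It is also worth confirming that `-G` is not on the nvcc command line, since that flag turns off device code optimizations even when the host code is built in Release mode.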