Inconsistent performance on the A100


I have a program that uses multithreading, and each thread uses its own pipeline to communicate with the GPU. On a V100 I get consistently good performance. On the A100 the performance varies by almost 50% from run to run. I could not find any obvious difference between the runs, but Nsight Compute shows much lower GPU throughput in the slower runs; both Compute and Memory throughput vary a lot. Are there any settings one has to use to get consistent performance, or where could this variation come from?
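For context, here is a minimal sketch of the kind of setup described, assuming "pipeline" means a per-thread `cudaStream_t`; the kernel, sizes, and thread count are placeholders, not the poster's actual code (`cudaMallocAsync` requires CUDA 11.2 or newer):

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Placeholder kernel; the real workload is whatever the application runs.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Each host thread owns its own stream and submits work independently,
// so kernels and copies from different threads can overlap on the GPU.
void worker(int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    float *d;
    cudaMallocAsync(&d, n * sizeof(float), stream);
    work<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaFreeAsync(d, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(worker, 1 << 20);
    for (auto &th : threads) th.join();
    return 0;
}
```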


I’m not sure, Rob. Let’s see if Bob has any ideas.


The profiler(s) may unexpectedly serialize activity from multiple threads. Use the latest possible versions of the profilers to get the best results in multi-threaded scenarios.

Also, in the CUDA 10-11 timeframe (10.0 → 11.6, currently) there have been improvements in the CUDA runtime’s handling of threads and streams. So I would either make sure I’m using an identical configuration (the same CUDA version on both the A100 and V100 setups) to get an apples-to-apples comparison, or promote the A100 config to the latest possible version.
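One related build-time knob worth checking in multi-threaded code (whether it matters here depends on whether the application issues any work to the default stream): by default, work launched without an explicit stream from different host threads lands in the single legacy default stream and serializes. nvcc’s `--default-stream per-thread` option gives each host thread its own default stream instead:

```shell
# Give each host thread its own default stream instead of the
# single legacy default stream shared by all threads.
nvcc --default-stream per-thread -o app app.cu
```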

Those are pretty general statements, however. I don’t have any specific suggestions about things one would do differently to get maximum performance from an A100 vs. a V100, and I’m not aware of any intentional differences in multithreading behavior. If you have a short, self-contained test case that demonstrates the issue, it would probably be interesting to inspect here, and/or a suitable basis to file a bug.

Hi Robert,

I am using CUDA 11.5 and NV SDK 22.1. The difference was observable not only in the profiler but also in regular runs. However, I found the reason: the code had a lot of non-coalesced memory accesses. Changing these accesses to coalesced ones did not improve performance on the V100 (it actually made it slightly slower, by about 2%), but it now gives pretty consistent performance on the A100. I find it strange that this makes a big difference on the A100 but not on the V100; any idea why that is the case?
I did not manage to make a small example that reproduces the behaviour; it did not occur in simpler code.
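For other readers, a generic sketch of the distinction (not the poster’s code): in a coalesced pattern, adjacent threads in a warp touch adjacent addresses, so a warp’s loads combine into a few memory transactions; in a strided pattern, each thread touches a distant address and the warp can generate up to one transaction per thread:

```cuda
// Coalesced: adjacent threads read adjacent elements, so the hardware
// can service a whole warp with one or a few memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: each thread reads an element `stride` floats away from
// its neighbour's, scattering the warp's accesses across memory.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);  // scattered index
        out[i] = in[j];
    }
}
```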


No, sorry, I don’t have any insight here. None of it makes much sense to me. I would expect that a GPU with more overall available bandwidth (A100) would be a bit less sensitive to inefficient memory usage, not the other way around as you are reporting. I also find it strange that going from non-coalesced access to coalesced access (if there was significant activity there) would make no difference or make the situation slightly worse on V100.

I don’t have any insight or speculation here.