Inconsistent performance on the A100

The profilers may unexpectedly serialize activity from multiple threads. Use the latest available profiler versions to get the best results in multithreaded scenarios.

Also, across the CUDA 10 to 11 timeframe (10.0 → 11.6, currently), the CUDA runtime’s handling of threads and streams has improved. So I would either make sure both setups use an identical configuration (the same CUDA version on the A100 and V100 machines) to get an apples-to-apples comparison, or move the A100 configuration to the latest available CUDA version.
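To confirm that two machines really are on the same configuration, one option is to print the versions each one reports. This is only a minimal sketch: the helper name `format_cuda_version` is made up for illustration, and it assumes device 0 is the GPU of interest. Note that `cudaDriverGetVersion` reports the newest CUDA version the installed driver supports, not the driver package version.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: CUDA encodes versions as 1000 * major + 10 * minor,
// so 11060 corresponds to "11.6".
static std::string format_cuda_version(int v) {
    return std::to_string(v / 1000) + "." + std::to_string((v % 1000) / 10);
}

int main() {
    int runtime_version = 0;
    int driver_version = 0;
    if (cudaRuntimeGetVersion(&runtime_version) != cudaSuccess ||
        cudaDriverGetVersion(&driver_version) != cudaSuccess) {
        std::fprintf(stderr, "could not query CUDA versions\n");
        return 1;
    }

    // Assumption: device 0 is the GPU being compared.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("device:             %s (compute capability %d.%d)\n",
                prop.name, prop.major, prop.minor);
    std::printf("CUDA runtime:       %s\n",
                format_cuda_version(runtime_version).c_str());
    std::printf("driver supports up to CUDA %s\n",
                format_cuda_version(driver_version).c_str());
    return 0;
}
```

Running this on both the A100 and V100 machines and comparing the output shows quickly whether the comparison is like for like.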

Those are pretty general statements, however. I don’t have any specific suggestions about what would need to be different to get maximum performance from the A100 vs. the V100, and I’m not aware of any intentional differences in multithreading behavior. If you have a short, self-contained test case that demonstrates the inconsistency, it would probably be interesting to inspect here, and/or a suitable basis for filing a bug.
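For what it’s worth, a test case along those lines could be as small as the sketch below. It is a starting skeleton under assumptions, not a known reproducer: `busy_kernel` is a hypothetical placeholder workload, and the thread counts, launch counts, and buffer size are arbitrary. Each host thread gets its own stream and buffer, and the wall-clock time is measured on the host so the numbers don’t depend on a profiler.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder workload; substitute the kernel that shows the issue.
__global__ void busy_kernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k) x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// One host thread: allocate a buffer, create a stream, launch, synchronize.
static bool worker(int launches, int n) {
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(d);
        return false;
    }
    cudaMemsetAsync(d, 0, n * sizeof(float), stream);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int i = 0; i < launches; ++i)
        busy_kernel<<<blocks, threads, 0, stream>>>(d, n, 1000);
    bool ok = cudaStreamSynchronize(stream) == cudaSuccess;
    cudaStreamDestroy(stream);
    cudaFree(d);
    return ok;
}

// Runs num_threads workers concurrently; returns elapsed ms, or -1 on error.
static double run_ms(int num_threads, int launches, int n) {
    std::vector<int> ok(num_threads, 0);  // int, not bool: one element per thread
    std::vector<std::thread> pool;
    auto t0 = std::chrono::steady_clock::now();
    for (int t = 0; t < num_threads; ++t)
        pool.emplace_back([&ok, t, launches, n] { ok[t] = worker(launches, n); });
    for (auto &th : pool) th.join();
    auto t1 = std::chrono::steady_clock::now();
    for (int v : ok)
        if (!v) return -1.0;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    cudaFree(0);  // create the CUDA context up front so it is not timed
    for (int num_threads : {1, 2, 4, 8}) {
        double ms = run_ms(num_threads, 100, 1 << 20);
        std::printf("%d host threads: %.2f ms\n", num_threads, ms);
    }
    return 0;
}
```

Something like `nvcc -o repro repro.cu` should build it (the file name is arbitrary, and some Linux toolchains may also need `-Xcompiler -pthread`). Running the same binary on both GPUs with the same CUDA version, and comparing how the time scales with the thread count, would give a concrete basis for discussion or for a bug report.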