Inconsistent performance on the A100

Hi,

I have a program that uses multithreading, and each thread uses its own pipeline to communicate with the GPU. On a V100 I get consistently good performance. On an A100 the performance varies by almost 50% from run to run. I could not find any obvious difference between the runs, but Nsight Compute shows much lower GPU throughput in the slow runs; both Compute and Memory throughput vary a lot. Are there any settings one has to use to get consistent performance, or where could this variation come from?
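
For context, the structure is roughly like the sketch below (a simplified illustration only, not my actual code; the kernel, data sizes, and thread count are placeholders): one stream per host thread, each submitting its own work.

```
// Simplified sketch of the setup: each host thread owns a stream and submits
// its own work, so activity from different threads can overlap on the GPU.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

void threadWork(int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);          // one stream per host thread

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaStreamSynchronize(stream);

    cudaFree(d);
    cudaStreamDestroy(stream);
}

int main()
{
    const int n = 1 << 20;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back(threadWork, n);
    for (auto &w : workers)
        w.join();
    return 0;
}
```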

Regards,
Rob

I’m not sure, Rob. Let’s see if Bob has any ideas.

-Mat

The profiler(s) may unexpectedly serialize activity from multiple threads. Use the latest available versions of the profilers to get the best results in multi-threaded scenarios.

Also, over the CUDA 10-11 timeframe (10.0 through 11.6, currently) there have been improvements in the CUDA runtime’s handling of threads and streams. So I would either make sure I’m using an identical configuration (the same CUDA version on the A100 and V100 setups) to get an apples-to-apples comparison, or move the A100 config to the latest available version.
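
As a quick sanity check (standard CUDA runtime API calls, nothing specific to this issue), you can print which runtime and driver versions a binary actually sees on each machine:

```
// Print the CUDA runtime version in use and the maximum version the
// installed driver supports, to confirm both setups really match.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int runtimeVersion = 0, driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);   // runtime library version in use
    cudaDriverGetVersion(&driverVersion);     // max version the driver supports
    printf("CUDA runtime: %d, driver supports up to: %d\n",
           runtimeVersion, driverVersion);
    return 0;
}
```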

Those are pretty general statements, however. I don’t have any specific suggestions about things that would be different to get maximum performance from A100 vs. V100. I’m not aware of any intentional differences in multithreading behavior. If you have a short, self-contained test case that would demonstrate it, it would probably be interesting to inspect here, and/or a suitable basis to file a bug.

Hi Robert,

I am using CUDA 11.5 and NV SDK 22.1. The difference was observable not only in the profiler but also at regular runtime. However, I found the reason: the code had a lot of non-coalesced memory accesses. Changing these accesses to coalesced ones did not change the performance on the V100 (it actually made it slightly slower, by about 2%), but it now gives pretty consistent performance on the A100. I find it strange that this makes a big difference on the A100 but not on the V100; any idea why this is the case?
I did not manage to create a small example that reproduces the behaviour; it did not occur in simpler code.
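
Just to illustrate the kind of change I mean (a made-up sketch, not the actual application code), the difference is essentially between strided and contiguous per-thread indexing:

```
// Non-coalesced: consecutive threads access elements that are `stride` apart,
// so each warp's loads/stores span many separate memory segments.
__global__ void scaleStrided(float *data, int n, int stride)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < n)
        data[(i * stride) % n] *= 2.0f;
}

// Coalesced: consecutive threads access consecutive elements, so each warp's
// loads/stores map to a few contiguous memory segments.
__global__ void scaleCoalesced(float *data, int n)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}
```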

Regards,
Rob

No, sorry, I don’t have any insight here. None of it makes much sense to me. I would expect that a GPU with more overall available bandwidth (the A100) would be a bit less sensitive to inefficient memory usage, not the other way around as you are reporting. I also find it strange that going from non-coalesced to coalesced access (if there was significant activity there) would make no difference, or even make things slightly worse, on the V100.

I don’t have any insight or speculation here.