Performance degradation from CUDA 10.0 to CUDA 10.2

I have two Xavier AGX systems, both running CUDA 10.0 but with different driver versions (they were flashed with different versions of JetPack). I see about a 5% performance degradation for my kernels on the system that was flashed with the newer JetPack. Is this normal?

The newer JetPack actually installed CUDA 10.2, which was even slower. Downgrading it to 10.0 makes things a bit faster, but the gap is still not completely gone.

Hi,

We don’t find any performance drop from CUDA 10.0 to CUDA 10.2.
When you test the performance, have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

If yes, would you mind sharing the performance you observe (as well as the corresponding setup) with us?
We want to reproduce this issue in our environment first.

Thanks.

Yes, that was the first thing I did. The performance numbers come from nsys stats after I profiled the app.
I have a tar file with the folder structure and Makefile to reproduce the result. How do I give it to you?

To reproduce the performance degradation, download the following tar file and decompress it. Go to SimpleComparison/MultComparison/ComplexMult and type make. After that, run sh run.sh to profile the executable.

Do the above steps on one Xavier with JetPack 4.3 and one with JetPack 4.6, which carry CUDA 10.0 and CUDA 10.2 respectively. You should see a significant difference in the profiled performance numbers.
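The steps above can be sketched as shell commands (the tar file name here is a placeholder for the attachment in this thread, and the comment about nsys is an assumption about what run.sh does):

```shell
# Placeholder name for the attached archive
tar xvf repro.tar
cd SimpleComparison/MultComparison/ComplexMult
make
# Profiles the executable (presumably via nsys, per the thread)
sh run.sh
```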

Hi,

Thanks for sharing the sample.

We are going to reproduce this issue internally.
Will share more information with you later.

Hi,

Thanks for your patience.
Confirmed that we can reproduce this issue in our environment as well.

We are now checking this with our internal team.
Will share more information with you later.


Hi,

Thanks for your patience.

Please note that GPU execution time can depend on many things, e.g. the current workload of the GPU, resource availability for the work, etc.
So we don't guarantee a really tight bound on the GPU execution time of any kernel once it reaches the GPU.

But below are some experiments we have tried: change.patch (3.4 KB)

1. Do some warm-up loops of the kernel.

2. Measure the GPU execution time via CUDA events.
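The two changes can be sketched roughly as follows. This is a minimal example, not the actual change.patch: the kernel, sizes, and iteration counts are placeholders rather than the ComplexMult code from the tar.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real workload
__global__ void dummyKernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int n = 1 << 20, iters = 100;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);

    // 1. Warm-up loops, so clock ramp-up and lazy initialization
    //    do not pollute the measurement.
    for (int i = 0; i < 10; ++i)
        dummyKernel<<<grid, block>>>(out, in, n);
    cudaDeviceSynchronize();

    // 2. Measure with CUDA events, which time on the GPU itself
    //    rather than including host-side launch overhead.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        dummyKernel<<<grid, block>>>(out, in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Total elapsed time = %.3fms, average time = %.3fms\n",
           ms, ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```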

With the above changes, we can see a very similar performance (even slightly better) on JetPack4.6.
Would you mind also checking this on your side?

JetPack 4.3

Total elapsed time = 18983.760ms, average time = 1.898ms
...
 Time(%)  Total Time (ns)  Instances    Average     Minimum    Maximum    StdDev                                   Name                                 
 -------  ---------------  ---------  -----------  ---------  ---------  --------  ---------------------------------------------------------------------
   100.0   19,179,070,656     10,100  1,898,917.9  1,822,208  4,362,752  89,854.0  ComplexMult(complex_float*, complex_float*, complex_float*, int, int)

JetPack 4.6:

Total elapsed time = 18558.316ms, average time = 1.856ms
...
 Time(%)  Total Time (ns)  Instances    Average     Minimum    Maximum    StdDev                                   Name                                 
 -------  ---------------  ---------  -----------  ---------  ---------  --------  ---------------------------------------------------------------------
   100.0   18,738,775,456     10,100  1,855,324.3  1,806,496  5,056,928  84,044.6  ComplexMult(complex_float*, complex_float*, complex_float*, int, int)

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.