Simple CUDA Program has slow runtime


We noticed that some CUDA operations on our Xavier were running pretty slowly.

We have a very small test program, which we ran with different nvpmodel settings (we are on JetPack 4.4):
30w_all = 10.4ms
maxn = 7.7ms
maxn + clocks = 3.75ms

Could someone please verify whether these runtimes are normal for the Xavier module?

The Xavier module of one of our partners had much faster runtimes:
30w_all = 2.3ms
maxn + clocks = 0.9ms

How to run the test:

  1. unzip
  2. ./
  3. ./

To report the runtime, open the newly created “timeline.nvvp” with NVVP.
In the “GPU Details” window (lower left) there should be one operation called “count(int …”.
Please report the avg. duration for that operation and the nvpmodel you used.

Thanks in advance!

Here are the files: (1.2 KB)


Thanks for reporting this.
We are checking this and will share more information with you later.



We tested your sample with JetPack 4.4.1 on Xavier.
The results from our experiment look good, as shown below:

  • [maxn + jetson_clocks] = 889.445us
  • [maxn] = 2.37914ms
  • [30w_all + jetson_clocks] = 1.10764ms
  • [30w_all] = 2.44546ms

Are you using JetPack 4.4.1? If not, could you give it a try?


Hello AastaLLL,

thank you for testing.
I think this proves that our Jetson is running very slowly for some reason.

Unfortunately we cannot upgrade to JP 4.4.1 right now, and from reading the release notes I don’t see any major changes.
Are you sure that upgrading would change anything?
Why would it not work with JP 4.4.0?

What other reasons could cause this slow runtime?



There is a known issue with atomicAdd in JetPack 4.4.
The fix is available in rel32.4, which is the OS version of JetPack 4.4.1.
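
For anyone who wants to check whether their own setup is affected, here is a minimal sketch of an atomicAdd-heavy kernel timed with CUDA events. It is not the test program attached above: the kernel `countEqual`, the helper `countOnGpu`, the input data, and the launch configuration are all hypothetical and only illustrate the kind of operation discussed here. Absolute timings will differ from the numbers in this thread.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical kernel (not the one from the attachment): counts the
// elements equal to `target`, issuing one atomicAdd per match.
__global__ void countEqual(const int* data, int n, int target, int* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target) {
        atomicAdd(result, 1);
    }
}

// Runs the kernel once on a non-empty input, returns the count, and
// stores the kernel time in milliseconds in *ms.
int countOnGpu(const std::vector<int>& host, int target, float* ms)
{
    const int n = static_cast<int>(host.size());
    int* dData = nullptr;
    int* dResult = nullptr;
    CUDA_CHECK(cudaMalloc(&dData, n * sizeof(int)));
    CUDA_CHECK(cudaMalloc(&dResult, sizeof(int)));
    CUDA_CHECK(cudaMemcpy(dData, host.data(), n * sizeof(int),
                          cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemset(dResult, 0, sizeof(int)));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    const int threads = 256;                       // assumed block size
    const int blocks = (n + threads - 1) / threads;

    CUDA_CHECK(cudaEventRecord(start));
    countEqual<<<blocks, threads>>>(dData, n, target, dResult);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    CUDA_CHECK(cudaEventElapsedTime(ms, start, stop));

    int count = 0;
    CUDA_CHECK(cudaMemcpy(&count, dResult, sizeof(int),
                          cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(dData));
    CUDA_CHECK(cudaFree(dResult));
    return count;
}

int main()
{
    // Assumed input: 1M integers cycling through 0..3.
    std::vector<int> host(1 << 20);
    for (size_t i = 0; i < host.size(); ++i) {
        host[i] = static_cast<int>(i % 4);
    }

    float ms = 0.0f;
    int count = countOnGpu(host, 1, &ms);
    std::printf("count = %d, kernel time = %.3f ms\n", count, ms);
    return 0;
}
```

Running the same binary under each nvpmodel setting, before and after the upgrade, should give a rough comparison. The first launch includes some one-time startup overhead, so averaging several runs is more reliable than a single measurement.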


Hi AastaLLL,

You are right!
We updated to JP 4.4.1 and now the runtimes are normal:
[maxn + jetson_clocks] = 0.9ms