High Execution Latency of CUDA VMM API on Jetson AGX Orin

I used the code from vattention/nvidia-vattn-uvm-driver/tests/cu_measure.cu (main branch of microsoft/vattention on GitHub) to test the execution latency of the CUDA VMM API on different platforms.

From the picture above, we can see that the latency of the CUDA VMM API on Jetson Orin is much longer than on the 4090 / A100, especially for the cuMemCreate API.

So, what is the reason for the long execution time on Jetson Orin, and is there any method to reduce the overhead?

Original issue from: Latency of Cuda VMM API on jetson orin

Hi,

Could you share the environment of your Orin?
Is it JetPack 6.2 with CUDA 12.6?

On Jetson, please try the commands below to maximize the device performance before benchmarking:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

  1. JetPack 6.1 with CUDA 12.6
  2. I’m already using those commands to set the max performance mode

By the way,

when I use cuMemAddressReserve to reserve GPU virtual memory several times on Jetson AGX Orin, it reports out of memory.

However, the same code runs on a 4090 GPU without any problem.

What’s the max virtual address space size on Jetson Orin, and why is there a difference between Orin and the 4090?
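
One way to compare the two platforms is to reserve address ranges in fixed-size chunks until the driver refuses, then report how much was reserved. The sketch below does that with the CUDA driver API. The 1 GiB chunk size and the cap on the number of chunks are arbitrary values chosen for illustration, and the total it prints is only what this one process managed to reserve, not a documented limit.

```cpp
#include <cuda.h>

#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Initialize the driver API and make the primary context of device 0 current.
    CUdevice dev;
    CUcontext ctx;
    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGet(&dev, 0) != CUDA_SUCCESS ||
        cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS ||
        cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
        std::fprintf(stderr, "CUDA driver initialization failed\n");
        return 1;
    }

    // Query the minimum allocation granularity for pinned device memory on device 0.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = 0;
    std::size_t granularity = 0;
    if (cuMemGetAllocationGranularity(&granularity, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM) != CUDA_SUCCESS ||
        granularity == 0) {
        std::fprintf(stderr, "cuMemGetAllocationGranularity failed\n");
        return 1;
    }

    // Assumed values for illustration: 1 GiB chunks, at most 2^18 chunks (256 TiB).
    const std::size_t kChunkTarget = std::size_t{1} << 30;
    const std::size_t kMaxChunks = std::size_t{1} << 18;
    // Round the chunk size up to a multiple of the granularity.
    const std::size_t chunk = ((kChunkTarget + granularity - 1) / granularity) * granularity;

    // Reserve virtual address ranges (no physical memory is created or mapped)
    // until a reservation fails or the cap is reached.
    std::vector<CUdeviceptr> ranges;
    CUresult rc = CUDA_SUCCESS;
    while (ranges.size() < kMaxChunks) {
        CUdeviceptr ptr = 0;
        rc = cuMemAddressReserve(&ptr, chunk, 0 /*alignment*/, 0 /*addr hint*/, 0 /*flags*/);
        if (rc != CUDA_SUCCESS) break;
        ranges.push_back(ptr);
    }

    const char* status = nullptr;
    if (cuGetErrorName(rc, &status) != CUDA_SUCCESS) status = "unknown";
    const double gib = static_cast<double>(ranges.size()) * static_cast<double>(chunk) /
                       static_cast<double>(std::size_t{1} << 30);
    std::printf("reserved %zu chunks of %zu bytes (%.1f GiB); last status: %s\n",
                ranges.size(), chunk, gib, status);

    // Release every reserved range and drop the primary context.
    for (CUdeviceptr ptr : ranges) cuMemAddressFree(ptr, chunk);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```

The program needs to be linked against the driver library (`-lcuda`). Running the same binary logic on Orin and on the 4090 would show whether the reservable range itself differs, independent of any physical memory allocation.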

Hi,

We need to check with our internal team first.
Will provide more info to you later.

Thanks.

Hi,

CPU speed is critical when comparing performance.
The possible reason is that the x86 CPUs to which these GPUs are connected are far faster than the Orin CPU.

Thanks.

Thanks very much

Looking forward to your analysis and reply

Hi,

The comment on Feb 27 is the info from our internal team.
The score difference might be related to the CPU performance.

Thanks.

According to this analysis,

CPU performance may not be the root cause.

Hi,

Do you have the steps to run the test on AGX Orin?
We gave it a try but hit the error below.
Could you share the detailed steps to generate the performance table as you reported?

$ ./vattn 
Cannot find /dev/nvidia-uvm in /proc/self/fd
vattn.cu:20 vattn_init(0, HANDLE_SIZE)failed (4294967295) 

Also, could you test our sample as well?
You might need to manually add a loop and a timing function to get the execution time.
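
The loop and timing function mentioned above could look roughly like the sketch below. It is plain C++ with no CUDA dependency; the names `LatencyStats`, `summarize`, and `measure_latency` are made up for illustration and are not part of any NVIDIA sample. Each timed call is followed by an untimed cleanup step, so an allocation call can be measured on its own while its matching release runs outside the timed region.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Latency summary in microseconds (hypothetical helper type).
struct LatencyStats {
    double min_us;
    double median_us;
    double mean_us;
    double max_us;
};

// Sorts the samples and computes min, median, mean, and max.
LatencyStats summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const double median = (n % 2 == 1)
                              ? samples[n / 2]
                              : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    const double mean =
        std::accumulate(samples.begin(), samples.end(), 0.0) / static_cast<double>(n);
    return {samples.front(), median, mean, samples.back()};
}

// Times `call` once per iteration with a monotonic clock. `cleanup` runs after
// every call but outside the timed region. `warmup` untimed iterations run
// first so that one-time initialization cost is not counted.
template <typename Call, typename Cleanup>
LatencyStats measure_latency(Call&& call, Cleanup&& cleanup,
                             std::size_t iterations, std::size_t warmup = 10) {
    assert(iterations > 0);
    for (std::size_t i = 0; i < warmup; ++i) {
        call();
        cleanup();
    }

    std::vector<double> samples;
    samples.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        call();
        const auto stop = std::chrono::steady_clock::now();
        cleanup();
        samples.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    return summarize(samples);
}
```

For example, `measure_latency([&] { cuMemCreate(&handle, size, &prop, 0); }, [&] { cuMemRelease(handle); }, 1000)` would time only the creation call, assuming `handle`, `size`, and `prop` are set up beforehand. Reporting the median next to the mean helps separate a consistently slow call from a few outliers.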

Thanks.