I used the code from vattention/nvidia-vattn-uvm-driver/tests/cu_measure.cu (main branch of microsoft/vattention on GitHub) to measure the execution latency of the CUDA VMM API on different platforms.
From the picture above, we can see that the latency of the CUDA VMM API on Jetson Orin is much longer than on the 4090 / A100, especially for the cuMemCreate API.
So, what is the reason for the huge execution time on Jetson Orin, and is there any method to reduce the overhead?
Original issue: Latency of Cuda VMM API on jetson orin
Hi,
Could you share the environment of your Orin?
Is it JetPack 6.2 with CUDA 12.6?
On Jetson, please try the commands below to maximize device performance before benchmarking:
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
Thanks.
By the way,
when I use cuMemAddressReserve to reserve GPU virtual memory several times on Jetson AGX Orin, it reports out of memory.
However, the same code runs on a 4090 GPU without any problem.
What is the maximum virtual address space size on Jetson Orin, and why is there a difference between Orin and the 4090?
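For reference, the reservation behaviour described above could be probed with something like the following. This is only a rough sketch, not the code from the original test: the file name, the 1 GiB chunk size, and the cap on the number of chunks are all assumptions chosen for illustration. It reserves address ranges with cuMemAddressReserve until a call fails, then reports how much was reserved and the status returned by the failing call.

```cuda
// va_probe.cu (hypothetical name). Build (assumption): nvcc va_probe.cu -o va_probe -lcuda
#include <cuda.h>
#include <cstdio>
#include <vector>

// Round size up to the next multiple of granularity.
static size_t round_up(size_t size, size_t granularity) {
    return ((size + granularity - 1) / granularity) * granularity;
}

int main() {
    CUdevice dev;
    CUcontext ctx;
    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, 0) != CUDA_SUCCESS ||
        cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS ||
        cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
        fprintf(stderr, "CUDA driver/context setup failed\n");
        return 1;
    }

    // Query the minimum allocation granularity for device memory.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    size_t gran = 0;
    if (cuMemGetAllocationGranularity(&gran, &prop,
            CU_MEM_ALLOC_GRANULARITY_MINIMUM) != CUDA_SUCCESS || gran == 0) {
        fprintf(stderr, "cuMemGetAllocationGranularity failed\n");
        return 1;
    }

    // Assumptions: 1 GiB per reservation, and stop after 2^18 chunks at most.
    const size_t chunk = round_up(1ull << 30, gran);
    const size_t max_chunks = 1u << 18;

    std::vector<CUdeviceptr> ranges;
    CUresult rc = CUDA_SUCCESS;
    while (ranges.size() < max_chunks) {
        CUdeviceptr va = 0;
        // Only virtual address space is reserved here; no physical memory is created.
        rc = cuMemAddressReserve(&va, chunk, 0, 0, 0);
        if (rc != CUDA_SUCCESS) break;
        ranges.push_back(va);
    }

    const char* msg = nullptr;
    cuGetErrorString(rc, &msg);
    printf("reserved %zu chunks of %zu bytes (%.1f GiB); last status: %s\n",
           ranges.size(), chunk,
           ranges.size() * (chunk / double(1ull << 30)),
           msg ? msg : "unknown");

    // Release every reserved range before exiting.
    for (CUdeviceptr va : ranges) cuMemAddressFree(va, chunk);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```

Running the same binary on Orin and on the 4090 would show at which total reservation size each platform starts to fail, which may help narrow down the question.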
Hi,
We need to check with our internal team first.
Will provide more info to you later.
Thanks.
Hi,
CPU speed is critical when comparing performance.
A possible reason is that the x86 CPUs these GPUs are attached to are far faster than the Orin CPU.
Thanks.
Thanks very much
Looking forward to your analysis and reply
Hi,
The comment on Feb 27 is the info from our internal team.
The score difference might be related to the CPU performance.
Thanks.
Regarding this analysis, here are the test environments:
For the 4090, the host system is Ubuntu 22.04 with CUDA 12.5 and an Intel(R) Core™ i9-10980XE CPU @ 3.00GHz.
For the A100, the latency values are copied from the paper: https://www.usenix.org/conference/osdi24/presentation/agrawal
For Orin, the host system is Ubuntu 22.04 with nvidia-jetpack 6.1.
It is a Jetson AGX Orin, whose GPU has a 1.3GHz max frequency and 32GB of UMA memory.
CPU performance may not be the root cause.
Hi,
Do you have the steps to run the test on AGX Orin?
We gave it a try but hit the error below.
Could you share the detailed steps to generate the performance table you reported?
$ ./vattn
Cannot find /dev/nvidia-uvm in /proc/self/fd
vattn.cu:20 vattn_init(0, HANDLE_SIZE)failed (4294967295)
Also, could you test our sample as well?
You might need to manually add a loop and timing functions to get the execution time.
Thanks.
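The "loop and timing function" suggestion above could look roughly like the sketch below. It is not the cu_measure.cu code and not the NVIDIA sample; the iteration count, the choice of one minimum-granularity chunk per iteration, and the helper names (CHECK, timed_us) are assumptions made for illustration. It times each driver API call in the usual VMM sequence with a host-side clock and prints the average per call.

```cuda
// vmm_timing.cu (hypothetical name). Build (assumption): nvcc vmm_timing.cu -o vmm_timing -lcuda
#include <cuda.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Abort with the driver's error string if a call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        CUresult rc_ = (call);                                         \
        if (rc_ != CUDA_SUCCESS) {                                     \
            const char* s_ = nullptr;                                  \
            cuGetErrorString(rc_, &s_);                                \
            fprintf(stderr, "%s failed: %s\n", #call, s_ ? s_ : "?");  \
            exit(1);                                                   \
        }                                                              \
    } while (0)

// Run f once and return the elapsed host time in microseconds.
template <typename F>
static double timed_us(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

int main() {
    const int kIters = 100;  // assumption: number of repetitions to average over

    CHECK(cuInit(0));
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t size = 0;  // assumption: one minimum-granularity chunk per iteration
    CHECK(cuMemGetAllocationGranularity(&size, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    double reserve = 0, create = 0, map = 0, set_access = 0;
    double unmap = 0, release = 0, addr_free = 0;

    for (int i = 0; i < kIters; ++i) {
        CUdeviceptr va = 0;
        CUmemGenericAllocationHandle h = 0;
        reserve    += timed_us([&] { CHECK(cuMemAddressReserve(&va, size, 0, 0, 0)); });
        create     += timed_us([&] { CHECK(cuMemCreate(&h, size, &prop, 0)); });
        map        += timed_us([&] { CHECK(cuMemMap(va, size, 0, h, 0)); });
        set_access += timed_us([&] { CHECK(cuMemSetAccess(va, size, &access, 1)); });
        unmap      += timed_us([&] { CHECK(cuMemUnmap(va, size)); });
        release    += timed_us([&] { CHECK(cuMemRelease(h)); });
        addr_free  += timed_us([&] { CHECK(cuMemAddressFree(va, size)); });
    }

    printf("chunk size: %zu bytes, iterations: %d (average us per call)\n", size, kIters);
    printf("cuMemAddressReserve: %.2f\n", reserve / kIters);
    printf("cuMemCreate:         %.2f\n", create / kIters);
    printf("cuMemMap:            %.2f\n", map / kIters);
    printf("cuMemSetAccess:      %.2f\n", set_access / kIters);
    printf("cuMemUnmap:          %.2f\n", unmap / kIters);
    printf("cuMemRelease:        %.2f\n", release / kIters);
    printf("cuMemAddressFree:    %.2f\n", addr_free / kIters);

    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

Since this uses only the CUDA driver API, it should not depend on /dev/nvidia-uvm or the vattention driver, so it may be an easier way to compare the same calls on Orin and on a desktop GPU. On Jetson, running nvpmodel and jetson_clocks first, as suggested earlier in the thread, would keep the clocks fixed during the measurement.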