Latency of CUDA VMM API on Jetson Orin

I used the code from vattention/nvidia-vattn-uvm-driver/tests/cu_measure.cu at main · microsoft/vattention · GitHub to test the execution latency of the CUDA VMM API on different platforms.
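For context, here is a minimal sketch of the kind of measurement involved; it is my own reconstruction in the spirit of cu_measure.cu, not a copy of it, and it assumes the benchmark times each driver-API call with a host-side clock. It requires an NVIDIA GPU and links against the driver API (`nvcc bench.cu -lcuda`).

```cuda
// Hedged sketch: time cuMemCreate/cuMemRelease (the CUDA VMM driver API)
// with a host-side clock, averaged over many iterations.
#include <cuda.h>
#include <chrono>
#include <cstdio>

#define CHECK(x) do { CUresult r_ = (x); if (r_ != CUDA_SUCCESS) { \
    fprintf(stderr, "%s failed: %d\n", #x, (int)r_); return 1; } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev; CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    // Physical allocation properties for device memory on this GPU.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    // VMM allocations must be a multiple of the minimum granularity.
    size_t gran = 0;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
          CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    const int iters = 100;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        CUmemGenericAllocationHandle h;
        CHECK(cuMemCreate(&h, gran, &prop, 0));  // allocate physical memory
        CHECK(cuMemRelease(h));                  // free it again
    }
    auto t1 = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("avg cuMemCreate+cuMemRelease: %.1f us\n", us);
    return 0;
}
```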

From the table above, we can see that the latency of the CUDA VMM API on Jetson Orin is much longer than on the 4090 / A100, especially for the cuMemCreate API.

So, what is the reason for the huge execution time on Jetson Orin, and is there any method to reduce the overhead?

Much of the work of CUDA APIs occurs on the host side: The faster the host system, the lower the average latency of most CUDA API calls. What are the specifications of the host system used with the discrete GPUs in the table?

Jetson platforms integrate CPU and GPU and may therefore enjoy efficiencies not enjoyed by systems with discrete GPUs for some API calls.

BTW, is this a Jetson AGX Orin or a Jetson Orin Nano?

  1. For the 4090, the host system is Ubuntu 22.04 with CUDA 12.5 and an Intel(R) Core(TM) i9-10980XE CPU @ 3.00 GHz.
  2. For the A100, the latency values are copied from this paper: https://www.usenix.org/conference/osdi24/presentation/agrawal
  3. For the Orin, the host system is Ubuntu 22.04 with NVIDIA JetPack 6.1.
  4. It is a Jetson AGX Orin, with a GPU at 1.3 GHz max frequency and 32 GB of UMA memory.

3 GHz is the base frequency; this CPU boosts up to 4.8 GHz. Compared to the 2.2 GHz Arm Cortex-A78AE in the Jetson AGX Orin, this host platform probably provides 3x the performance based on higher clock frequencies and larger caches. That’s a rough estimate. I cannot find any good measured benchmarking data for these platforms.

The performance differences between the host system for the RTX 4090 and the Jetson AGX Orin in your table are generally larger (>= 5x) than the estimated factor of 3x. I cannot explain that based on paper specs. System memory throughput should not be a contributing factor: the Orin is specified at 200 GB/sec, while the Intel CPU uses quad-channel DDR4 with about 90 GB/sec. While the Intel CPU also has more cores than the AGX Orin, this should not matter here, as CUDA APIs mostly represent a single-threaded workload.

I don’t have any hands-on experience with Jetson platforms. Is it possible that it is held back by configuration issues? It would probably be a good idea to inquire in the sub-forum dedicated to the Jetson AGX Orin; that is where the experts hang out.

As the memory on Jetson platforms is unified and the GPU directly accesses system memory (unlike the 4090), can we expect the necessary processing to be comparable? Can you move those slower VMM API calls into an uncritical section of your program?

Due to the slow VMM API, it is necessary to move those API calls to an uncritical path.
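One common way to do that is to pre-create the physical allocation handles at startup, so the critical path only performs the cheaper mapping calls. The sketch below is my own illustration of that idea, not code from vattention; the names (`HandlePool`, `prefill`) are hypothetical.

```cuda
// Hedged sketch: amortize the slow cuMemCreate by pre-creating physical
// handles off the critical path; the hot path only maps a ready handle
// into an already reserved VA range (cuMemMap + cuMemSetAccess).
#include <cuda.h>
#include <cstdio>
#include <vector>

struct HandlePool {
    std::vector<CUmemGenericAllocationHandle> free_;  // pre-created handles
    size_t chunk = 0;                                 // allocation granularity

    // Uncritical path: called once at startup; cuMemCreate is the slow call.
    void prefill(int n, CUdevice dev) {
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = dev;
        cuMemGetAllocationGranularity(&chunk, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);
        for (int i = 0; i < n; ++i) {
            CUmemGenericAllocationHandle h;
            if (cuMemCreate(&h, chunk, &prop, 0) == CUDA_SUCCESS)
                free_.push_back(h);
        }
    }

    // Critical path: map a pre-created handle at va and enable access;
    // in the table these calls are cheaper than cuMemCreate.
    CUresult grow_mapping(CUdeviceptr va, CUdevice dev) {
        CUmemGenericAllocationHandle h = free_.back();
        free_.pop_back();
        CUresult r = cuMemMap(va, chunk, 0, h, 0);
        if (r != CUDA_SUCCESS) return r;
        CUmemAccessDesc acc = {};
        acc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        acc.location.id = dev;
        acc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        return cuMemSetAccess(va, chunk, &acc, 1);
    }
};

int main() {
    if (cuInit(0) != CUDA_SUCCESS) return 1;
    CUdevice dev; cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
    HandlePool pool;
    pool.prefill(4, dev);  // done at startup, off the critical path
    printf("prefilled %zu handles of %zu bytes\n", pool.free_.size(), pool.chunk);
    return 0;
}
```

Whether this pays off on Orin depends on whether cuMemCreate is the dominant cost there, which is what your table suggests.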

However, I’m also curious about the internal execution logic of the VMM API and the differences between these platforms. If there is room for optimization on Jetson Orin, using the VMM APIs would be much more attractive.

Thanks, I’ll move to the sub-forum.