Latency of CUDA VMM API on Jetson Orin

I used the code from vattention/nvidia-vattn-uvm-driver/tests/cu_measure.cu at main · microsoft/vattention · GitHub to test the execution latency of the CUDA VMM API on different platforms.
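For context, here is a minimal sketch of the kind of measurement involved; it is my own reconstruction in the spirit of cu_measure.cu, not a copy of it, and it assumes the benchmark times each driver-API call with a host-side clock. It requires an NVIDIA GPU and links against the driver API (`nvcc bench.cu -lcuda`).

```cuda
// Hedged sketch: time cuMemCreate/cuMemRelease (the CUDA VMM driver API)
// with a host-side clock, averaged over many iterations.
#include <cuda.h>
#include <chrono>
#include <cstdio>

#define CHECK(x) do { CUresult r_ = (x); if (r_ != CUDA_SUCCESS) { \
    fprintf(stderr, "%s failed: %d\n", #x, (int)r_); return 1; } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev; CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    // Physical allocation properties for device memory on this GPU.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    // VMM allocations must be a multiple of the minimum granularity.
    size_t gran = 0;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
          CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    const int iters = 100;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        CUmemGenericAllocationHandle h;
        CHECK(cuMemCreate(&h, gran, &prop, 0));  // allocate physical memory
        CHECK(cuMemRelease(h));                  // free it again
    }
    auto t1 = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("avg cuMemCreate+cuMemRelease: %.1f us\n", us);
    return 0;
}
```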

From the table above, we can see that the latency of the CUDA VMM API on Jetson Orin is much longer than on the 4090 / A100, especially for the cuMemCreate API.

So, what is the reason for the huge execution time on Jetson Orin, and is there any method to reduce the overhead?

Much of the work of CUDA APIs occurs on the host side: The faster the host system, the lower the average latency of most CUDA API calls. What are the specifications of the host system used with the discrete GPUs in the table?

Jetson platforms integrate CPU and GPU and may therefore enjoy efficiencies not enjoyed by systems with discrete GPUs for some API calls.

BTW, is this a Jetson AGX Orin or a Jetson Orin Nano?

  1. For the 4090, the host system is Ubuntu 22.04 with CUDA 12.5 and an Intel(R) Core(TM) i9-10980XE CPU @ 3.00 GHz.
  2. For the A100, the latency values are copied from this paper: https://www.usenix.org/conference/osdi24/presentation/agrawal
  3. For the Orin, the host system is Ubuntu 22.04 with NVIDIA JetPack 6.1.
  4. It is a Jetson AGX Orin, with a GPU at 1.3 GHz max frequency and 32 GB of UMA memory.

3 GHz is the base frequency; this CPU boosts up to 4.8 GHz. Compared to the 2.2 GHz Arm Cortex-A78AE in the Jetson AGX Orin, this host platform probably provides 3x the performance based on higher clock frequencies and larger caches. That’s a rough estimate. I cannot find any good measured benchmarking data for these platforms.

The performance differences between the host system for the RTX 4090 and the Jetson AGX Orin in your table are generally larger (>= 5x) than the estimated factor of 3x. I cannot explain that based on paper specs. System memory throughput should not be a contributing factor: the Orin is specified at 200 GB/sec, while the Intel CPU uses quad-channel DDR4 with about 90 GB/sec. While the Intel CPU also has more cores than the AGX Orin, this should not matter here, as CUDA APIs mostly represent a single-threaded workload.

I don’t have any hands-on experience with Jetson platforms. Is it possible that it is held back by configuration issues? It would probably be a good idea to inquire in the sub-forum dedicated to the Jetson AGX Orin; that is where the experts hang out.

As the memory on Jetson platforms is unified and the GPU directly accesses system memory (unlike the 4090), can we expect the necessary processing to be comparable? Can you move those slower VMM API calls into an uncritical section of your program?

Due to the slow VMM API, it is necessary to move those API calls to an uncritical path.
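One common way to do that is to pre-create the physical allocation handles at startup, so the critical path only performs the cheaper mapping calls. The sketch below is my own illustration of that idea, not code from vattention; the names (`HandlePool`, `prefill`) are hypothetical.

```cuda
// Hedged sketch: amortize the slow cuMemCreate by pre-creating physical
// handles off the critical path; the hot path only maps a ready handle
// into an already reserved VA range (cuMemMap + cuMemSetAccess).
#include <cuda.h>
#include <cstdio>
#include <vector>

struct HandlePool {
    std::vector<CUmemGenericAllocationHandle> free_;  // pre-created handles
    size_t chunk = 0;                                 // allocation granularity

    // Uncritical path: called once at startup; cuMemCreate is the slow call.
    void prefill(int n, CUdevice dev) {
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = dev;
        cuMemGetAllocationGranularity(&chunk, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);
        for (int i = 0; i < n; ++i) {
            CUmemGenericAllocationHandle h;
            if (cuMemCreate(&h, chunk, &prop, 0) == CUDA_SUCCESS)
                free_.push_back(h);
        }
    }

    // Critical path: map a pre-created handle at va and enable access;
    // in the table these calls are cheaper than cuMemCreate.
    CUresult grow_mapping(CUdeviceptr va, CUdevice dev) {
        CUmemGenericAllocationHandle h = free_.back();
        free_.pop_back();
        CUresult r = cuMemMap(va, chunk, 0, h, 0);
        if (r != CUDA_SUCCESS) return r;
        CUmemAccessDesc acc = {};
        acc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        acc.location.id = dev;
        acc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        return cuMemSetAccess(va, chunk, &acc, 1);
    }
};

int main() {
    if (cuInit(0) != CUDA_SUCCESS) return 1;
    CUdevice dev; cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
    HandlePool pool;
    pool.prefill(4, dev);  // done at startup, off the critical path
    printf("prefilled %zu handles of %zu bytes\n", pool.free_.size(), pool.chunk);
    return 0;
}
```

Whether this pays off on Orin depends on whether cuMemCreate is the dominant cost there, which is what your table suggests.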

However, I’m also curious about the internal execution logic of the VMM API and the differences between these platforms. If there is room for optimization on Jetson Orin, using the VMM APIs would be much more attractive.

Thanks, I’ll move to the sub-forum.