Much of the work of CUDA API calls happens on the host side: the faster the host system, the lower the average latency of most CUDA API calls. What are the specifications of the host system used with the discrete GPUs in the table?
Jetson platforms integrate CPU and GPU and may therefore enjoy efficiencies not enjoyed by systems with discrete GPUs for some API calls.
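To compare host-side API latency across platforms on equal terms, a small timing helper along these lines could be used. This is only a sketch: `mean_latency_us` is a made-up name, and the callable passed in is assumed to wrap exactly one CUDA API call (plus whatever cleanup is needed to make it repeatable).

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: mean wall-clock latency of `call` in microseconds,
// averaged over `iters` invocations. `call` is any callable; for this
// discussion it would wrap a single CUDA API call.
template <typename F>
double mean_latency_us(F&& call, int iters) {
    assert(iters > 0);
    using clock = std::chrono::steady_clock;

    call();  // warm-up: the first invocation may pay one-time setup costs

    const auto t0 = clock::now();
    for (int i = 0; i < iters; ++i) {
        call();
    }
    const auto t1 = clock::now();

    const std::chrono::duration<double, std::micro> total = t1 - t0;
    return total.count() / iters;
}
```

Running the same binary on both systems, with the same iteration count, gives per-call averages that can be compared directly against the estimated host performance ratio.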
BTW, is this a Jetson AGX Orin or a Jetson Orin Nano?
3 GHz is the base frequency; this CPU boosts up to 4.8 GHz. Compared to the 2.2 GHz Arm Cortex-A78AE in the Jetson AGX Orin, this host platform probably provides 3x the performance based on higher clock frequencies and larger caches. That’s a rough estimate. I cannot find any good measured benchmarking data for these platforms.
The performance differences between the host system for the RTX 4090 and the Jetson AGX Orin in your table are generally larger (>= 5x) than the estimated factor of 3x. I cannot explain that based on paper specs. System memory throughput should not be a contributing factor: the Orin is specified with 200 GB/sec, while the Intel CPU uses quad-channel DDR4 with about 90 GB/sec. While the Intel CPU also has more cores than the AGX Orin, this should not matter here, as CUDA APIs mostly represent a single-threaded workload.
I don’t have any hands-on experience with Jetson platforms. Is it possible that it is held back by configuration issues? It would probably be a good idea to inquire in the sub-forum dedicated to the Jetson AGX Orin. That is where the experts hang out:
As the memory on the Jetson platforms is unified and the GPU accesses system memory directly (unlike the 4090), can we expect the necessary processing to be comparable? Can you move those slower VMM API calls into an uncritical section of your program?
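One way to keep slow calls out of the critical section is to pay for them once during setup and hand out ready-made resources later. The sketch below is generic C++ rather than CUDA code: `PreparedPool` is a hypothetical name, and the `create` callback is assumed to stand in for the slow VMM sequence (for example `cuMemCreate`, `cuMemMap` and `cuMemSetAccess`), which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical pool: runs the slow "create" step `count` times up front so
// that the critical path only pops an already prepared resource.
template <typename Resource>
class PreparedPool {
public:
    // `create` is a placeholder for the slow setup calls (assumption: in a
    // CUDA program it would wrap the VMM API sequence and return a handle).
    PreparedPool(std::size_t count, const std::function<Resource()>& create) {
        assert(static_cast<bool>(create));
        ready_.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            ready_.push_back(create());
        }
    }

    // Critical path: no slow call here. Returns false if the pool is empty.
    bool acquire(Resource& out) {
        if (ready_.empty()) return false;
        out = std::move(ready_.back());
        ready_.pop_back();
        return true;
    }

    // Hand a resource back for reuse instead of destroying and recreating it.
    void release(Resource r) { ready_.push_back(std::move(r)); }

    std::size_t available() const { return ready_.size(); }

private:
    std::vector<Resource> ready_;
};
```

Whether this fits depends on the program: it assumes the number and size of the mappings can be bounded ahead of time, and it trades memory held in reserve for predictable latency later.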
Because the VMM API is slow, it is necessary to move those API calls to a non-critical path.
However, I’m also curious about the internal execution logic of the VMM API and how it differs between those platforms. If there is room for optimization on Jetson Orin, that would make the VMM APIs more attractive to use.