We are looking for benchmarks that can give the peak FLOP/s and memory bandwidth on the Jetson AGX Orin.
https://github.com/NVIDIA-AI-IOT/jetson_benchmarks: We looked at this, but it seems to focus on deep learning workloads. We are interested in measuring peak compute performance and memory bandwidth instead. Please recommend any standard benchmarks.
If these suggestions don’t help and you want to report an issue to us, please share the model, the commands/steps, and any customized app with us so we can reproduce it locally.
I was able to run this and got the following result:
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 36.6
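For context, the GB/s figure that bandwidthTest prints is essentially just bytes transferred divided by elapsed wall-clock time. A minimal sketch of that arithmetic (the elapsed time below is back-calculated from the reported number, not measured):

```python
transfer_bytes = 32_000_000   # transfer size from the result above
elapsed_s = 8.74e-4           # hypothetical elapsed time (~0.87 ms)

# Bandwidth in GB/s (decimal gigabytes, i.e. 1e9 bytes)
bandwidth_gbs = transfer_bytes / elapsed_s / 1e9
# ≈ 36.6 GB/s, matching the reported figure
```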
This is the bandwidth obtained when copying data from the CPU to the GPU, even though in this case both use the same physical DRAM. Is my understanding correct?
Also, theoretical DRAM bandwidth is around 204 GB/s, while this shows 36.6. Is this expected? What kind of overheads are involved here?
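One way to sanity-check the numbers: the ~204 GB/s theoretical figure follows from the AGX Orin's 256-bit LPDDR5 interface at 6400 MT/s, and on a unified-memory SoC a host-to-device copy both reads from and writes to that same DRAM, so the copy itself could see at most roughly half the peak. A sketch of that accounting (the halving argument is an assumption about how the copy traverses DRAM, not a measured fact; copy-engine and memory-controller overheads would lower the achievable figure further):

```python
# Theoretical DRAM bandwidth of Jetson AGX Orin:
# 256-bit LPDDR5 interface at 6400 MT/s.
bus_bytes = 256 // 8                 # bytes moved per transfer
transfers_per_s = 6400e6             # 6400 MT/s
peak_gbs = bus_bytes * transfers_per_s / 1e9   # 204.8 GB/s

# Assumption: a host-to-device copy reads and writes the same DRAM,
# so the copy can see at most about half of peak.
copy_ceiling_gbs = peak_gbs / 2      # ~102.4 GB/s

measured_gbs = 36.6
fraction_of_ceiling = measured_gbs / copy_ceiling_gbs  # ~0.36
```

Under that (assumed) accounting, the measured 36.6 GB/s is about a third of what the copy path could theoretically sustain, so copy-engine limits and per-transfer overheads would have to explain the remaining gap.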