The total load size from SOC memory is 8192 * 2 * 4 = 65536 bytes, and the store size to SOC memory is 8192 * 1 * 4 = 32768 bytes. So the total transferred data size is 65536 + 32768 = 98304 bytes.
The kernel execution time is 5.25 us, yielding a bandwidth of (98304 / 1e9) / (5.25 / 1e6) = 18.72 GB/s. We can thus assert that this kernel has achieved 45% (18.72 / 41.59) of the maximum memory bandwidth.
However, profiling the kernel with Nsight Compute shows a maximum bandwidth utilization of only 9.96%, as shown in the picture below. I am confused about which part of my analysis is incorrect.
Since the GPU is cached, memory is loaded in segments rather than individual bytes.
Based on the technical guide below, the memory bandwidth of AGX Orin is up to 204 GB/s.
Thank you for your response. I have a few follow-up questions if you don’t mind.
I have read the technical brief you mentioned, and I understand that the CPU and the GPU share the same physical memory. Is that correct?
Since the GPU is cached, memory is loaded in segments rather than individual bytes.
Just to clarify, when you say segments, are you referring to the memoryBusWidth parameter in the formula max_bandwidth = memoryBusWidth * memoryClockRate * 2 / 8 / 1e6, which is 128 bits on AGX Orin?
In a typical PC, the memory for the CPU and the GPU is separate. In that case, would it be correct to calculate the maximum bandwidth using the aforementioned formula?
Is there a way to obtain the 204 GB/s bandwidth figure through the CUDA runtime API? I am hoping to avoid hardcoding it directly into my code.
I have read the technical brief you mentioned, and I understand that the CPU and the GPU share the same physical memory. Is that correct?
If the answer to the first question is yes, I have a follow-up question: does any data movement actually occur when using cudaMemcpy between the host and the device? It should be more efficient than cudaMemcpy between two separate physical memories, correct?