Confused about memory bandwidth

Hi,

On Jetson AGX Orin, I have found that the theoretical maximum GPU memory bandwidth can be calculated from the CUDA device properties with the following formula (memoryBusWidth is in bits, memoryClockRate is in kHz, and the factor of 2 accounts for DDR; the result is in GB/s):

max_bandwidth = memoryBusWidth * memoryClockRate * 2 / 8 / 1e6

This results in a bandwidth of 41.59 GB/s.
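For reference, here is a minimal sketch of how I query these properties with cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  // bits * kHz * 2 (DDR) / 8 (bits -> bytes) / 1e6 (kB/s -> GB/s)
  double max_bandwidth =
      (double)prop.memoryBusWidth * prop.memoryClockRate * 2.0 / 8.0 / 1e6;

  printf("memoryBusWidth : %d bits\n", prop.memoryBusWidth);
  printf("memoryClockRate: %d kHz\n", prop.memoryClockRate);
  printf("max bandwidth  : %.2f GB/s\n", max_bandwidth);
  return 0;
}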

Then I launch a sumArraysGPU kernel on arrays of 8192 float elements.

__global__ void sumArraysGPU(const float* a, const float* b, float* c, int32_t n) {
  // One thread per element: compute the global index and guard against
  // out-of-range threads in the last block.
  int32_t idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) {
    c[idx] = a[idx] + b[idx];
  }
}

The total load size from SoC memory is 8192 * 2 * 4 = 65536 bytes, and the store size to SoC memory is 8192 * 1 * 4 = 32768 bytes, so the total transferred data size is 65536 + 32768 = 98304 bytes.
The kernel execution time is 5.25 us, yielding a bandwidth of (98304 / 1e9) / (5.25 / 1e6) = 18.72 GB/s. This kernel has therefore achieved about 45% (18.72 / 41.59) of the maximum memory bandwidth.
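For completeness, this is roughly how I take the measurement (a sketch; the block size of 256 and the cudaEvent-based timing are my choices for illustration, not part of the numbers above):

#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumArraysGPU(const float* a, const float* b, float* c, int32_t n) {
  int32_t idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) c[idx] = a[idx] + b[idx];
}

int main() {
  const int32_t n = 8192;
  const size_t bytes = n * sizeof(float);

  float *a, *b, *c;
  cudaMalloc((void**)&a, bytes);
  cudaMalloc((void**)&b, bytes);
  cudaMalloc((void**)&c, bytes);

  const int block = 256;
  const int grid = (n + block - 1) / block;

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm-up launch so the timed run excludes one-time overheads.
  sumArraysGPU<<<grid, block>>>(a, b, c, n);

  cudaEventRecord(start);
  sumArraysGPU<<<grid, block>>>(a, b, c, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  // Two loads and one store of 4 bytes per element.
  double transferred = 3.0 * bytes;
  printf("time: %.3f us, effective bandwidth: %.2f GB/s\n",
         ms * 1e3, transferred / (ms * 1e-3) / 1e9);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return 0;
}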

However, profiling the kernel with Nsight Compute reports a memory bandwidth utilization of only 9.96%, as shown in the picture below. I am confused about which part of my analysis is incorrect.

Hi,

Since the GPU is cached, memory is loaded in segments rather than individual bytes.
According to the technical brief below, the memory bandwidth of AGX Orin is up to 204 GB/s.

https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf

Thanks.

Thank you for your response. I have a few follow-up questions if you don’t mind.

  1. I have read the technical brief you mentioned, and I understand that the CPU and the GPU share the same physical memory. Is that right?

  2. You wrote that "since the GPU is cached, memory is loaded in segments rather than individual bytes."

    Just to clarify, when you say segments, are you referring to the memoryBusWidth parameter in the formula max_bandwidth = memoryBusWidth * memoryClockRate * 2 / 8 / 1e6, which is 128 bits on AGX Orin?

  3. In a typical PC, the memory for the CPU and the GPU is separate. In that case, would it be correct to calculate the maximum bandwidth using the aforementioned formula?

  4. Is there a way to obtain the 204 GB/s bandwidth figure through the CUDA runtime API? I am hoping to avoid hardcoding it directly into my code.

  5. Returning to question 1: if the CPU and the GPU do share the same physical memory, does data movement still occur when using cudaMemcpy between the host and the device? It should be more efficient than a cudaMemcpy between two separate physical memories, correct?

Hi,

1. Yes, the CPU and the GPU share the same physical memory.

2. Not really.
Usually, the OS loads a big chunk of data from disk into memory.

3. In general, you can find this information directly in the specifications.

4. You can find a sample below (a condensed sketch of the idea follows this list):

/usr/local/cuda-11.4/samples/1_Utilities/bandwidthTest/

5. Yes, on Jetson some zero-copy memory types are available (a minimal sketch is included at the end of this post).
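Regarding point 4, the sample essentially times large host-to-device copies with CUDA events; a condensed sketch of the same idea (the 256 MB pinned buffer is an arbitrary choice for illustration):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 256 << 20;  // 256 MB

  float *h_buf, *d_buf;
  cudaMallocHost((void**)&h_buf, bytes);  // pinned host memory
  cudaMalloc((void**)&d_buf, bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("host-to-device: %.2f GB/s\n", bytes / (ms * 1e-3) / 1e9);

  cudaFreeHost(h_buf);
  cudaFree(d_buf);
  return 0;
}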

Thanks.
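And for point 5, a minimal zero-copy sketch using mapped pinned memory (one option among several on Jetson; the buffer size and kernel are just for illustration):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int* data, int n) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) data[idx] += 1;
}

int main() {
  // Needed on some platforms before any other CUDA call to allow mapped allocations.
  cudaSetDeviceFlags(cudaDeviceMapHost);

  const int n = 1024;
  int* h_data;

  // Mapped pinned memory: the GPU accesses the host allocation directly,
  // so no explicit cudaMemcpy is needed on a shared-memory SoC.
  cudaHostAlloc((void**)&h_data, n * sizeof(int), cudaHostAllocMapped);
  for (int i = 0; i < n; ++i) h_data[i] = i;

  int* d_data;
  cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

  increment<<<(n + 255) / 256, 256>>>(d_data, n);
  cudaDeviceSynchronize();

  printf("h_data[0] = %d (expected 1)\n", h_data[0]);

  cudaFreeHost(h_data);
  return 0;
}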
