Xavier LPDDR4X memory throughput

yairhav · April 30, 2019, 8:17am

Hi,

I have a question regarding the memory throughput of the Xavier.

According to the specs, the Xavier's peak memory throughput is 137 GB/s.

But when I measured the memory load throughput with nvprof (nvprof --metrics gld_throughput) for a very simple kernel that only copies data from an input buffer to an output buffer, the average throughput was only 46.5 GB/s.

The following kernel was used:

// Dummy kernel that copies data from the input buffer to the output buffer.
__global__ void dummy_copy(const float * __restrict__ input_mat,
                           float * __restrict__ output_mat,
                           int Width)
{
  int GlobalCol = blockIdx.x * blockDim.x + threadIdx.x;
  int GlobalLine = blockIdx.y * blockDim.y + threadIdx.y;

  output_mat[GlobalLine*Width + GlobalCol] = input_mat[GlobalLine*Width + GlobalCol];
}

Note: the Xavier is in the highest performance mode (nvpmodel -m 0, then jetson_clocks).

Why is this the case? I expected a much higher throughput, close to the spec, because it is a very simple kernel that only loads and stores data.

Thanks,

WayneWWW · April 30, 2019, 8:20am

Which power mode are you using?

yairhav · April 30, 2019, 8:41am

The Xavier is in the highest performance mode (nvpmodel -m 0, then jetson_clocks).

AastaLLL · April 30, 2019, 8:58am

Hi,

Would you mind checking this CUDA sample first?
/usr/local/cuda-10.0/samples/1_Utilities/bandwidthTest/

Thanks.

carolyuu · April 30, 2019, 9:22am

Hi yairhav,

The CUDA bandwidthTest sample reports about 112 GB/s.

sudo nvpmodel -m 0
sudo jetson_clocks

yairhav · April 30, 2019, 10:29am

We got the same results as yours, but this is not our case.
Our question was about the load throughput inside a CUDA kernel. We gave an example of a very simple kernel that only copies data from an input buffer to an output buffer, just to measure the memory throughput, and were very disappointed to see the result (46.5 GB/s).
Why don't we get high bandwidth in the case of the CUDA kernel?
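
One possible way to put the two figures on equal footing is to time the copy kernel with CUDA events and count both the bytes read and the bytes written. gld_throughput reports global load throughput only, so for a pure copy the combined read plus write traffic would be roughly twice that figure (about 93 GB/s for 46.5 GB/s of loads), assuming the stores move the same number of bytes in the same time. If the 112 GB/s above is the device-to-device result of bandwidthTest, that figure likely counts read plus write traffic as well. The sketch below is a minimal harness under those assumptions; the helper copy_bandwidth_gbs, the 4096 x 4096 size, and the 32 x 8 block are illustrative choices that do not come from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Copy kernel from the first post (one float per thread, no bounds check).
__global__ void dummy_copy(const float* __restrict__ input_mat,
                           float* __restrict__ output_mat,
                           int Width)
{
    int GlobalCol  = blockIdx.x * blockDim.x + threadIdx.x;
    int GlobalLine = blockIdx.y * blockDim.y + threadIdx.y;
    output_mat[GlobalLine * Width + GlobalCol] = input_mat[GlobalLine * Width + GlobalCol];
}

// Hypothetical helper: effective copy bandwidth in GB/s.
// Each copied byte is read once and written once, hence the factor 2.
static double copy_bandwidth_gbs(double bytes_copied, double elapsed_ms)
{
    return 2.0 * bytes_copied / (elapsed_ms * 1.0e-3) / 1.0e9;
}

int main()
{
    // Assumed sizes: exact multiples of the block dimensions, so the
    // unguarded kernel never indexes out of bounds.
    const int Width = 4096, Height = 4096;
    const size_t bytes = (size_t)Width * Height * sizeof(float);

    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, bytes) != cudaSuccess ||
        cudaMalloc(&out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, bytes);

    const dim3 block(32, 8);
    const dim3 grid(Width / block.x, Height / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dummy_copy<<<grid, block>>>(in, out, Width);   // warm-up launch, not timed

    const int iters = 20;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        dummy_copy<<<grid, block>>>(in, out, Width);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (cudaGetLastError() != cudaSuccess) {
        fprintf(stderr, "kernel launch failed\n");
        return 1;
    }

    printf("read+write: %.1f GB/s\n",
           copy_bandwidth_gbs((double)bytes * iters, ms));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Built with nvcc and run after sudo nvpmodel -m 0 and sudo jetson_clocks, the printed number can be compared directly with a read plus write figure, while half of it corresponds to the load-only view that gld_throughput gives.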

AastaLLL · May 2, 2019, 8:33am

Hi,

The use cases are different.
It’s recommended to check our bandwidth sample first.

The test you shared is composed of reads, writes, and usually some cache misses.
If the indexing is not well designed, it may also have coalescing issues.

To improve the bandwidth, it’s recommended to improve the coalescing first.
You can find some information in this paper:
https://research.nvidia.com/sites/default/files/pubs/2019-02_Throughput-oriented-GPU-memory//paper.pdf

Thanks.
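
To illustrate the coalescing point, here is a rough sketch that times the same copy with two index mappings. In the first, neighbouring threads along threadIdx.x touch neighbouring floats. In the second, they are a full row apart. The kernel in the first post uses the first mapping, so how well it coalesces would depend mainly on the launch configuration, which the thread does not show. All names here (coalesced_index, strided_index, time_ms) and the 4096 x 4096 size are hypothetical, chosen only for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adjacent x values map to adjacent addresses (stride of 1 float).
__host__ __device__ inline int coalesced_index(int x, int y, int width)
{
    return y * width + x;
}

// Adjacent x values map to addresses `width` floats apart.
__host__ __device__ inline int strided_index(int x, int y, int width)
{
    return x * width + y;
}

__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int i = coalesced_index(x, y, width);
    out[i] = in[i];
}

__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int i = strided_index(x, y, width);
    out[i] = in[i];
}

// Average time per launch in milliseconds, measured with CUDA events.
template <typename Kernel>
static float time_ms(Kernel kernel, dim3 grid, dim3 block,
                     const float* in, float* out, int width, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<grid, block>>>(in, out, width);   // warm-up launch, not timed
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        kernel<<<grid, block>>>(in, out, width);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

int main()
{
    const int n = 4096;   // square matrix, so both mappings stay in bounds
    const size_t bytes = (size_t)n * n * sizeof(float);

    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, bytes) != cudaSuccess ||
        cudaMalloc(&out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, bytes);

    const dim3 block(32, 8);
    const dim3 grid(n / block.x, n / block.y);

    printf("coalesced: %.3f ms per launch\n",
           time_ms(copy_coalesced, grid, block, in, out, n, 20));
    printf("strided:   %.3f ms per launch\n",
           time_ms(copy_strided, grid, block, in, out, n, 20));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

If the strided variant turns out much slower on a given board, that would be consistent with the coalescing advice above. The actual ratio depends on the device and is not something this thread reports.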
