Xavier LPDDR4X memory throughput

Hi,

I have a question regarding to memory throughput of the Xavier.

In the specs it is written that the Xavier peek memory throughput is 137 GB/s,

But when I tried to measure the memory load throughput using the nvprof (nvprof --metric gld_throughput)

Of a very simple kernel which only copies data from input to output buffer I got that the average throughput is only 46.5 GB/s…

The following kernel was used:

//the following kernel is a dummy kernel which copies data from input to output buffers.
global void dummy_copy(const float * restrict input_mat,
float * restrict output_mat,
int Width)
{
int GlobalCol = blockIdx.x * blockDim.x + threadIdx.x;
int GlobalLine = blockIdx.y * blockDim.y + threadIdx.y;

  output_mat[GlobalLine*Width + GlobalCol] = input_mat[GlobalLine*Width + GlobalCol];

}


****the Xavier is in the highest performance mode (nvpmodel –m 0 jetson_clocks)

Why it is the case?? I expected to get a much higher throughput which is close to the spec because it is a very simple kernel which only loads and stores data…

Thanks,

Which power mode are you using?


****the Xavier is in the highest performance mode (nvpmodel –m 0 jetson_clocks)

Hi,

Would you mind to check this CUDA sample first?
/usr/local/cuda-10.0/samples/1_Utilities/bandwithTest/

Thanks.

Hi yairhav,

The CUDA bandwithTest sample result about 112GB/s.

sudo nvpmodel -m 0
sudo jetson_clocks

We got the same results as yours but this not our case.the example of co
our question was getting the load throughput inside the cuda kernel, we gave an example of a very simple kernel which only copies data from input to output buffers just to measure the memory throughput and were very disappointed to see the result(46.5GB/s)
why in the case of the cuda kernel we don’t get high bandwidth ??

Hi,

The use case is different.
It’s recommended to check our bandwidth sample first.

The shared test is composited of READ, WRITE and usually some cache missing.
If the index is not well-designed, it may also have some coalescing issue.

To improve the bandwidth, it’s recommended to improve the coalescing first.
You should get some information in this paper:
https://research.nvidia.com/sites/default/files/pubs/2019-02_Throughput-oriented-GPU-memory//paper.pdf

Thanks.