Bandwidth measurement: Theoretical bandwidth vs BandwidthTest (SDK) results

Hi everyone,
The theoretical bandwidth of a GTX 480 is 177.408 GB/s.
This value follows the calculation given in the “CUDA C Best Practices” document, that is, (1848 * 10^6 Hz * (384 bit / 8) * 2 (DDR)) / 10^9 = 177.408 GB/s.
The profiler also agrees with this bandwidth.
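
To make the arithmetic above easy to re-check, here is a minimal host-side sketch of the same formula. The function names `theoreticalBandwidthGBs` and `percentOfPeak` are made up for illustration; the only inputs are the memory clock and bus width quoted in this post, and 1 GB is taken as 10^9 bytes.

```cuda
#include <cassert>

// Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes).
// memClockMHz : memory clock in MHz (1848 for the GTX 480 figure used here)
// busWidthBits: memory interface width in bits (384 here)
// The factor 2 accounts for double data rate (two transfers per clock).
double theoreticalBandwidthGBs(double memClockMHz, int busWidthBits)
{
    assert(memClockMHz > 0.0 && busWidthBits > 0);
    double bytesPerTransfer = busWidthBits / 8.0;
    return memClockMHz * 1.0e6 * bytesPerTransfer * 2.0 / 1.0e9;
}

// Measured bandwidth expressed as a percentage of the theoretical peak.
double percentOfPeak(double measuredGBs, double peakGBs)
{
    assert(peakGBs > 0.0);
    return 100.0 * measuredGBs / peakGBs;
}
```

Calling `theoreticalBandwidthGBs(1848.0, 384)` reproduces the 177.408 GB/s figure, and `percentOfPeak` puts the measured numbers in this thread (118, 104 and 148 GB/s) at roughly 66.5%, 58.6% and 83.4% of that peak.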

Nevertheless, running the BandwidthTest included in the SDK gives a smaller value.
Specifically, it reports a bandwidth of around 118 GB/s for the device-to-device copy (cudaMemcpy with cudaMemcpyDeviceToDevice).

Does anyone know why there is such a difference?

I need a reference to evaluate the performance of some kernels. My best solution achieves a bandwidth of 104 GB/s, and I wonder what the real bandwidth limit of my device is (118 or 177 GB/s?).
Thank you.

The real limit is 177.

There is code that achieves up to ~91-93% of it.

I achieved 148 GB/s with my reduction sum code posted here:

The bandwidth test in the SDK uses cudaMemcpy( … , … , … , cudaMemcpyDeviceToDevice) to measure the bandwidth. I think I once read somewhere on this forum that this doesn’t reach maximum performance, and that you can do better with a memcopy kernel (which you have to write yourself, though).

In my opinion the best way to find the maximum bandwidth your application can achieve is to write a simple memcopy or matrix addition kernel. Use the same matrix size as in your real kernels, the same number of threads per thread block, and the same total number of thread blocks. Make it as similar as possible to your real kernels (including the data type), and you will probably get the best measure of the bandwidth you can actually reach.

This is a kernel I normally use for this kind of measurement:

__global__ void MatrixAdd_Kernel(int *in1, int *in2, int *out)
{
  // Global column and row of the element handled by this thread
  int col_id = blockIdx.x * blockDim.x + threadIdx.x;
  int row_id = blockIdx.y * blockDim.y + threadIdx.y;

  // Row width is the total number of threads in x (gridDim.x * blockDim.x)
  int idx = row_id * gridDim.x * blockDim.x + col_id;

  int tmp1 = in1[idx];
  int tmp2 = in2[idx];
  out[idx] = tmp1 + tmp2;
}
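
For completeness, here is one way the kernel above could be timed and turned into a bandwidth figure. This is only a sketch: the helper `measureMatrixAddGBs`, the 16 x 16 block size and the warm-up launch are my assumptions, not part of the original post. It counts two reads and one write per element and takes 1 GB as 10^9 bytes.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Same kernel as above: one thread per element, row width = gridDim.x * blockDim.x.
__global__ void MatrixAdd_Kernel(int *in1, int *in2, int *out)
{
    int col_id = blockIdx.x * blockDim.x + threadIdx.x;
    int row_id = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = row_id * gridDim.x * blockDim.x + col_id;
    out[idx] = in1[idx] + in2[idx];
}

// Hypothetical helper: times `reps` launches on a width x height int matrix and
// returns the effective bandwidth in GB/s, or -1.0 on any error.
// width and height are assumed to be multiples of the 16 x 16 block.
double measureMatrixAddGBs(int width, int height, int reps)
{
    const dim3 block(16, 16);  // assumed block size; use the one from your real kernel
    if (width <= 0 || height <= 0 || reps <= 0 ||
        width % block.x != 0 || height % block.y != 0) return -1.0;
    const dim3 grid(width / block.x, height / block.y);
    const size_t bytes = (size_t)width * height * sizeof(int);

    int *in1 = NULL, *in2 = NULL, *out = NULL;
    cudaEvent_t start = NULL, stop = NULL;
    float ms = 0.0f;
    bool ok = cudaMalloc((void **)&in1, bytes) == cudaSuccess &&
              cudaMalloc((void **)&in2, bytes) == cudaSuccess &&
              cudaMalloc((void **)&out, bytes) == cudaSuccess &&
              cudaMemset(in1, 0, bytes) == cudaSuccess &&
              cudaMemset(in2, 0, bytes) == cudaSuccess &&
              cudaEventCreate(&start) == cudaSuccess &&
              cudaEventCreate(&stop) == cudaSuccess;
    if (ok) {
        MatrixAdd_Kernel<<<grid, block>>>(in1, in2, out);  // warm-up, not timed
        cudaEventRecord(start);
        for (int r = 0; r < reps; ++r)
            MatrixAdd_Kernel<<<grid, block>>>(in1, in2, out);
        cudaEventRecord(stop);
        ok = cudaEventSynchronize(stop) == cudaSuccess &&
             cudaEventElapsedTime(&ms, start, stop) == cudaSuccess &&
             cudaGetLastError() == cudaSuccess && ms > 0.0f;
    }
    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(in1);
    cudaFree(in2);
    cudaFree(out);
    if (!ok) return -1.0;

    // Each launch reads two matrices and writes one: 3 * bytes of traffic.
    double seconds = ms / 1000.0;
    return 3.0 * (double)bytes * reps / seconds / 1.0e9;
}
```

Something like `measureMatrixAddGBs(4096, 4096, 20)` gives a number that can be compared directly with the 177.408 GB/s theoretical value; for a fair reference, pass the matrix size your real kernels use.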


Gert-Jan remembers correctly. I am attaching a little CUDA app that determines memory bandwidth by copying a vector of double-precision numbers, similar to the way the STREAM benchmark does it. Hope this helps.
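
For readers without the attachment, a STREAM-style copy measurement can be sketched roughly as follows. This is not the attached program: the names `copyKernel` and `measureCopyGBs`, the launch configuration (1024 blocks of 256 threads) and the verification step are assumptions chosen for illustration. One read plus one write per element is counted, with 1 GB = 10^9 bytes.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride copy: each thread copies elements i, i + stride, i + 2*stride, ...
__global__ void copyKernel(const double *src, double *dst, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

// Hypothetical helper: copies n doubles on the device `reps` times, checks the
// result on the host, and returns GB/s, or -1.0 on error or mismatch.
double measureCopyGBs(size_t n, int reps)
{
    if (n == 0 || reps <= 0) return -1.0;
    const size_t bytes = n * sizeof(double);
    std::vector<double> host(n), check(n, 0.0);
    for (size_t i = 0; i < n; ++i) host[i] = (double)i;

    const int threads = 256;   // assumed launch configuration
    const int blocks = 1024;   // tune both values for your device
    double *src = NULL, *dst = NULL;
    cudaEvent_t start = NULL, stop = NULL;
    float ms = 0.0f;
    bool ok = cudaMalloc((void **)&src, bytes) == cudaSuccess &&
              cudaMalloc((void **)&dst, bytes) == cudaSuccess &&
              cudaMemcpy(src, &host[0], bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaEventCreate(&start) == cudaSuccess &&
              cudaEventCreate(&stop) == cudaSuccess;
    if (ok) {
        copyKernel<<<blocks, threads>>>(src, dst, n);  // warm-up, not timed
        cudaEventRecord(start);
        for (int r = 0; r < reps; ++r)
            copyKernel<<<blocks, threads>>>(src, dst, n);
        cudaEventRecord(stop);
        ok = cudaEventSynchronize(stop) == cudaSuccess &&
             cudaEventElapsedTime(&ms, start, stop) == cudaSuccess &&
             cudaGetLastError() == cudaSuccess && ms > 0.0f &&
             cudaMemcpy(&check[0], dst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    if (!ok) return -1.0;

    // Verify that the copy really happened before trusting the timing.
    for (size_t i = 0; i < n; ++i)
        if (check[i] != host[i]) return -1.0;

    // Each pass reads n doubles and writes n doubles: 2 * bytes of traffic.
    double seconds = ms / 1000.0;
    return 2.0 * (double)bytes * reps / seconds / 1.0e9;
}
```

A call such as `measureCopyGBs(1 << 24, 20)` copies 128 MB per pass, which should be large enough that launch overhead does not dominate the result.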