Bandwidth measurement Theoretical bandwidth vs BandwidthTest(SDK) results

Zepo · May 25, 2011, 11:14am

Hi everyone,
The theoretical bandwidth of a GTX 480 is 177.408 GB/sec.
This value comes from www.gpreview.com, and it matches the calculation given in the “CUDA C Best Practices” document, that is, (1848 MHz * 10^6 * (384 bit / 8) * 2 (DDR)) / 10^9 = 177.408 GB/sec.
The profiler also agrees with this bandwidth.
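
For reference, the arithmetic above can be sketched in CUDA host code. This is only a minimal sketch: the function names theoretical_bandwidth_gbs and device_theoretical_bandwidth_gbs are hypothetical, and the factor of 2 is the same double-data-rate assumption as in the formula above, which may not hold for every memory type. It relies on the runtime attributes cudaDevAttrMemoryClockRate (reported in kHz) and cudaDevAttrGlobalMemoryBusWidth (reported in bits).

```cuda
#include <cuda_runtime.h>

// Theoretical peak bandwidth in GB/s (10^9 bytes per second) from a memory
// clock in kHz and a bus width in bits. The factor 2 assumes double data rate,
// as in the formula quoted in the post.
double theoretical_bandwidth_gbs(double mem_clock_khz, int bus_width_bits)
{
    double transfers_per_sec  = 2.0 * mem_clock_khz * 1e3;
    double bytes_per_transfer = bus_width_bits / 8.0;
    return transfers_per_sec * bytes_per_transfer / 1e9;
}

// Same calculation using the values the runtime reports for a device.
// Returns -1.0 if either attribute query fails.
double device_theoretical_bandwidth_gbs(int device)
{
    int clock_khz = 0;
    int bus_bits  = 0;
    if (cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, device) != cudaSuccess)
        return -1.0;
    if (cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, device) != cudaSuccess)
        return -1.0;
    return theoretical_bandwidth_gbs(clock_khz, bus_bits);
}
```

With the numbers from this post, theoretical_bandwidth_gbs(1848000.0, 384) gives the 177.408 GB/sec figure.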

Nevertheless, running the BandwidthTest included in the SDK gives a smaller value.
Specifically, it reports a bandwidth of around 118 GB/sec for the device-to-device copy (cudaMemcpy-deviceToDevice).

Does anyone know why there is such a difference?

I need a reference point to evaluate the performance of some kernels. My best solution achieves a bandwidth of 104 GB/s, and I wonder what the real bandwidth limit of my device is (118 or 177?).
Thank you.

Jimmy_Pettersson · May 25, 2011, 2:44pm

The real limit is 177.

There is code that achieves up to ~91-93% of that.

I achieved 148 GB/s with my reduction sum code posted here: The Official NVIDIA Forums | NVIDIA

Gert-Jan · May 25, 2011, 3:17pm

The bandwidth test in the SDK uses cudaMemcpy( … , … , … , cudaMemcpyDeviceToDevice) to measure the bandwidth. I once read somewhere (on this forum, I think) that this doesn’t reach maximum performance, and that you can do better with a memcopy kernel (which you have to write yourself, though).

In my opinion, the best way to find the maximum bandwidth your application can achieve is to write a simple memcopy or matrix addition kernel. Use the same matrix size as in your real kernels, the same number of threads per thread block, and the same total thread block count. Make it as similar to your real kernels as possible (including the data type), and you will probably get the best measure of the bandwidth you can achieve.

This is the kernel I normally use for this kind of measurement:

// Element-wise sum of two matrices. Assumes the grid exactly covers the
// matrix, so the row stride is gridDim.x * blockDim.x and no bounds check
// is needed.
__global__ void MatrixAdd_Kernel(int *in1, int *in2, int *out)
{
  int col_id = blockIdx.x * blockDim.x + threadIdx.x;
  int row_id = blockIdx.y * blockDim.y + threadIdx.y;
  int idx = row_id * gridDim.x * blockDim.x + col_id;

  int tmp1 = in1[idx];
  int tmp2 = in2[idx];
  out[idx] = tmp1 + tmp2;
}
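
A kernel like the one above still has to be timed to get a bandwidth number. The following is a minimal sketch of one way to do that with CUDA events, not the original poster's harness. The names matrix_add, effective_bandwidth_gbs and measure_matrix_add_gbs are hypothetical, and the kernel here takes explicit width and height arguments with a bounds check (an assumption added so the grid does not have to match the matrix exactly). Each launch reads two matrices and writes one, so it counts 3 * width * height * sizeof(int) bytes per launch.

```cuda
#include <cuda_runtime.h>

// Element-wise add with a bounds check (added assumption, see lead-in).
__global__ void matrix_add(const int *in1, const int *in2, int *out,
                           int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int idx = row * width + col;
        out[idx] = in1[idx] + in2[idx];
    }
}

// Effective bandwidth in GB/s (10^9 bytes per second).
double effective_bandwidth_gbs(double bytes_moved, double elapsed_ms)
{
    return bytes_moved / (elapsed_ms * 1e-3) / 1e9;
}

// Times `reps` launches after one warm-up launch.
// Returns the effective bandwidth in GB/s, or -1.0 on a CUDA error.
double measure_matrix_add_gbs(int width, int height, dim3 block, int reps)
{
    size_t bytes = (size_t)width * height * sizeof(int);
    int *in1 = nullptr, *in2 = nullptr, *out = nullptr;
    bool ok = cudaMalloc(&in1, bytes) == cudaSuccess &&
              cudaMalloc(&in2, bytes) == cudaSuccess &&
              cudaMalloc(&out, bytes) == cudaSuccess;
    float ms = 0.0f;
    if (ok) {
        cudaMemset(in1, 0, bytes);
        cudaMemset(in2, 0, bytes);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        matrix_add<<<grid, block>>>(in1, in2, out, width, height);  // warm-up
        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)
            matrix_add<<<grid, block>>>(in1, in2, out, width, height);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
        ok = cudaGetLastError() == cudaSuccess && ms > 0.0f;
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(in1);
    cudaFree(in2);
    cudaFree(out);
    if (!ok) return -1.0;
    // Two reads plus one write per element, per launch.
    return effective_bandwidth_gbs(3.0 * bytes * reps, ms);
}
```

A call such as measure_matrix_add_gbs(4096, 4096, dim3(32, 8), 20) is one possible configuration. In line with the advice above, the matrix size and block shape should be replaced by those of the real kernels.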

njuffa · May 25, 2011, 4:43pm

Gert-Jan remembers correctly. I am attaching a small CUDA app that measures memory bandwidth by copying a vector of double-precision numbers, similar to the way the STREAM benchmark does it. Hope this helps.
dcopy.cu (5.48 KB)

zkoza · May 30, 2011, 12:40pm

See: The Official NVIDIA Forums | NVIDIA

ZK
