Question regarding peak memory bandwidth: gigabytes or gibibytes per second?

I’ve been measuring the peak memory bandwidth I can obtain in some linear algebra kernels I’ve written. I want to make sure I am calculating my percentage utilization correctly (or, really, check how Nvidia is reporting their bandwidth). The most I seem to be able to get out of a GTX 480 is 150 GiB/s = 161 GB/s. Nvidia reports the peak memory bandwidth of the 480 as 177.4 GB/s. Since they state GB/s and not GiB/s, I presume that they are counting memory bandwidth in base 10 (gigabyte = 10^9 bytes), not base 2 (gibibyte = 2^30 bytes)?

If Nvidia use base-10, then I can achieve about 90% of peak bandwidth, but if they’re using base-2, then my utilization is 85%. So which is the correct measure?
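For reference, the arithmetic behind those two percentages can be written out as a small host-side sketch. The helper names are made up for illustration and do not come from any Nvidia tool; the only inputs are the figures quoted above (150 GiB/s measured, 177.4 advertised).

```cpp
#include <cassert>

// Illustrative helpers; the names are hypothetical.

// Convert a rate in GiB/s (2^30 bytes) to GB/s (10^9 bytes).
inline double gibPerSecToGbPerSec(double gibPerSec) {
    assert(gibPerSec >= 0.0);
    return gibPerSec * 1073741824.0 / 1.0e9;
}

// Fraction of peak achieved. Both arguments must be in the same unit.
inline double utilization(double measured, double peak) {
    assert(peak > 0.0);
    return measured / peak;
}
```

With these, gibPerSecToGbPerSec(150.0) comes to about 161.06 GB/s. utilization(161.06, 177.4) is about 0.908 if the 177.4 figure is in GB/s, while utilization(150.0, 177.4) is about 0.846 if it were in GiB/s.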

In their bandwidth test, when outputs are reported in MB/s, the code is

    //calculate bandwidth in MB/s
    bandwidthInMBs = (1e3f * memSize * (float)MEMCOPY_ITERATIONS) /
                     (elapsedTimeInMs * (float)(1 << 20));

so that is in base 2 (or base 1024 if you wish). I would assume the same holds for other numbers. 85% of peak bandwidth sounds like a very nice number to me.
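To make the difference concrete, here is a minimal sketch that computes the rate of a timed copy both ways. The function and parameter names are hypothetical; only the (1 << 20) divisor is taken from the quoted code.

```cpp
#include <cassert>
#include <cstddef>

// Rate in MiB/s (2^20 bytes), matching the (1 << 20) divisor in the quoted code.
inline double bandwidthMiBs(std::size_t bytesPerCopy, int iterations, double elapsedMs) {
    assert(iterations > 0 && elapsedMs > 0.0);
    double totalBytes = static_cast<double>(bytesPerCopy) * iterations;
    double seconds = elapsedMs / 1.0e3;
    return totalBytes / seconds / 1048576.0;
}

// The same measurement expressed in SI MB/s (10^6 bytes).
inline double bandwidthMBs(std::size_t bytesPerCopy, int iterations, double elapsedMs) {
    assert(iterations > 0 && elapsedMs > 0.0);
    double totalBytes = static_cast<double>(bytesPerCopy) * iterations;
    double seconds = elapsedMs / 1.0e3;
    return totalBytes / seconds / 1.0e6;
}
```

For example, ten copies of 1 MiB each in 10 ms come out as roughly 1000 MiB/s from the first function and roughly 1048.6 MB/s from the second, a gap of about 4.9%. At the giga scale the gap grows to about 7.4%.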

The C2050 peak bandwidth is 1500 MHz * (384 bits / 8) * 2 [for GDDR5] = 144 * 10^9 bytes/sec = 144 GByte/sec
http://www.nvidia.com/object/product_tesla_C2050_C2070_us.html

The GTX480 peak bandwidth is 1848 MHz * (384 bits / 8) * 2 [for GDDR5] = 177.4 * 10^9 bytes/sec = 177.4 GByte/sec
http://www.nvidia.com/object/product_geforce_gtx_480_us.html
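Both calculations above follow the same formula, which can be sketched as a single function. The function name and parameter names are assumptions for illustration; the clock, bus width, and factor of 2 are the values quoted above.

```cpp
#include <cassert>

// Theoretical peak memory bandwidth in GB/s (10^9 bytes per second).
//   memClockMHz       memory clock in MHz
//   busWidthBits      memory interface width in bits
//   transfersPerClock 2 for the GDDR5 figures quoted above
inline double peakBandwidthGBs(double memClockMHz, int busWidthBits, int transfersPerClock) {
    assert(memClockMHz > 0.0 && busWidthBits > 0 && transfersPerClock > 0);
    double bytesPerTransfer = busWidthBits / 8.0;
    return memClockMHz * 1.0e6 * bytesPerTransfer * transfersPerClock / 1.0e9;
}
```

peakBandwidthGBs(1500.0, 384, 2) gives 144 GB/s for the C2050, and peakBandwidthGBs(1848.0, 384, 2) gives about 177.4 GB/s for the GTX480. Since MHz is itself a power-of-ten unit, the result is naturally in base-10 gigabytes.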

Thanks for the reply. So it looks like I can achieve up to 90% of potential bandwidth then since Nvidia uses the correct base-10 Giga definition. Hooray for SI standards :smile:

Another question, though not related. If I use the fmaf function in my code, how will this be interpreted on GT200? Will it issue a MAD instruction, or will it perform a software-emulated, IEEE-compliant FMA? What I need is to be able to write code that is guaranteed to use the fast FMA on Fermi and the fast MAD on GT200. I’ve found I can’t rely on the compiler to always issue an FMA/MAD instruction from “a += b*c” type code.

fmaf() is a math library function defined in C99 and requires fused multiply-add functionality. Therefore it maps to a single-precision FMA instruction on sm_2x, and a software emulation on sm_1x where there is no matching hardware instruction.

To generate either an FMA on sm_2x or an FMAD on sm_1x the CUDA math library uses an inlined function, along the following lines:

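    __device__ __forceinline__ float my_fmadf (float a, float b, float c)
    {
    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 200)
      // single-precision FMA intrinsic (__fma_rn is the double-precision one)
      return __fmaf_rn(a, b, c);
    #else
      // sm_1x: leave it to the compiler to contract this into an FMAD
      return a * b + c;
    #endif
    }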

While this does not guarantee the generation of an FMAD on sm_1x, one can get quite close in practice, especially when one breaks up longer computations such that each call to my_fmadf() becomes a separate expression whose result is assigned to a temporary variable. See the math library sources (i.e. header file math_functions.h) for numerous worked examples.
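As a rough illustration of that advice, the sketch below evaluates a cubic polynomial by Horner's rule with one my_fmadf() call per statement. It is not taken from math_functions.h: poly3 and the SKETCH_INLINE macro are hypothetical names, and the macro is only there so the same code builds with nvcc or a plain host C++ compiler.

```cpp
// Qualifiers for nvcc; plain "inline" when built as ordinary host C++.
#ifdef __CUDACC__
#define SKETCH_INLINE __host__ __device__ __forceinline__
#else
#define SKETCH_INLINE inline
#endif

// Same shape as the helper discussed above.
SKETCH_INLINE float my_fmadf(float a, float b, float c) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 200)
    return __fmaf_rn(a, b, c);  // single-precision FMA on sm_2x
#else
    return a * b + c;           // sm_1x device code and host code take this path
#endif
}

// Hypothetical example: c0 + c1*x + c2*x^2 + c3*x^3 by Horner's rule.
// Each my_fmadf() call is its own statement with its own temporary.
SKETCH_INLINE float poly3(float x, float c0, float c1, float c2, float c3) {
    float t2 = my_fmadf(c3, x, c2);
    float t1 = my_fmadf(t2, x, c1);
    float t0 = my_fmadf(t1, x, c0);
    return t0;
}
```

Writing the same polynomial as one nested expression is mathematically identical, but keeping each multiply-add in its own statement follows the pattern described above. Whether an FMAD is actually emitted on sm_1x is still up to the compiler, so it is worth inspecting the generated code.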