I’ve been measuring the peak memory bandwidth I can obtain in some linear algebra kernels I’ve written. I want to make sure I am calculating my percentage utilization correctly (or, really, to check that Nvidia is reporting their bandwidth correctly). The most I seem to be able to get out of a GTX 480 is 150 GiB/s = 161 GB/s. Nvidia reports the peak memory bandwidth of the 480 as 177.4 GB/s. Since they state GB/s and not GiB/s, I presume they are counting memory bandwidth in base 10 (Gigabyte = 10^9 bytes), not base 2 (Gibibyte = 2^30 bytes)?
If Nvidia uses base 10, then I can achieve about 90% of peak bandwidth, but if they’re using base 2, then my utilization is only 85%. So which is the correct measure?
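For concreteness, here is the arithmetic I’m working from (a minimal host-side C sketch; 150 GiB/s is my measured figure, and 177.4 GB/s is the number from the spec sheet):

#include <stdio.h>

int main(void)
{
    const double measured = 150.0 * 1073741824.0;  /* 150 GiB/s in bytes/s (2^30) */
    const double peak_b10 = 177.4e9;               /* spec figure, if GB = 10^9 bytes */
    const double peak_b2  = 177.4 * 1073741824.0;  /* spec figure, if GB = 2^30 bytes */
    printf ("measured:              %.1f GB/s\n", measured / 1e9);            /* ~161.1 */
    printf ("utilization (base 10): %.1f%%\n", 100.0 * measured / peak_b10);  /* ~90.8% */
    printf ("utilization (base 2):  %.1f%%\n", 100.0 * measured / peak_b2);   /* ~84.6% */
    return 0;
}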
So that is in base 2 (or base 1024, if you wish). I would assume the same holds for the other numbers. 85% of peak bandwidth sounds like a very nice number to me.
Thanks for the reply. So it looks like I can achieve up to 90% of potential bandwidth, then, since Nvidia uses the correct base-10 Giga definition. Hooray for SI standards!
Another question, though unrelated: if I use the fmaf function in my code, how will it be interpreted on GT200? Will it issue a MAD instruction, or will it perform a software-emulated, IEEE-compliant FMA? What I need is to be able to write code that is guaranteed to use the fast FMA on Fermi and the fast MAD on GT200. I’ve found I can’t rely on the compiler to always issue an FMA/MAD instruction from “a += b*c” type code.
fmaf() is a math library function defined in C99, and it requires fused multiply-add functionality. It therefore maps to a single-precision FMA instruction on sm_2x, and to a software emulation on sm_1x, where there is no matching hardware instruction.
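To see why FMA is not simply interchangeable with a multiply followed by an add: FMA rounds only once, after adding the unrounded double-width product, whereas separate multiply and add operations round twice (and GT200’s FMAD additionally truncates the intermediate product). A small host-side C99 sketch of a case where the two disagree (this assumes the compiler does not itself contract a * a + c into an FMA, e.g., it is built with -ffp-contract=off):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* a*a = 1 + 2^-11 + 2^-24 exactly; rounding the product to single
       precision drops the 2^-24 bit, while the fused operation keeps it */
    float a = 1.0f + 0x1.0p-12f;
    float c = -(1.0f + 0x1.0p-11f);
    printf ("a * a + c     = %a\n", a * a + c);      /* 0x0p+0:  two roundings */
    printf ("fmaf(a, a, c) = %a\n", fmaf (a, a, c)); /* 0x1p-24: one rounding */
    return 0;
}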
To generate either an FMA on sm_2x or an FMAD on sm_1x, the CUDA math library uses an inlined function along the following lines:
__device__ __forceinline__ float my_fmadf (float a, float b, float c)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 200)
    /* sm_2x: single-precision fused multiply-add, round to nearest even */
    return __fma_rn (a, b, c);
#else
    /* sm_1x: let the compiler contract this into an FMAD instruction */
    return a * b + c;
#endif
}
While this does not guarantee the generation of an FMAD on sm_1x, one can get quite close in practice, especially when one breaks up longer computations such that each call to my_fmadf() becomes a separate expression whose result is assigned to a temporary variable. See the math library sources (i.e., the header file math_functions.h) for numerous worked examples.
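For instance, following that recipe, a cubic polynomial in Horner form might be written as follows (my_poly3f and the coefficients c0 through c3 are made up for illustration; each step is a separate statement, so the compiler sees one FMA/FMAD candidate at a time):

__device__ float my_poly3f (float x, float c0, float c1, float c2, float c3)
{
    float t;
    t = my_fmadf (c3, x, c2);   /* c3*x + c2 */
    t = my_fmadf (t,  x, c1);   /* (c3*x + c2)*x + c1 */
    t = my_fmadf (t,  x, c0);   /* ((c3*x + c2)*x + c1)*x + c0 */
    return t;
}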