I am using an NVIDIA C870 GPU to accelerate my neural network application, and I want to know its theoretical peak FLOP/s and memory bandwidth. I calculated the theoretical peak as: 1.35 GHz * 128 cores * 2 FLOPs = 345.6 GFLOP/s. But the C870 installation guide says it has a performance of over 500 GFLOP/s. Which is right? Am I missing something?
I have another question: the C870 installation guide also says the device memory bandwidth is 76.8 GB/s. How is this value calculated or found?
The C870 nominally has the ability to process a multiply-add and a multiply instruction at the same time, giving 3 floating point operations per clock instead of two: 1.35 GHz * 128 cores * 3 FLOPs = 518.4 GFLOP/s, which is the "over 500 GFLOP/s" figure in the guide. After these parts were released, it was found that a scheduling limitation meant the dual-issued MUL was very unlikely to actually occur, so in practice the peak is more like what you calculated. The GT200 series fixes this problem and is apparently able to dual-issue the MUL with greater probability, when one is available in the instruction stream.
The memory bandwidth is computed by [memory clock] * [memory bus width in bits / 8] * 2 (for DDR) = 800 MHz * (384/8) bytes * 2 = 76.8 GB/sec.
I have another question, about host-to-device bandwidth measurement. After running the bandwidthTest example from the CUDA SDK on my system (Intel Core 2 Quad 2.66 GHz with a Tesla C870), I found that the H-to-D bandwidth is 1.5 GB/s. Is there a way to determine the theoretical peak H-to-D bandwidth for this kind of system?
Since the C870 is a PCI-Express 1.0 x16 device, the theoretical maximum is 4 GB/sec in either direction. However, most motherboards fall short of this. (And there can be a surprising amount of variation between motherboard models.)
In addition, there is a large difference between normal memory transfers and “pinned memory” transfers. You should also run the bandwidth test with --memory=pinned, as that will get you closest to the theoretical bandwidth limit. You will probably see 2.5-3 GB/sec with that option.
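If you want to measure this yourself outside the SDK sample, a minimal CUDA sketch along these lines should work (assumes a working CUDA install and an attached device; error checking omitted for brevity, and the 32 MB transfer size is an arbitrary choice):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 32 << 20;            // 32 MB transfer
    float *h_pinned, *d_buf;

    cudaMallocHost((void **)&h_pinned, bytes); // page-locked host memory
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
    printf("H-to-D (pinned): %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```

Swapping `cudaMallocHost`/`cudaFreeHost` for plain `malloc`/`free` gives the pageable-memory number, so you can see the difference the pinned path makes on your board.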