Interesting. I ran this benchmark on a server with an Nvidia GTX Titan and got better results than your K20c numbers above. The server has 2x 10-core E5-2690 v2 CPUs, PCIe 3.0, and 8x 8 GB of 1866 MHz DDR3 RAM. The CUDA version is 5.5 and the Nvidia driver version is 319.49.
nbody -benchmark -numbodies=212992 -device=0
1868.738 single-precision GFLOP/s at 20 flops per interaction
nbody -benchmark -numbodies=212992 -device=0 -fp64
742.626 double-precision GFLOP/s at 30 flops per interaction
Even more remarkable: with two GTX Titans I got a more than linear speedup:
nbody -benchmark -numbodies=212992 -numdevices=2
7222.349 single-precision GFLOP/s at 20 flops per interaction
nbody -benchmark -numbodies=212992 -numdevices=2 -fp64
2661.470 double-precision GFLOP/s at 30 flops per interaction
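To put a number on "more than linear", here is a small sketch that divides the two-GPU throughput by the one-GPU throughput for each precision. The figures are copied from the runs above; the variable and function names are my own and are not part of the nbody sample.

```python
# Throughput figures copied from the nbody runs above, in GFLOP/s.
single_gpu = {"fp32": 1868.738, "fp64": 742.626}
dual_gpu = {"fp32": 7222.349, "fp64": 2661.470}
NUM_GPUS = 2  # number of devices used in the -numdevices=2 runs


def speedup(multi: float, single: float) -> float:
    """Ratio of multi-GPU throughput to single-GPU throughput."""
    return multi / single


results = {}
for precision in single_gpu:
    s = speedup(dual_gpu[precision], single_gpu[precision])
    results[precision] = s
    # Parallel efficiency: 100% would be exactly linear scaling.
    efficiency = s / NUM_GPUS
    print(f"{precision}: speedup {s:.2f}x, parallel efficiency {efficiency:.0%}")
```

With these numbers the speedup comes out at roughly 3.9x in single precision and 3.6x in double precision on two devices, so both runs are well above the 2x that linear scaling would give.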
Is this for real? Or is this nbody benchmark unreliable?