nbody benchmark performance on Tesla P4 is lower than expected

Hi, I used the nbody sample to test the performance of a Tesla P4 device. The command was:
./nbody -benchmark -numbodies=256000

And the result was:

Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
1 Devices used for simulation
GPU Device 0: “Tesla P4” with compute capability 6.1

Compute 6.1 CUDA device: [Tesla P4]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 3456.762 ms
= 189.588 billion interactions per second
= 3791.757 single-precision GFLOP/s at 20 flops per interaction

But according to the P4 datasheet, single-precision performance should be 5500 GFLOP/s.
Where is the problem, and what should I do to get the 5500 result?
Best regards.

The datasheet numbers are peak theoretical numbers. It’s not possible to achieve those numbers in practice with any real-world code. The codes that come closest are large matrix-matrix multiplies, which can often reach about 90% of peak theoretical.
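As a rough illustration of where a peak figure like this comes from: it is typically computed as CUDA cores × 2 FLOPs per cycle (one fused multiply-add per core) × clock rate. The sketch below assumes 2560 CUDA cores and a boost clock of about 1063 MHz for the P4. Neither value appears in this thread, so treat them as assumptions to check against the datasheet; the helper name `peak_gflops` is made up for this example.

```python
def peak_gflops(cuda_cores, clock_mhz, flops_per_cycle=2):
    """Theoretical single-precision peak in GFLOP/s.

    flops_per_cycle=2 counts one fused multiply-add per core per cycle.
    """
    return cuda_cores * flops_per_cycle * clock_mhz / 1000.0

# Assumed P4 figures (not taken from this thread): 2560 cores, ~1063 MHz boost.
p4_peak = peak_gflops(2560, 1063)
print(f"Theoretical peak: {p4_peak:.0f} GFLOP/s")
```

The result lands close to the rounded 5500 GFLOP/s datasheet figure, and it only holds if every core issues an FMA on every cycle at boost clock.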

The nbody code does have a lot of arithmetic but doesn’t present it in a way that allows it to get to even 90% of peak theoretical.
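To put a number on that, here is a minimal sketch that redoes the benchmark's arithmetic from the output posted above and compares it with the 5500 GFLOP/s datasheet figure. It assumes the benchmark counts N × N interactions per iteration, which is consistent with the printed interaction rate.

```python
# Values copied from the benchmark output in the question.
num_bodies = 256_000
iterations = 10
elapsed_ms = 3456.762
flops_per_interaction = 20

# Datasheet figure quoted in the question.
datasheet_peak_gflops = 5500.0

# Assumption: every body interacts with every body once per iteration.
interactions_per_s = num_bodies**2 * iterations / (elapsed_ms / 1000.0)
achieved_gflops = interactions_per_s * flops_per_interaction / 1e9
fraction_of_peak = achieved_gflops / datasheet_peak_gflops

print(f"{interactions_per_s / 1e9:.3f} billion interactions per second")
print(f"{achieved_gflops:.1f} GFLOP/s achieved")
print(f"{fraction_of_peak:.1%} of datasheet peak")
```

By this arithmetic the run reaches roughly 69% of the datasheet peak.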

… and often employ hand-written machine code or at least carefully massaged source code to get there.

Typical straightforwardly written compiled code often maxes out around 75% to 80% of theoretical peak performance. That limitation applies to any modern compute platform, not just CUDA.

Thank you for your reply.

But are there any tools or software you would recommend for testing the performance of current GPU products?

Try this: Multi-GPU CUDA stress test
It’s more of a stress test and gets close to peak theoretical flops.