Hi, when I used nbody to test the performance of a Tesla P4 device, the command was like this:
./nbody -benchmark -numbodies=256000
And the result I get is like this:
Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
1 Devices used for simulation
GPU Device 0: “Tesla P4” with compute capability 6.1
Compute 6.1 CUDA device: [Tesla P4]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 3456.762 ms
= 189.588 billion interactions per second
= 3791.757 single-precision GFLOP/s at 20 flops per interaction
But according to the P4 datasheet, single-precision performance should be 5500 GFLOP/s.
Where is the problem, and what should I do to get the 5500 result?
Best regards.
The datasheet numbers are peak theoretical numbers. It’s not possible to achieve them in practice with any real-world code. The codes that come closest are large matrix-matrix multiplies, which can often reach about 90% of peak theoretical.
The nbody code does a lot of arithmetic, but doesn’t present it in a way that allows it to reach as much as 90% of peak theoretical.
… and often employ hand-written machine code or at least carefully massaged source code to get there.
Typical straightforwardly written compiled code often maxes out around 75% to 80% of theoretical peak performance. That limitation applies to any modern compute platform, not just CUDA.
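To put the measured figure in context, here is a rough sketch in Python that recomputes the benchmark's reported throughput from the numbers in the output above and compares it with the datasheet value quoted in the question. It assumes the sample counts N × N interactions per iteration (inferred from the printed figures, not confirmed in this thread); the function and variable names are hypothetical and only for illustration.

```python
# Recompute the nbody benchmark figures from the output quoted in the question.
# Assumption: the benchmark counts num_bodies * num_bodies interactions per iteration.

def nbody_throughput(num_bodies, iterations, elapsed_ms, flops_per_interaction=20):
    """Return (billion interactions per second, GFLOP/s) for one benchmark run."""
    interactions = num_bodies * num_bodies * iterations
    seconds = elapsed_ms / 1000.0
    ginteractions_per_s = interactions / seconds / 1e9
    gflops = ginteractions_per_s * flops_per_interaction
    return ginteractions_per_s, gflops

# Values taken from the benchmark output in the question.
num_bodies = 256_000
iterations = 10
elapsed_ms = 3456.762

# Peak figure as quoted by the original poster from the P4 datasheet.
datasheet_peak_gflops = 5500.0

ginter, gflops = nbody_throughput(num_bodies, iterations, elapsed_ms)
fraction_of_peak = gflops / datasheet_peak_gflops

print(f"{ginter:.3f} billion interactions per second")
print(f"{gflops:.3f} single-precision GFLOP/s")
print(f"{fraction_of_peak:.1%} of the quoted datasheet peak")
```

Under that assumption the result works out to a little under 70% of the quoted peak, which is not far below the 75% to 80% range mentioned above for straightforwardly written compiled code.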