Quick sanity check please! (GTX 1080 performance, CUDA Nbody sample)

CUDA 8 SDK Final, 373.06 drivers, Windows 10 Anniversary Update
Intel i7-5930K, 32GB RAM

If you have a GTX 1080 system, could you please confirm whether these performance figures are normal? I’m not trying to conduct the world’s most rigorous benchmark, but a quick A/B comparison would be very helpful. Many thanks in advance.

nbody.exe -benchmark -numdevices=1

number of CUDA devices = 1

Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
1 Devices used for simulation
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

Compute 6.1 CUDA device: [GeForce GTX 1080]
20480 bodies, total time for 10 iterations: 18.530 ms
= 226.348 billion interactions per second
= 4526.967 single-precision GFLOP/s at 20 flops per interaction

nbody.exe -benchmark -numdevices=2

number of CUDA devices = 2

Windowed mode
Simulation data stored in system memory
Single precision floating point simulation
2 Devices used for simulation
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

Compute 6.1 CUDA device: [GeForce GTX 1080]
Compute 6.1 CUDA device: [GeForce GTX 1080]
40960 bodies, total time for 10 iterations: 40.441 ms
= 414.858 billion interactions per second
= 8297.167 single-precision GFLOP/s at 20 flops per interaction
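
For anyone who wants to check the arithmetic behind these figures: the numbers in the output are consistent with an all-pairs count of bodies × bodies interactions per iteration, divided by the elapsed time, then multiplied by the 20 flops per interaction that the sample prints. The sketch below is only my reading of the output, not the sample's actual source, and the function name `nbody_throughput` is made up for illustration.

```python
def nbody_throughput(num_bodies, iterations, total_ms, flops_per_interaction=20):
    """Return (billion interactions/s, GFLOP/s) for an all-pairs n-body run.

    Assumes every body interacts with every body once per iteration,
    which matches the figures printed in the benchmark output above.
    """
    interactions = num_bodies * num_bodies * iterations
    seconds = total_ms / 1000.0
    giga_interactions_per_s = interactions / seconds / 1e9
    gflops = giga_interactions_per_s * flops_per_interaction
    return giga_interactions_per_s, gflops


# Inputs copied from the two runs above.
one_gpu = nbody_throughput(20480, 10, 18.530)
two_gpu = nbody_throughput(40960, 10, 40.441)

print(f"1 GPU:  {one_gpu[0]:.1f} billion interactions/s, {one_gpu[1]:.0f} GFLOP/s")
print(f"2 GPUs: {two_gpu[0]:.1f} billion interactions/s, {two_gpu[1]:.0f} GFLOP/s")
```

The results land very close to the reported values. Small differences in the last digits are expected because the printed time is rounded to three decimals.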

On my system (Win 7 x64, GTX 1080, GTX 690, CUDA 8.0, 16 GB RAM, Intel Core i7-3770K), I get the following results:

number of CUDA devices  = 1
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

> Compute 6.1 CUDA device: [GeForce GTX 1080]
20480 bodies, total time for 10 iterations: 19.954 ms
= 210.202 billion interactions per second
= 4204.044 single-precision GFLOP/s at 20 flops per interaction

Thank you phw89! Much appreciated.

number of CUDA devices  = 2
> Windowed mode
> Simulation data stored in system memory
> Single precision floating point simulation
> 2 Devices used for simulation
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

> Compute 6.1 CUDA device: [GeForce GTX 1080]
> Compute 6.1 CUDA device: [GeForce GTX 1080]
40960 bodies, total time for 10 iterations: 7385.681 ms
= 2.272 billion interactions per second
= 45.432 single-precision GFLOP/s at 20 flops per interaction

I’m not getting anywhere near that for some reason :(
Any ideas? Have you guys seen this before?

I’m running on Windows 10

rgee27, I accidentally ran this sample in debug mode the other day and got the same performance figure. Recompile in release mode and things should be back to normal.

RobbieBC, Thanks! That did the trick.

number of CUDA devices  = 2
> Windowed mode
> Simulation data stored in system memory
> Single precision floating point simulation
> 2 Devices used for simulation
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

> Compute 6.1 CUDA device: [GeForce GTX 1080]
> Compute 6.1 CUDA device: [GeForce GTX 1080]
40960 bodies, total time for 10 iterations: 46.595 ms
= 360.064 billion interactions per second
= 7201.283 single-precision GFLOP/s at 20 flops per interaction
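
To put the debug versus release difference in perspective, here is a rough sketch that compares the two 2-GPU timings posted above. The cutoff `SUSPICIOUS_FACTOR` is a hypothetical rule of thumb of mine, not anything defined by the sample or the CUDA toolkit.

```python
def slowdown(suspect_ms, reference_ms):
    """How many times slower the suspect run is than the reference run."""
    return suspect_ms / reference_ms


# Timings copied from the two 2-GPU runs in this thread (40960 bodies, 10 iterations).
debug_ms = 7385.681    # before recompiling
release_ms = 46.595    # after recompiling in release mode

# Hypothetical threshold: anything this far off a known-good run is worth
# a second look at the build configuration before blaming the hardware.
SUSPICIOUS_FACTOR = 10

ratio = slowdown(debug_ms, release_ms)
print(f"Debug run was about {ratio:.0f}x slower than the release run")
if ratio > SUSPICIOUS_FACTOR:
    print("Gap is far too large for normal run-to-run variance; check the build mode")
```

A gap of two orders of magnitude is well beyond what driver or OS differences would normally explain, so the build configuration is a sensible first thing to check.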