Performance comparison yields strange results

Hey there,

I wanted to measure the performance of the individual processing steps of an algorithm. The benchmark project can be found at https://github.com/One3Three7/CUDARangeDopplerProcessing. (Note that the project does not validate the calculated results.)

Anyway, the performance comparison was run on a GeForce GTX 960M and on a Quadro P6000. What surprised me was that the weaker mobile GPU (960M) delivered much better results than the stronger P6000, especially for FFT calculations. For example, the mobile device needs 17.64 ms to calculate the FFTs (NX = 4096; Batch = 8192; Type = C2C), while the Quadro needs around 28.57 ms. The Hilbert transform (same data size as the FFT above) is even worse: it takes 94.91 ms on the 960M and 259.72 ms on the P6000.
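
For scale, the two FFT timings above can be turned into a nominal throughput figure. This is only a rough sketch: it assumes the conventional 5 * N * log2(N) operation count per complex transform (a benchmarking convention, not what cuFFT actually executes) and single-precision complex samples of 8 bytes each. The function names are made up for illustration.

```python
import math

def fft_batch_gflops(nx, batch, elapsed_ms):
    """Nominal throughput of a batched complex FFT in GFLOP/s.

    Assumes the conventional 5 * N * log2(N) operation count per transform.
    """
    flops = 5.0 * nx * math.log2(nx) * batch
    return flops / (elapsed_ms * 1e-3) / 1e9

def fft_batch_bytes(nx, batch, bytes_per_sample=8):
    """Size in bytes of one copy of the data set.

    8 bytes per sample assumes single-precision complex values.
    """
    return nx * batch * bytes_per_sample

# Timings quoted in the post (NX = 4096, Batch = 8192, C2C).
for name, ms in (("GTX 960M", 17.64), ("Quadro P6000", 28.57)):
    print(f"{name}: {fft_batch_gflops(4096, 8192, ms):.1f} GFLOP/s nominal")
print(f"data set: {fft_batch_bytes(4096, 8192) / 2**20:.0f} MiB per buffer")
```

Comparing numbers like these against what each card reaches in other benchmarks can show whether the P6000 is merely slower than expected or running far below its capability.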

Watching nvidia-smi revealed the following: the volatile GPU utilization was almost 100%, whereas the power draw of the Quadro P6000 during the run was only up to 30 W out of a possible 250 W.

Thu Aug  8 16:55:05 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14       Driver Version: 430.14       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:05:00.0 Off |                  Off |
| 26%   43C    P8    25W / 250W |   1230MiB / 24447MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3322      C   ...SoftwareProjects/Cuda/Test1/Debug/Test1  1073MiB |
|    0      3470      C   /usr/lib/libreoffice/program/soffice.bin     147MiB |
+-----------------------------------------------------------------------------+

I’m a bit at a loss at the moment, but I’m still sure that the Quadro should deliver much better results than the GeForce.

Another thing that seems strange to me is that nvidia-smi shows CUDA version 10.2, while nvcc actually reports 10.1.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168

I would be grateful for any advice!
Niclas

If it were me, I wouldn’t let libreoffice use my GPU while I was trying to run performance tests.

If it were me, I wouldn’t time extraneous things like printf statements.

If it were me, to see if there actually was a problem with FFT performance, I would develop a simple code that just tested an FFT operation, and run that.

Is it the same OS and CUDA version on each machine?
Are you using the same makefile and project code in each test?

What is the output of

nvidia-smi -a

when the test code is running?

If it were me, I wouldn’t let libreoffice use my GPU while I was trying to run performance tests.
Normally I run the performance tests isolated on the GPU, but thanks for the hint.

If it were me, I wouldn’t time extraneous things like printf statements.
You are totally right, I removed these statements.

If it were me, to see if there actually was a problem with FFT performance, I would develop a simple code that just tested an FFT operation, and run that.
Yep, I already did that, and apart from a slight performance improvement, the GPU still did not run at full (power) load. The results on the Quadro remain largely unsatisfactory. However, for my purposes it is necessary to test all processing steps one after the other, since they are used to compare the entire algorithm across different data sizes.

The OS is the same on both computers (Ubuntu 18.04, by the way), and the CUDA versions are also the same, as long as the CUDA version shown by nvidia-smi can be ignored.

CUDARangeDopplerProcessing$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp                           : Fri Aug  9 17:38:07 2019
Driver Version                      : 430.14
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:05:00.0
    Product Name                    : Quadro P6000
    Product Brand                   : Quadro
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322118058547
    GPU UUID                        : GPU-5ec29e7a-fef0-f1d0-f85d-d428375b6e15
    Minor Number                    : 0
    VBIOS Version                   : 86.02.2D.00.04
    MultiGPU Board                  : No
    Board ID                        : 0x500
    GPU Part Number                 : 900-5G611-1700-000
    Inforom Version
        Image Version               : G611.0500.00.02
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x05
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B3010DE
        Bus Id                      : 00000000:05:00.0
        Sub System Id               : 0x11A010DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 9000 KB/s
    Fan Speed                       : 26 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 24447 MiB
        Used                        : 4905 MiB
        Free                        : 19542 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 99 %
        Memory                      : 66 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : Disabled
        Pending                     : Disabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Temperature
        GPU Current Temp            : 44 C
        GPU Shutdown Temp           : 100 C
        GPU Slowdown Temp           : 97 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 30.21 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : 1506 MHz
        Memory                      : 4513 MHz
    Default Applications Clocks
        Graphics                    : 1506 MHz
        Memory                      : 4513 MHz
    Max Clocks
        Graphics                    : 1657 MHz
        SM                          : 1657 MHz
        Memory                      : 4513 MHz
        Video                       : 1493 MHz
    Max Customer Boost Clocks
        Graphics                    : 1657 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 7466
            Type                    : C
            Name                    : bin/run
            Used GPU Memory         : 4895 MiB

One more thing needs to be said: after I posted here and made some changes afterwards, the tests turned out as I expected. At the same time, the power draw of the GPU increased from 30 W to 160-180 W. After restarting the computer today, I could not reproduce the “good” results, and the Quadro again does not exceed 30 W. I cannot pin down what triggers the change, but I’m relatively sure it is not my code and is probably the GPU configuration.

I still have questions:

Does the Quadro P6000 have different power modes that can be manually configured? If so, how does it work?

Regarding the CUDA version reported by nvidia-smi, you may wish to read this:

https://stackoverflow.com/questions/53422407/different-cuda-versions-shown-by-nvcc-and-nvidia-smi

The GPU gives no direct control over power state. You have some indirect control by setting application clocks with nvidia-smi (use the command-line help, -h, to learn about it). However, I don’t know whether Quadro supports modifying application clocks, and in any case your application clocks and default application clocks are not low.

It’s possible that something is restricting the GPU clocks, but I’m getting no indication of that from nvidia-smi. Another possibility is that the bursts of testing are too short to bring the GPU out of the P8 state for any significant/observable period of time. Normally the GPU is operating in full-performance mode when the power state is P0, P1, or P2.
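
To make the P8 observation concrete, here is a small sketch that pulls the performance state and the current and maximum SM clocks out of `nvidia-smi -a` text. The excerpt in SAMPLE is copied from the log posted above; the function names are made up for illustration, and the parsing assumes the indented layout shown in that log, which may differ between driver versions.

```python
import re

# Excerpt of the `nvidia-smi -a` log posted earlier in this thread.
SAMPLE = """\
    Performance State               : P8
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
    Max Clocks
        Graphics                    : 1657 MHz
        SM                          : 1657 MHz
"""

def _sm_clock(section, log_text):
    """Return the SM clock in MHz listed under the given section header line."""
    pattern = (
        rf"^[ \t]*{re.escape(section)}[ \t]*\n"   # header line, e.g. "Clocks"
        r"(?:.*\n)*?"                             # skip lines until the SM entry
        r"[ \t]*SM[ \t]*:[ \t]*(\d+) MHz"
    )
    return int(re.search(pattern, log_text, re.M).group(1))

def clock_summary(log_text):
    """Summarize the performance state and how close the SM clock is to its maximum."""
    pstate = re.search(r"Performance State[ \t]*:[ \t]*(P\d+)", log_text).group(1)
    current = _sm_clock("Clocks", log_text)
    maximum = _sm_clock("Max Clocks", log_text)
    return {
        "pstate": pstate,
        "sm_mhz": current,
        "sm_max_mhz": maximum,
        "fraction_of_max": current / maximum,
    }

print(clock_summary(SAMPLE))
```

For the posted log this reports P8 with the SM clock at 139 MHz out of 1657 MHz, roughly 8% of the maximum, which fits the 30 W power draw.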

You could try opening a separate terminal and running nvidia-smi in a loop there (nvidia-smi -l), then running your test in another terminal to see whether any of the observations change.
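
If the full looping output is too noisy, a scripted variant is possible. The sketch below assumes the `--query-gpu` fields `timestamp`, `pstate`, `clocks.sm`, `power.draw` and `utilization.gpu` are available on your driver (`nvidia-smi --help-query-gpu` lists the supported names) and that each field returns a plain value rather than something like "[N/A]". The function names are made up for illustration.

```python
import subprocess

QUERY = "timestamp,pstate,clocks.sm,power.draw,utilization.gpu"

def parse_sample(csv_line):
    """Split one `--format=csv,noheader,nounits` line into named fields."""
    timestamp, pstate, sm_mhz, power_w, util = [p.strip() for p in csv_line.split(",")]
    return {
        "timestamp": timestamp,
        "pstate": pstate,
        "sm_mhz": int(sm_mhz),
        "power_w": float(power_w),
        "util_pct": int(util),
    }

def watch(interval_s=1):
    """Print one compact line per sample until interrupted with Ctrl+C."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={QUERY}",
        "--format=csv,noheader,nounits",
        "-l", str(interval_s),
    ]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            s = parse_sample(line)
            print(f"{s['timestamp']}  {s['pstate']}  "
                  f"{s['sm_mhz']} MHz  {s['power_w']} W  {s['util_pct']} %")
```

Calling watch() in the second terminal while the benchmark runs should show whether the state ever leaves P8 and whether the SM clock rises during the timed section.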

Alternatively, take your individual FFT test and run 10 or 100 FFTs in a row, back-to-back, and see if the observations are different.
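
The back-to-back idea can be sketched as a tiny timing harness. This is only an outline of the measurement pattern, written in Python for brevity: `run_once` and `sync` are hypothetical stand-ins for whatever launches one FFT (for example a cufftExecC2C call in the real code) and whatever waits for the GPU to finish (for example cudaDeviceSynchronize).

```python
import time

def time_back_to_back(run_once, sync, repeats=100, warmup=5):
    """Average wall-clock milliseconds per call over `repeats` consecutive calls.

    run_once: hypothetical callable that launches one unit of work.
    sync:     hypothetical callable that blocks until all launched work is done.
    """
    # Warm-up runs give the clocks time to ramp up and absorb one-time setup cost.
    for _ in range(warmup):
        run_once()
    sync()

    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    sync()  # wait for the queued work before stopping the clock
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1e3 / repeats
```

If the per-call time drops noticeably as the number of repeats grows, that would point toward the GPU needing sustained load before it leaves the low-power state.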