Terrible host<->device bandwidth seen with bandwidthTest

Here’s the output from the bandwidthTest sample (x86_64 release):

 Device 0: Quadro GV100
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     0.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     0.8

The card is in an x16 PCIe slot. The motherboard is a new dual-socket Dell board with a pair of Xeon W-2295 CPUs. I’m running Ubuntu 18. I’d expect about 10x better performance than this. Does anyone have ideas on how to debug this problem?

Here’s the output from lspci -vv:

0000:02:00.0 VGA compatible controller: NVIDIA Corporation GV100GL [Quadro GV100] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Dell Device 121a
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 135
        NUMA node: 0
        Region 0: Memory at 91000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 70000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at 80000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at 2000 [size=128]
        [virtual] Expansion ROM at 92000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00778  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 34326183936ns
                Max no snoop latency: 34326183936ns
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=44us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] #19
        Capabilities: [ac0 v1] #23
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

You mention a “Dell motherboard”. So is this a Dell system? If so, what model? Was this system shipped with the GPU installed, or did you install the GPU yourself?

If this is a Dell system, and it came with the GPU already installed, contact Dell support. My personal experience with them is good.

If you installed the GPU yourself, double-check your assertion that the GPU is installed in a PCIe gen 3 x16 slot. While a CUDA-based app is running, run nvidia-smi -q and look at the link section of the output. It should look something like this:

   GPU Link Info
   PCIe Generation
       Max                 : 3
       Current             : 3
   Link Width
       Max                 : 16x
       Current             : 16x

If the current link generation is not “3” or the current link width is not “16x”, check for other PCIe slots on the motherboard and check BIOS settings.
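As a rough sanity check: theoretical PCIe bandwidth scales linearly with link width, so a derated link shows up directly in the bandwidthTest numbers. Here is a back-of-envelope sketch (the link_bandwidth_gbps helper is mine for illustration; packet/protocol overhead is ignored, so measured figures land somewhat below these):

```python
# Rough theoretical one-direction PCIe bandwidth per generation and width.
# Back-of-envelope only: TLP/packet overhead is ignored, so real-world
# throughput is noticeably lower than these numbers.

# raw signaling rate in GT/s and line-encoding efficiency per generation
GEN_PARAMS = {
    1: (2.5, 8 / 10),     # gen1: 8b/10b encoding
    2: (5.0, 8 / 10),     # gen2: 8b/10b encoding
    3: (8.0, 128 / 130),  # gen3: 128b/130b encoding
}

def link_bandwidth_gbps(gen: int, width: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    rate, eff = GEN_PARAMS[gen]
    return rate * eff / 8 * width  # GT/s -> GB/s per lane, times lane count

print(f"gen3 x16: {link_bandwidth_gbps(3, 16):.2f} GB/s")  # ~15.75
print(f"gen3 x1 : {link_bandwidth_gbps(3, 1):.2f} GB/s")   # ~0.98
```

By these numbers, the reported 0.8 GB/s is roughly what a single gen3 lane delivers once overhead is accounted for, about a factor of 16 below a healthy x16 link.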

Check the power supply of the GPU. There should be two PCIe auxiliary power cables going to the GPU, one with a 6-pin connector and one with an 8-pin connector. Make sure the connectors are properly inserted (the tab should engage). What’s the wattage rating of the power supply (PSU) in this system? It should be 1000W or more, depending on the overall system configuration, which wasn’t disclosed.

While a CUDA-based app is running, check the “Temperature” section of the nvidia-smi -q output, specifically the “GPU Current Temperature” item, for excessively high temperatures. Also check the “Clocks Throttle Reasons” section for any active throttle reasons.
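If scanning the long nvidia-smi -q listing by hand is tedious, the items of interest can be pulled out with a short script. A minimal sketch (the extract_fields helper and the sample text are mine; exact field names vary slightly between driver versions, so treat them as illustrative):

```python
import re

def extract_fields(nvsmi_q_output: str, keys: list) -> dict:
    """Pull selected 'Key : Value' lines out of `nvidia-smi -q` output."""
    result = {}
    for line in nvsmi_q_output.splitlines():
        m = re.match(r"\s*(.+?)\s+:\s+(.*)$", line)
        if m and m.group(1) in keys:
            result[m.group(1)] = m.group(2)
    return result

# Illustrative sample in the general shape of `nvidia-smi -q` output
sample = """\
    GPU Current Temp                  : 45 C
    Clocks Throttle Reasons
        Idle                          : Active
        HW Thermal Slowdown           : Not Active
"""
print(extract_fields(sample, ["GPU Current Temp", "HW Thermal Slowdown"]))
```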


Thanks for the excellent reply. Good point on Dell support. I don’t know if the GPU was pre-installed, since the machine was set up by my colleagues. I will make inquiries.

nvidia-smi -q shows:

 GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 1x

I guess I’ll check the BIOS next and physically inspect the PCIe slots.
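For what it’s worth, the kernel reports the same link state through sysfs, independent of the NVIDIA tools. A small sketch (the read_link_info helper is mine; the 0000:02:00.0 address comes from the lspci listing above):

```python
from pathlib import Path

def read_link_info(dev_dir: str) -> dict:
    """Read negotiated vs. maximum PCIe link speed/width from a sysfs
    PCI device directory (e.g. /sys/bus/pci/devices/0000:02:00.0)."""
    dev = Path(dev_dir)
    fields = ["current_link_speed", "current_link_width",
              "max_link_speed", "max_link_width"]
    return {f: (dev / f).read_text().strip() for f in fields}

# Usage on the machine in question (BDF address from the lspci output):
#   read_link_info("/sys/bus/pci/devices/0000:02:00.0")
# A healthy gen3 x16 link should show current_link_width == "16".
```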

If it is still of any value, dmidecode -t 2 gives:

# dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.
# SMBIOS implementations newer than version 3.1.1 are not
# fully supported by this version of dmidecode.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
        Manufacturer: Dell Inc.
        Product Name: 06JWJY
        Version: A00
        Serial Number: /5MR0C43/CNFCW0001K011H/
        Asset Tag: Not Specified
                Board is a hosting board
                Board is replaceable
        Location In Chassis: Not Specified
        Chassis Handle: 0x0003
        Type: Motherboard
        Contained Object Handles: 0

It looks to me like the GPU is simply installed in the wrong PCIe slot. While logical slot configuration is sometimes selectable in the BIOS, I have yet to encounter a BIOS setup that offers a 1x vs 16x configuration for the same physical slot.

I recall one instance where the slot configuration was “derated” because the GPU was not correctly seated in the PCIe slot. The problem was that this wasn’t apparent during a quick visual inspection. Removing the GPU and re-inserting it fixed that particular puzzling issue, which had cut PCIe throughput in half.

Re-seating the card made no difference, but moving it to a different slot fixed the problem.

Thanks for your help njuffa!