Why is GPU0 much slower than GPU1 (V100, Ubuntu 16.04)?

huzhongshan · October 13, 2018, 10:21pm

We have computers with two V100 cards installed. We found that GPU1 is much faster than GPU0 (about 2-5x) when running the same program on the same dataset. We use Ubuntu 16.04, CUDA 9.1, and cuDNN 7.

We have two computers, each with two V100 cards installed, and one computer with four 1080 Ti cards. The four-card machine works well.

Both V100 machines show GPU0 running much slower than GPU1.

njuffa · October 13, 2018, 10:58pm

From the application side, what guarantees proper load balancing between GPUs in a dual-GPU configuration?
When you swap the two GPUs in their PCIe slots, does the slowness stay with the slot or follow the GPU?
Are the PCIe slots of both GPUs configured as gen3 x16?
Has power supply and cooling been checked for the slow GPU?
Is this a system with multiple CPU sockets?
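
One way to answer the PCIe question above is to compare the negotiated link (LnkSta) against the link capability (LnkCap) in the output of `sudo lspci -vv`. The following is a minimal sketch, not a definitive tool: the function names are made up for illustration, and it assumes the usual `Speed ..., Width ...` wording on those two lines. A speed of 8GT/s corresponds to gen3.

```python
import re

def link_fields(lspci_text, tag):
    """Return (speed, width) from the first '<tag>:' line, e.g. tag='LnkSta'.

    The colon right after the tag keeps 'LnkSta' from matching 'LnkSta2'.
    """
    m = re.search(rf"{tag}:.*?Speed ([\d.]+GT/s).*?Width (x\d+)", lspci_text)
    return (m.group(1), m.group(2)) if m else None

def link_is_degraded(lspci_text):
    """True when the negotiated link differs from what the device can do."""
    cap = link_fields(lspci_text, "LnkCap")
    sta = link_fields(lspci_text, "LnkSta")
    if cap is None or sta is None:
        raise ValueError("LnkCap/LnkSta not found; run lspci with -vv as root")
    return cap != sta
```

A link that reports a lower speed while the GPU is idle can be ordinary power saving, so it is worth repeating the check while the GPU is under load before drawing conclusions.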

huzhongshan · October 14, 2018, 7:30am

We use the same program to run on GPU0 and GPU1.
Both GPU0 and GPU1 work, but GPU0 is much slower than GPU1.

We have two machines, both with two V100 cards installed, and they give the same result (i.e., GPU0 is much slower than GPU1).
Any suggestions?
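
The thread does not say how the program selects a GPU, so the following is only a sketch of one common way to make the comparison reproducible: pin each run to a single device with the `CUDA_VISIBLE_DEVICES` environment variable and time it. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes the CUDA device index follow PCI bus order, which is the order nvidia-smi lists GPUs in. The helper names are hypothetical.

```python
import os
import subprocess
import time

def env_for_gpu(index, base_env=None):
    """Copy the environment and expose only one GPU to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # index by PCI bus order
    env["CUDA_VISIBLE_DEVICES"] = str(index)  # child sees this GPU as device 0
    return env

def time_on_gpu(cmd, index):
    """Run cmd (a list of arguments) pinned to one GPU; return wall time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env_for_gpu(index), check=True)
    return time.perf_counter() - start
```

Calling `time_on_gpu(cmd, 0)` and `time_on_gpu(cmd, 1)` with the same command, one after the other rather than concurrently, gives two wall-clock numbers that can be compared directly.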

Robert_Crovella · October 14, 2018, 12:27pm

60C is quite hot for an idle GPU, especially when there is another active GPU in the server

Possibly a cooling issue. Did you install these V100 GPUs yourself? or did you purchase them pre-installed in an OEM server that is properly certified for Tesla V100 PCIE?

What server are these GPUs installed in?

You would be less likely to run into a corresponding cooling problem with 1080ti cards, as they provide their own cooling.

krzysg · October 15, 2018, 9:47am

Can you run:

nvidia-smi -q

sudo lspci -vv -s 00000000:02:00.0
sudo lspci -vv -s 00000000:03:00.0

to provide some more info.

It would also be good to find out what ‘slower’ means:

are memory transfers to/from GPUs slow, or
computations on GPU are slow, or
both

I know that from the application's point of view this might be hard to answer, but you can run bandwidthTest (from the CUDA samples) on both cards, and that would give some answers.

huzhongshan · October 15, 2018, 11:22pm

==============NVSMI LOG==============

Timestamp : Tue Oct 16 07:19:18 2018
Driver Version : 396.26

Attached GPUs : 2
GPU 00000000:02:00.0
Product Name : Tesla V100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321218058028
GPU UUID : GPU-3519976c-00d2-497b-8bd1-9711acddfc24
Minor Number : 0
VBIOS Version : 88.00.1A.00.03
MultiGPU Board : No
Board ID : 0x200
GPU Part Number : 900-2G500-0000-000
Inforom Version
Image Version : G500.0200.00.03
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x1DB410DE
Bus Id : 00000000:02:00.0
Sub System Id : 0x121410DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
FB Memory Usage
Total : 16160 MiB
Used : 11 MiB
Free : 16149 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 4 MiB
Free : 16380 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 59 C
GPU Shutdown Temp : 90 C
GPU Slowdown Temp : 87 C
GPU Max Operating Temp : 83 C
Memory Current Temp : 55 C
Memory Max Operating Temp : 85 C
Power Readings
Power Management : Supported
Power Draw : 30.71 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 100.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 135 MHz
SM : 135 MHz
Memory : 877 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1245 MHz
Memory : 877 MHz
Default Applications Clocks
Graphics : 1245 MHz
Memory : 877 MHz
Max Clocks
Graphics : 1380 MHz
SM : 1380 MHz
Memory : 877 MHz
Video : 1237 MHz
Max Customer Boost Clocks
Graphics : 1380 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

GPU 00000000:03:00.0
Product Name : Tesla V100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321218057997
GPU UUID : GPU-1538c1cb-6887-99c4-7acf-1239d4d61991
Minor Number : 1
VBIOS Version : 88.00.1A.00.03
MultiGPU Board : No
Board ID : 0x300
GPU Part Number : 900-2G500-0000-000
Inforom Version
Image Version : G500.0200.00.03
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0000
Device Id : 0x1DB410DE
Bus Id : 00000000:03:00.0
Sub System Id : 0x121410DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 178000 KB/s
Rx Throughput : 1467000 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Active
FB Memory Usage
Total : 16160 MiB
Used : 6615 MiB
Free : 9545 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 8 MiB
Free : 16376 MiB
Compute Mode : Default
Utilization
Gpu : 100 %
Memory : 9 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 82 C
GPU Shutdown Temp : 90 C
GPU Slowdown Temp : 87 C
GPU Max Operating Temp : 83 C
Memory Current Temp : 82 C
Memory Max Operating Temp : 85 C
Power Readings
Power Management : Supported
Power Draw : 56.14 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 100.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 367 MHz
SM : 367 MHz
Memory : 877 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1245 MHz
Memory : 877 MHz
Default Applications Clocks
Graphics : 1245 MHz
Memory : 877 MHz
Max Clocks
Graphics : 1380 MHz
SM : 1380 MHz
Memory : 877 MHz
Video : 1237 MHz
Max Customer Boost Clocks
Graphics : 1380 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 4715
Type : C
Name : /home/**********************
Used GPU Memory : 6604 MiB

=================================================================================================================
sudo lspci -vv -s 00000000:02:00.0

02:00.0 3D controller: NVIDIA Corporation Device 1db4 (rev a1)
Subsystem: NVIDIA Corporation Device 1214
Physical Slot: 13
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 80
Region 0: Memory at d2000000 (32-bit, non-prefetchable)
Region 1: Memory at 383800000000 (64-bit, prefetchable)
Region 3: Memory at 383c00000000 (64-bit, prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00558 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [250 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [258 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Capabilities: [ac0 v1] #23
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

03:00.0 3D controller: NVIDIA Corporation Device 1db4 (rev a1)
Subsystem: NVIDIA Corporation Device 1214
Physical Slot: 17
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 81
Region 0: Memory at d1000000 (32-bit, non-prefetchable)
Region 1: Memory at 383000000000 (64-bit, prefetchable)
Region 3: Memory at 383400000000 (64-bit, prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00578 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend+
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [250 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [258 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Capabilities: [ac0 v1] #23
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
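
A long `nvidia-smi -q` log like the one above is easier to compare per GPU once the throttle-related fields are pulled out. The sketch below is an illustration under stated assumptions: it expects the `Key : Value` layout shown in this post, keeps the first occurrence of each key per GPU (so `SM` is the current clock from the Clocks section, not the one under Max Clocks), and uses made-up function names.

```python
import re

# Fields worth comparing between GPUs; first occurrence per GPU wins.
FIELDS = ("GPU Current Temp", "SM", "Power Draw",
          "SW Thermal Slowdown", "HW Thermal Slowdown")

# A per-GPU section starts with a line such as "GPU 00000000:02:00.0".
GPU_HEADER = re.compile(
    r"GPU ([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$")

def summarize(log_text):
    """Map each GPU bus id to the selected fields found in its section."""
    gpus, current = {}, None
    for raw in log_text.splitlines():
        line = raw.strip()
        header = GPU_HEADER.match(line)
        if header:
            current = gpus.setdefault(header.group(1), {})
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in FIELDS and key not in current:
            current[key] = value
    return gpus

def thermally_throttled(summary):
    """Bus ids whose SW or HW thermal slowdown flag reads exactly 'Active'."""
    return [bus for bus, f in summary.items()
            if "Active" in (f.get("SW Thermal Slowdown"),
                            f.get("HW Thermal Slowdown"))]
```

Applied to the log in this post, this would list the GPU at 00000000:03:00.0, which was the busy one when the log was captured (82 C, SM at 367 MHz, SW Thermal Slowdown : Active), while the GPU at 00000000:02:00.0 was idle.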

huzhongshan · October 15, 2018, 11:26pm

Our company bought the machines pre-installed, and I am sure that the seller is certified for Tesla V100 PCIE. Do we need to open the machine to look at the V100's cooling fan? We are not hardware experts.

njuffa · October 15, 2018, 11:47pm

A V100 does not have its own fan. It is a passively cooled device that relies on adequate airflow being provided by the fans in the server enclosure.

Other than the unusually high temperature of GPU0 (59 deg C when completely idle, instead of about 40 deg C expected), I cannot spot anything unusual in the logs you posted.

So the hypothesis that GPU0 is slow because of thermal throttling still holds. You should be able to verify this by putting a load on GPU0 and then monitoring the nvidia-smi output for temperature and throttle reasons.
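
The monitoring step could be scripted roughly as follows while the load runs. This is a sketch under assumptions: it relies on the `--query-gpu` field names listed by `nvidia-smi --help-query-gpu` (newer drivers may rename the throttle fields), it assumes the flags print as `Active` or `Not Active`, and it does not handle `[N/A]` values. The function names are hypothetical.

```python
import subprocess
import time

QUERY = ("index,temperature.gpu,clocks.sm,power.draw,"
         "clocks_throttle_reasons.sw_thermal_slowdown,"
         "clocks_throttle_reasons.hw_thermal_slowdown")

def parse_row(row):
    """Turn one CSV row (noheader, nounits) into a dict."""
    idx, temp, sm, power, sw, hw = [c.strip() for c in row.split(",")]
    return {
        "index": int(idx),
        "temp_c": int(temp),
        "sm_mhz": int(sm),
        "power_w": float(power),
        # "Not Active" is not equal to "Active", so this is an exact match.
        "thermal_throttle": "Active" in (sw, hw),
    }

def sample():
    """Query all GPUs once and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return [parse_row(r) for r in out.splitlines() if r.strip()]

def watch(interval_s=2.0, samples=30):
    """Print a reading for every GPU at a fixed interval."""
    for _ in range(samples):
        for gpu in sample():
            print(gpu)
        time.sleep(interval_s)
```

If the hypothesis is right, the loaded GPU should show its temperature climbing toward the slowdown threshold, the SM clock dropping well below the application clock, and `thermal_throttle` turning true.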

You can then turn to your system integrator to have them resolve the issue. Presumably they will find a non-operating fan bank (maybe defective, maybe not plugged in) or a major obstruction of airflow past GPU0.

NVIDIA Tesla partner companies are listed here: https://www.nvidia.com/en-us/data-center/where-to-buy-tesla/
