Tesla V100 SW Thermal Slowdown active

Hi,

I have a Tesla V100 32 GB, but I am seeing significant performance problems. When I start running my code on the GPU, SW Thermal Slowdown is triggered and performance degrades progressively. Today, for the first time, I also observed HW Thermal Slowdown active. We have already added 2 fans. The nvidia-smi output is below:

==============NVSMI LOG==============

Timestamp : Wed Dec 2 12:36:33 2020
Driver Version : 450.80.02
CUDA Version : 11.0

Attached GPUs : 1
GPU 00000000:82:00.0
Product Name : Tesla V100-PCIE-32GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0423718076265
GPU UUID : GPU-ff856d5d-4dc5-82e8-35e5-bb93b5271646
Minor Number : 0
VBIOS Version : 88.00.48.00.02
MultiGPU Board : No
Board ID : 0x8200
GPU Part Number : 900-2G500-0110-030
Inforom Version
Image Version : G500.0202.00.02
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x82
Device : 0x00
Domain : 0x0000
Device Id : 0x1DB610DE
Bus Id : 00000000:82:00.0
Sub System Id : 0x124A10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 3000 KB/s
Rx Throughput : 1000 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Active
HW Thermal Slowdown : Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 32510 MiB
Used : 10905 MiB
Free : 21605 MiB
BAR1 Memory Usage
Total : 32768 MiB
Used : 8 MiB
Free : 32760 MiB
Compute Mode : Default
Utilization
Gpu : 100 %
Memory : 2 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 89 C
GPU Shutdown Temp : 90 C
GPU Slowdown Temp : 87 C
GPU Max Operating Temp : 83 C
Memory Current Temp : 91 C
Memory Max Operating Temp : 85 C
Power Readings
Power Management : Supported
Power Draw : 62.96 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 100.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 67 MHz
SM : 67 MHz
Memory : 877 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1230 MHz
Memory : 877 MHz
Default Applications Clocks
Graphics : 1230 MHz
Memory : 877 MHz
Max Clocks
Graphics : 1380 MHz
SM : 1380 MHz
Memory : 877 MHz
Video : 1237 MHz
Max Customer Boost Clocks
Graphics : 1380 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A

I would appreciate any help. Thank you in advance.

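This usually means your V100 is not being cooled properly. Your output shows the GPU at 89 C, above its 87 C slowdown threshold and only 1 C below the 90 C shutdown threshold, with the SM clock throttled to 67 MHz. You shouldn’t have to add any fans yourself: the V100 PCIe card is passively cooled (note "Fan Speed : N/A") and should be operated in a server that is designed for it. You are most likely underestimating the level of airflow that is needed to keep it properly cooled.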
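
If it helps to check this repeatedly while you experiment with airflow, here is a minimal sketch in Python that compares the temperature readings with the card's own thresholds. It assumes the text has the same "Name : value C" layout as the log posted above (for example, as printed by `nvidia-smi -q -d TEMPERATURE`). The names `SAMPLE`, `parse_temperatures`, and `thermal_report` are made up for this illustration and are not part of any NVIDIA tool.

```python
import re

# Temperature lines copied from the log in the question.
SAMPLE = """\
GPU Current Temp : 89 C
GPU Shutdown Temp : 90 C
GPU Slowdown Temp : 87 C
GPU Max Operating Temp : 83 C
Memory Current Temp : 91 C
Memory Max Operating Temp : 85 C
"""

def parse_temperatures(nvsmi_text):
    """Collect every 'Name : <integer> C' line into a {name: degrees} dict."""
    temps = {}
    for line in nvsmi_text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*(\d+)\s*C\s*$", line)
        if match:
            temps[match.group(1)] = int(match.group(2))
    return temps

def thermal_report(temps):
    """Compare current readings with the thresholds reported by the card itself."""
    findings = []
    gpu = temps["GPU Current Temp"]
    slowdown = temps["GPU Slowdown Temp"]
    if gpu >= slowdown:
        findings.append(
            f"GPU at {gpu} C is at or above the slowdown threshold ({slowdown} C)"
        )
    findings.append(f"GPU margin to shutdown: {temps['GPU Shutdown Temp'] - gpu} C")
    # Memory readings are optional: some cards report N/A for them.
    mem = temps.get("Memory Current Temp")
    mem_max = temps.get("Memory Max Operating Temp")
    if mem is not None and mem_max is not None and mem > mem_max:
        findings.append(
            f"Memory at {mem} C exceeds its max operating temperature ({mem_max} C)"
        )
    return findings

if __name__ == "__main__":
    for finding in thermal_report(parse_temperatures(SAMPLE)):
        print(finding)
```

With the values from the posted log, this reports that both the GPU and the memory are over their limits. If the same findings persist after a cooling change, the airflow is still insufficient.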