Interpreting nvidia-smi output

Hello,

We have a box that, as I was told, has 4x K80 GPUs. But the output of the nvidia-smi command tells a different story: it reports 8. Am I misinterpreting this?

nvidia-smi -L

GPU 0: Tesla K80 (UUID: GPU-4376cf29-89af-xxx…)
GPU 1: Tesla K80 (UUID: GPU-7b96af99-9d86-xxx…)
GPU 2: Tesla K80 (UUID: GPU-6166e2ed-a2d0-xxx…)
GPU 3: Tesla K80 (UUID: GPU-2969a997-8837-xxx…)
GPU 4: Tesla K80 (UUID: GPU-a1dda04e-1c02-xxx…)
GPU 5: Tesla K80 (UUID: GPU-179a403b-8529-xxx…)
GPU 6: Tesla K80 (UUID: GPU-33e731dd-fee2-xxx…)
GPU 7: Tesla K80 (UUID: GPU-e48856c6-ff13-xxx…)

The query command reports the same count, as shown below:

nvidia-smi -i 0 -q

==============NVSMI LOG==============

Timestamp : Mon Aug 22 13:36:18 2016
Driver Version : 352.93

Attached GPUs : 8
GPU 0000:06:00.0
Product Name : Tesla K80
Product Brand : Tesla
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920

Please advise, as I am confused. I don’t know of a rack-mount server that can pack 8x K80s into a 1U form factor, so I want to correct my interpretation.

regards,
Amit

A single K80 board carries 2 GPU devices. From a programming perspective they are treated as separate GPUs, and nvidia-smi reports each of them separately, so 4 K80 boards show up as 8 GPUs.
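To map the GPUs that nvidia-smi lists back to physical boards, one option is to group them by the "Board ID" field that newer drivers print in the `nvidia-smi -q` output (GPUs on the same multi-GPU board report the same Board ID). The following is a minimal sketch, not an official tool: the function name `group_gpus_by_board` is hypothetical, and it assumes the `-q` text contains a `GPU <bus id>` header line and a `Board ID : ...` line for each GPU.

```python
import re
from collections import defaultdict

# Matches a per-GPU header line such as "GPU 00000000:E3:00.0" or "GPU 0000:06:00.0".
GPU_HEADER = re.compile(r"GPU\s+((?:[0-9A-Fa-f]+:)+[0-9A-Fa-f]+\.[0-9A-Fa-f]+)")


def group_gpus_by_board(query_text):
    """Group GPU bus IDs found in `nvidia-smi -q` text by their Board ID.

    Assumption: every GPU section has a "Board ID : <value>" line.
    GPUs without that line are simply left out of the result.
    """
    boards = defaultdict(list)
    current_gpu = None
    for raw_line in query_text.splitlines():
        line = raw_line.strip()
        header = GPU_HEADER.fullmatch(line)
        if header:
            current_gpu = header.group(1)
            continue
        if current_gpu is not None and line.startswith("Board ID"):
            board_id = line.split(":", 1)[1].strip()
            boards[board_id].append(current_gpu)
    return dict(boards)


# Illustrative input only: the Board ID values below are made up.
sample = """
GPU 00000000:06:00.0
    MultiGPU Board : Yes
    Board ID : 0x300
GPU 00000000:07:00.0
    MultiGPU Board : Yes
    Board ID : 0x300
"""
boards = group_gpus_by_board(sample)
print(len(boards), "board(s):", boards)
```

On a real machine the text would come from running `nvidia-smi -q` (for example through `subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout`); the number of keys in the result is then the number of physical boards.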

Great, thank you for the clarification. I get it now!

Does this mean that this kind of device can be used for multi-GPU software development? Can the two GPUs be treated as completely independent devices on a single node? Thank you in advance!

Yes.
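As an illustration of treating the two GPUs as independent devices, a common pattern is to start one worker process per GPU and restrict each one to a single device through the `CUDA_VISIBLE_DEVICES` environment variable. This is only a sketch under assumptions: the helper name `per_gpu_environments` is hypothetical, and the worker program itself is left out.

```python
import os


def per_gpu_environments(gpu_indices, base_env=None):
    """Return one environment mapping per GPU, each exposing a single device.

    Inside a process started with such an environment, the one visible GPU
    is enumerated by CUDA as device 0.
    """
    base = dict(os.environ if base_env is None else base_env)
    envs = []
    for index in gpu_indices:
        env = dict(base)
        env["CUDA_VISIBLE_DEVICES"] = str(index)
        envs.append(env)
    return envs


# Two GPUs of one K80 board, as numbered by nvidia-smi -L.
for env in per_gpu_environments([0, 1], base_env={"PATH": "/usr/bin"}):
    # A real launcher would pass `env` to subprocess.Popen([...], env=env).
    print(env["CUDA_VISIBLE_DEVICES"])
```

The same effect can be had inside one process by selecting a device explicitly through the CUDA runtime; the environment-variable approach just keeps the workers unaware of each other.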

Hello.

We have the following problem.
We have a 2U server, a Supermicro AS-4124GS-TNR.
It has 8 Nvidia Tesla K80 cards installed.
OS: Ubuntu Server 20.04.2 LTS (Focal Fossa).
Drivers installed:
Driver Version: 460.73.01
CUDA Version: 11.2

Each individual card shows up in the system as two GPUs:

nvidia-smi -L

GPU 0: Tesla K80 (UUID: GPU-f3bba1d8-4bb6-7c01-051e-6bac32ccc8ae)
GPU 1: Tesla K80 (UUID: GPU-59c78925-d5cb-cda6-aead-dde736e7c66d)

nvidia-smi -i 0 -q

==============NVSMI LOG==============

Timestamp : Wed May 19 10:44:13 2021
Driver Version : 460.73.01
CUDA Version : 11.2

Attached GPUs : 2
GPU 00000000:E3:00.0
Product Name : Tesla K80
Product Brand : Tesla
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0320816054366
GPU UUID : GPU-f3bba1d8-4bb6-7c01-051e-6bac32ccc8ae
Minor Number : 0
VBIOS Version : 80.21.1F.00.07
MultiGPU Board : Yes
Board ID : 0xe100
GPU Part Number : 900-22080-6300-000
Inforom Version
Image Version : 2080.0200.00.04
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xE3
Device : 0x00
Domain : 0x0000
Device Id : 0x102D10DE
Bus Id : 00000000:E3:00.0
Sub System Id : 0x106C10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : N/A
HW Power Brake Slowdown : N/A
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11441 MiB
Used : 0 MiB
Free : 11441 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Texture Shared : N/A
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 19
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Texture Shared : N/A
CBU : N/A
Total : 19
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Texture Shared : N/A
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 1
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 53 C
GPU Shutdown Temp : 93 C
GPU Slowdown Temp : 88 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 58.19 W
Power Limit : 149.00 W
Default Power Limit : 149.00 W
Enforced Power Limit : 149.00 W
Min Power Limit : 100.00 W
Max Power Limit : 175.00 W
Clocks
Graphics : 562 MHz
SM : 562 MHz
Memory : 2505 MHz
Video : 540 MHz
Applications Clocks
Graphics : 562 MHz
Memory : 2505 MHz
Default Applications Clocks
Graphics : 562 MHz
Memory : 2505 MHz
Max Clocks
Graphics : 562 MHz
SM : 562 MHz
Memory : 2505 MHz
Video : 540 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes : None

In addition, the nvidia-smi output shows that one of the GPUs is idle (0% utilization) while the second is at 90% utilization, even though no services are running on the server that could be using the graphics cards.

nvidia-smi
Wed May 19 10:45:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:E3:00.0 Off |                    0 |
| N/A   53C    P0    58W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:E4:00.0 Off |                    0 |
| N/A   39C    P0    71W / 149W |      0MiB / 11441MiB |     90%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

What is the cause of this problem, and what do you recommend we do?
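One way to cross-check a reading like this in a script is to query the same values in CSV form with `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits` and flag GPUs that report high utilization while holding no memory. The sketch below only parses such output and does not diagnose the cause; the function names are hypothetical, the threshold is an arbitrary assumption, and it assumes every field is numeric (a value such as `[N/A]` would need extra handling).

```python
import csv
import io


def parse_gpu_csv(csv_text):
    """Parse "index, utilization.gpu, memory.used" rows into dictionaries."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        index, util, mem = (field.strip() for field in fields)
        rows.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(mem),
        })
    return rows


def busy_without_memory(rows, util_threshold=50):
    """Return indices of GPUs at or above the threshold that use 0 MiB."""
    return [
        row["index"]
        for row in rows
        if row["util_pct"] >= util_threshold and row["mem_used_mib"] == 0
    ]


# Values taken from the nvidia-smi table above: GPU 0 at 0%, GPU 1 at 90%, both 0 MiB.
sample = "0, 0, 0\n1, 90, 0\n"
print(busy_without_memory(parse_gpu_csv(sample)))
```

Running the query several times a few seconds apart would show whether the reported utilization is steady or only appears at the moment of the query.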