Process hangs at ioctl on /dev/nvidiactl

The process (pid 53664) hangs at:
(gdb) bt
#0 0x0000ffff522a10fc in ?? () from target:/usr/lib/aarch64-linux-gnu/libcuda.so.1
#1 0x0000ffff522a1aec in ?? () from target:/usr/lib/aarch64-linux-gnu/libcuda.so.1
#2 0x0000ffff51fbc170 in ?? () from target:/usr/lib/aarch64-linux-gnu/libcuda.so.1
#3 0x0000ffff5208daf4 in ?? () from target:/usr/lib/aarch64-linux-gnu/libcuda.so.1
#4 0x0000ffff5227d6e8 in ?? () from target:/usr/lib/aarch64-linux-gnu/libcuda.so.1
#5 0x0000ffff52100568 in ?? () from target:/usr/lib/aarch64-linux-gnu/libcuda.so.1
#6 0x0000ffff49d51b6c in __cudart648
#7 0x0000ffff49da228c in cudaStreamSynchronize ()

Meanwhile, strace for this pid shows the same ioctl repeating about once per second:
strace: Process 53664 attached
16:05:36.067125 ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xfffdeb1fb910) = 0 <0.000291>
16:05:37.067639 ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xfffdeb1fb910) = 0 <0.000280>
16:05:38.068065 ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xfffdeb1fb910) = 0 <0.000266>
16:05:39.068511 ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xfffdeb1fb910) = 0 <0.000265>
16:05:40.068916 ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xfffdeb1fb910) = 0 <0.000286>
16:05:41.069369 ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xfffdeb1fb910) = 0 <0.000268>
16:05:42.069777 ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xfffdeb1fb910) = 0 <0.000339>

fd 11 points to /dev/nvidiactl:
lrwx------ 1 root root 64 Oct 17 16:07 11 -> /dev/nvidiactl
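
For anyone reading the trace, here is a minimal Python sketch that parses a few of the strace lines above, measures the gap between consecutive calls, and packs the `_IOC(...)` fields into the raw request number. The helper names (`parse`, `intervals`, `encode_ioc`) are my own, not part of strace or any NVIDIA tool, and the bit layout assumes the asm-generic ioctl encoding (8-bit nr, 8-bit type, 14-bit size, 2-bit direction), which I believe is what aarch64 Linux uses.

```python
import re
from datetime import datetime

# Three lines copied from the strace output in this post.
TRACE = """\
16:05:36.067125 ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xfffdeb1fb910) = 0 <0.000291>
16:05:37.067639 ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xfffdeb1fb910) = 0 <0.000280>
16:05:38.068065 ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0xfffdeb1fb910) = 0 <0.000266>
"""

LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6}) ioctl\((?P<fd>\d+), "
    r"_IOC\((?P<dir>[A-Z_|]+), (?P<type>0x[0-9a-f]+), "
    r"(?P<nr>0x[0-9a-f]+), (?P<size>0x[0-9a-f]+)\), "
    r"(?P<arg>0x[0-9a-f]+)\) = (?P<ret>-?\d+) <(?P<dur>[\d.]+)>$"
)

# Direction bits in the asm-generic layout (assumption, see text above).
IOC_WRITE = 1
IOC_READ = 2


def parse(trace):
    """Turn matching strace lines into dicts; non-matching lines are skipped."""
    calls = []
    for line in trace.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        calls.append({
            "ts": datetime.strptime(m["ts"], "%H:%M:%S.%f"),
            "fd": int(m["fd"]),
            "type": int(m["type"], 16),
            "nr": int(m["nr"], 16),
            "size": int(m["size"], 16),
            "ret": int(m["ret"]),
            "dur": float(m["dur"]),
        })
    return calls


def intervals(calls):
    """Seconds between the start of each call and the next one."""
    return [(b["ts"] - a["ts"]).total_seconds() for a, b in zip(calls, calls[1:])]


def encode_ioc(direction, ioc_type, nr, size):
    """Pack _IOC fields into a request number (asm-generic layout assumed)."""
    return (direction << 30) | (size << 16) | (ioc_type << 8) | nr


if __name__ == "__main__":
    calls = parse(TRACE)
    print("gaps (s):", intervals(calls))
    first = calls[0]
    request = encode_ioc(IOC_READ | IOC_WRITE, first["type"], first["nr"], first["size"])
    print("request:", hex(request))
```

Each call returns 0 in well under a millisecond and the gaps are almost exactly one second, so the thread does not look blocked inside the kernel; it looks like libcuda is polling on a timer while waiting for something that never completes.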

Any idea what might cause this to happen?

==============NVSMI LOG==============

Timestamp : Mon Oct 17 16:00:48 2022
Driver Version : 515.65.01
CUDA Version : 11.7

Attached GPUs : 2
GPU 00000000:01:00.0
Product Name : Tesla T4
Product Brand : NVIDIA
Product Architecture : Turing
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1564519008070
GPU UUID : GPU-d903d478-84b7-8816-8e12-f37e85841e51
Minor Number : 0
VBIOS Version : 90.04.38.00.03
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : 900-2G183-0000-001
Module ID : 0
Inforom Version
Image Version : G183.0200.00.02
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 515.65.01
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1EB810DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x12A210DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 2987000 KB/s
Rx Throughput : 461000 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 15360 MiB
Reserved : 388 MiB
Used : 12748 MiB
Free : 2223 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 86 MiB
Free : 170 MiB
Compute Mode : Default
Utilization
Gpu : 98 %
Memory : 51 %
Encoder : 33 %
Decoder : 37 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 58 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 73.84 W
Power Limit : 70.00 W
Default Power Limit : 70.00 W
Enforced Power Limit : 70.00 W
Min Power Limit : 60.00 W
Max Power Limit : 70.00 W
Clocks
Graphics : 1275 MHz
SM : 1275 MHz
Memory : 5000 MHz
Video : 1185 MHz
Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Default Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Max Clocks
Graphics : 1590 MHz
SM : 1590 MHz
Memory : 5001 MHz
Video : 1470 MHz
Max Customer Boost Clocks
Graphics : 1590 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A

  • stack corruption in host code
  • a defect of some sort in CUDA
  • mismatched or otherwise broken CUDA or GPU driver install
  • a hung kernel (a kernel that runs forever)
  • overheated or otherwise broken or defective GPU

I'm sure there are other possibilities.
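
The overheating item can be partly checked against the nvidia-smi log posted in the question. Below is a rough Python sketch that pulls a handful of fields out of `nvidia-smi -q` style text and compares them. The function names are hypothetical, the sample text is copied from the posted log, and the parser simply keeps the first occurrence of each key, so on a multi-GPU dump it only reflects the first GPU.

```python
# Fields copied from the nvidia-smi log in the question.
SAMPLE = """\
SW Power Cap : Active
HW Thermal Slowdown : Not Active
GPU Current Temp : 58 C
GPU Slowdown Temp : 93 C
Power Draw : 73.84 W
Power Limit : 70.00 W
"""


def parse_fields(text):
    """Map 'key : value' lines to a dict, keeping the first value per key."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields.setdefault(key.strip(), value.strip())
    return fields


def number(value):
    """Leading numeric part of values such as '58 C' or '73.84 W'."""
    return float(value.split()[0])


def summarize(fields):
    """Derive a few health indicators from the parsed fields."""
    return {
        "thermal_headroom_c": number(fields["GPU Slowdown Temp"])
        - number(fields["GPU Current Temp"]),
        "over_power_limit": number(fields["Power Draw"])
        > number(fields["Power Limit"]),
        "sw_power_cap_active": fields["SW Power Cap"] == "Active",
        "hw_thermal_slowdown_active": fields["HW Thermal Slowdown"] == "Active",
    }


if __name__ == "__main__":
    for name, value in summarize(parse_fields(SAMPLE)).items():
        print(f"{name}: {value}")
```

With the posted values the GPU sits well below its slowdown temperature and no thermal slowdown is flagged, so overheating looks unlikely from this snapshot. The power draw is slightly above the 70 W limit with the software power cap active, which suggests the card is under heavy load. That is only a reading of one log sample and does not rule out the other items in the list.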
