OS: Ubuntu 22.04
GPU: Tesla T4
Server: DELL PowerEdge R750
Driver version: 550.90.07
CUDA: 12.4
nvidia-bug-report: nvidia-bug-report.log.gz (1.4 MB)
Hi, I have a problem running a DeepStream application on T4 GPUs.
- DeepStream 6.3, Docker image based on nvcr.io/nvidia/deepstream:6.3-samples
The server has four T4 GPUs, and we run four instances of the same application, one per GPU.
While the applications are running, we get the following error and the application stops:
Unable to determine the device handle for GPU0000:98:00.0: Unknown Error
If I reboot the server, all GPUs are detected again and work fine, but eventually the same thing happens.
This is the syslog from when the GPU's device handle could not be determined:
Aug 28 18:40:31 svan4 kernel: [13048.824585] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
Aug 28 18:40:31 svan4 kernel: [13048.824747] {1}[Hardware Error]: event severity: recoverable
Aug 28 18:40:31 svan4 kernel: [13048.824800] {1}[Hardware Error]: Error 0, type: fatal
Aug 28 18:40:31 svan4 kernel: [13048.824863] {1}[Hardware Error]: section_type: PCIe error
Aug 28 18:40:31 svan4 kernel: [13048.824923] {1}[Hardware Error]: port_type: 0, PCIe end point
Aug 28 18:40:31 svan4 kernel: [13048.824972] {1}[Hardware Error]: version: 3.0
Aug 28 18:40:31 svan4 kernel: [13048.825014] {1}[Hardware Error]: command: 0x0407, status: 0x0010
Aug 28 18:40:31 svan4 kernel: [13048.825087] {1}[Hardware Error]: device_id: 0000:98:00.0
Aug 28 18:40:31 svan4 kernel: [13048.825170] {1}[Hardware Error]: slot: 6
Aug 28 18:40:31 svan4 kernel: [13048.825221] {1}[Hardware Error]: secondary_bus: 0x00
Aug 28 18:40:31 svan4 kernel: [13048.825291] {1}[Hardware Error]: vendor_id: 0x10de, device_id: 0x1eb8
Aug 28 18:40:31 svan4 kernel: [13048.825356] {1}[Hardware Error]: class_code: 030200
Aug 28 18:40:31 svan4 kernel: [13048.825409] {1}[Hardware Error]: aer_uncor_status: 0x00004000, aer_uncor_mask: 0x00010000
Aug 28 18:40:31 svan4 kernel: [13048.825504] {1}[Hardware Error]: aer_uncor_severity: 0x0046f030
Aug 28 18:40:31 svan4 kernel: [13048.825581] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
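In case it helps with reading the log, here is a minimal sketch that decodes the aer_uncor_status, aer_uncor_mask and aer_uncor_severity values above. It assumes the standard PCIe AER uncorrectable-error bit layout (the same one the Linux kernel exposes as the PCI_ERR_UNC_* constants). The names AER_UNCOR_BITS and decode_aer_uncor are just illustrative helpers, not part of any NVIDIA or kernel tool.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
# The mask and severity registers use the same bit layout.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncor(value):
    """Return the names of all known bits set in an AER uncorrectable register value."""
    return [name for bit, name in sorted(AER_UNCOR_BITS.items()) if value & (1 << bit)]


# Values copied from the syslog lines above.
status = 0x00004000
mask = 0x00010000
severity = 0x0046F030

print("reported:", decode_aer_uncor(status))
print("masked:", decode_aer_uncor(mask))
print("treated as fatal if raised:", decode_aer_uncor(severity))
print("reported error is classed as fatal:", bool(status & severity))
```

With the values from this log, the only reported bit is Completion Timeout, and that bit is also set in the severity register, which would be consistent with the "type: fatal" line. This only decodes the registers; it does not say whether the cause is the GPU, the slot, or something else.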
I’ve attached the bug report above.
Thanks in advance :)