Unable to determine the device handle for GPU0000:98:00.0: Unknown Error while running a DeepStream application

OS: Ubuntu 22.04
GPU: Tesla T4
Server: DELL PowerEdge R750
Driver version: 550.90.07
CUDA: 12.4

nvidia-bug-report
nvidia-bug-report.log.gz (1.4 MB)

Hi, I have a problem running a DeepStream application on Tesla T4 GPUs.

The server has four T4 GPUs, and we run four instances of the same application, one per GPU.
While the applications are running, we get the following error and the application stops.

Unable to determine the device handle for GPU0000:98:00.0: Unknown Error

If I reboot the server, all GPUs are detected and work fine, but eventually the same error occurs again.
This is the syslog from the moment the GPU can no longer be detected:

Aug 28 18:40:31 svan4 kernel: [13048.824585] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
Aug 28 18:40:31 svan4 kernel: [13048.824747] {1}[Hardware Error]: event severity: recoverable
Aug 28 18:40:31 svan4 kernel: [13048.824800] {1}[Hardware Error]:  Error 0, type: fatal
Aug 28 18:40:31 svan4 kernel: [13048.824863] {1}[Hardware Error]:   section_type: PCIe error
Aug 28 18:40:31 svan4 kernel: [13048.824923] {1}[Hardware Error]:   port_type: 0, PCIe end point
Aug 28 18:40:31 svan4 kernel: [13048.824972] {1}[Hardware Error]:   version: 3.0
Aug 28 18:40:31 svan4 kernel: [13048.825014] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
Aug 28 18:40:31 svan4 kernel: [13048.825087] {1}[Hardware Error]:   device_id: 0000:98:00.0
Aug 28 18:40:31 svan4 kernel: [13048.825170] {1}[Hardware Error]:   slot: 6
Aug 28 18:40:31 svan4 kernel: [13048.825221] {1}[Hardware Error]:   secondary_bus: 0x00
Aug 28 18:40:31 svan4 kernel: [13048.825291] {1}[Hardware Error]:   vendor_id: 0x10de, device_id: 0x1eb8
Aug 28 18:40:31 svan4 kernel: [13048.825356] {1}[Hardware Error]:   class_code: 030200
Aug 28 18:40:31 svan4 kernel: [13048.825409] {1}[Hardware Error]:   aer_uncor_status: 0x00004000, aer_uncor_mask: 0x00010000
Aug 28 18:40:31 svan4 kernel: [13048.825504] {1}[Hardware Error]:   aer_uncor_severity: 0x0046f030
Aug 28 18:40:31 svan4 kernel: [13048.825581] {1}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
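
To make the AER fields in the log above easier to read, here is a small sketch that decodes the three register values into error names. The bit names are taken from the PCI_ERR_UNC_* definitions in the Linux kernel's pci_regs.h as I understand them; the helper name decode_aer_uncor is my own and not part of any NVIDIA or kernel tool.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status/Mask/Severity
# registers, following the PCI_ERR_UNC_* names in Linux's pci_regs.h.
AER_UNCOR_BITS = {
    0: "Undefined",
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncor(value):
    """Return the names of all bits set in an AER uncorrectable register value.

    Hypothetical helper for illustration only.
    """
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(AER_UNCOR_BITS.get(bit, f"bit {bit} (not in table)"))
    return names


# Values copied from the syslog excerpt in this post.
status = 0x00004000    # aer_uncor_status
mask = 0x00010000      # aer_uncor_mask
severity = 0x0046F030  # aer_uncor_severity

print("reported:", decode_aer_uncor(status))
print("masked:", decode_aer_uncor(mask))
# A reported error is fatal when its bit is also set in the severity register.
print("fatal among reported:", decode_aer_uncor(status & severity))
```

If I am reading this correctly, the reported error is a Completion Timeout on the GPU endpoint at 0000:98:00.0, and the severity register marks it as fatal, which matches the "type: fatal" line in the log.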

I’ve attached the bug report above.
Thanks in advance :)