GPU crash in colder temperatures

Hi,

We are currently encountering an issue with a Jetson Nano that has two Raspberry Pi Camera Module v2 cameras connected. When running deepstream-app in an environment with temperatures between 0 and 10 °C, we get the following errors:

Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:06 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
...

That is almost 1000 messages in the same second, and:

Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 639 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 33 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 39 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 34 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 36 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 40 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 35 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 36 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 38 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 45 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 44 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 33 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 40 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 36 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 33 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 40 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 42 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 120 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 39 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 37 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 32 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 30 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 36 kernel messages
...
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 36 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 64 kernel messages
Jan 05 22:02:05 5498c6a kernel: host1x 50000000.host1x: syncpt_thresh_cascade_isr(): syncpoint id 0 incremented
Jan 05 22:02:05 5498c6a systemd-journald[2376]: Missed 622 kernel messages

That is more than 2000 messages in the previous second.
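To put numbers on these bursts rather than estimating them, the per-second counts can be pulled out of the journal. The following is only a rough sketch: the function names `count_per_second` and `missed_per_second` are made up for illustration, and it assumes the short journal format shown above ("Mon DD HH:MM:SS host ...").

```shell
#!/bin/sh
# Hypothetical helpers, not part of any NVIDIA or systemd tool.
# Both read journal lines in the short format from stdin.

# Count the lines containing a fixed string, grouped by timestamp (1 s resolution).
count_per_second() {
    pattern=$1
    grep -F -- "$pattern" |
        awk '{ n[$1 " " $2 " " $3]++ } END { for (t in n) print t, n[t] }' |
        sort
}

# Sum the "Missed N kernel messages" counters reported by systemd-journald,
# grouped by timestamp. N is the 7th whitespace-separated field of such a line.
missed_per_second() {
    awk '/Missed [0-9]+ kernel messages/ { n[$1 " " $2 " " $3] += $7 }
         END { for (t in n) print t, n[t] }' |
        sort
}
```

Possible usage on the device: `journalctl -k -o short --no-pager | count_per_second 'syncpt_thresh_cascade_isr'` for the kernel messages, and `journalctl -o short --no-pager | missed_per_second` for the dropped ones (without `-k`, because the "Missed" lines come from systemd-journald and not from the kernel).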

And the following error:

06.01.21 15:08:36 (+0100)  deepstream  SCF: Error Timeout: ISP port 0 timed out! (in src/services/capture/NvIspHw.cpp, function waitIspFrameEnd(), line 478)
06.01.21 15:08:36 (+0100)  deepstream  SCF: Error Timeout:  (propagating from src/services/capture/NvIspHw.cpp, function waitIspFrameEnd(), line 519)
06.01.21 15:08:36 (+0100)  deepstream  SCF: Error Timeout:  (propagating from src/common/Utils.cpp, function workerThread(), line 116)
06.01.21 15:08:36 (+0100)  deepstream  SCF: Error Timeout: Worker thread IspHw frameComplete failed (in src/common/Utils.cpp, function workerThread(), line 133)
06.01.21 15:08:36 (+0100)  deepstream  Error: waitCsiFrameStart timeout guid 1
06.01.21 15:08:36 (+0100)  deepstream  ************VI/CSI Debug Registers**********
06.01.21 15:08:36 (+0100)  deepstream  SCF: Error Timeout: ISP Stats timed out! (in src/services/capture/NvIspHw.cpp, function waitIspStatsFinished(), line 561)
06.01.21 15:08:36 (+0100)  deepstream  VI_CFG_INTERRUPT_MASK_0 = 0x00000000
06.01.21 15:08:36 (+0100)  deepstream  VI_CFG_INTERRUPT_STATUS_0 = 0x00000000
06.01.21 15:08:36 (+0100)  deepstream  VI_CSI_0_ERROR_STATUS_0 = 0x00000001
06.01.21 15:08:36 (+0100)  deepstream  VI_CSI_0_ERROR_INT_MASK_0 = 0x0000001f

Reboots happened between these errors, but they all occurred on the same hardware. We have other devices running the same software without this issue. The issue seems to happen more frequently in colder temperatures; above 10 °C we rarely encountered it.
Does anybody have an idea what could be wrong and how to debug this issue?
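One way to check the temperature correlation would be to log the on-board thermal zones next to the error timestamps. This is a minimal sketch, assuming the standard Linux sysfs layout (`thermal_zone*/type` and `thermal_zone*/temp`, the latter in millidegrees Celsius); the function name `read_thermal_zones` is hypothetical. Note that these zones report die and board sensors, not the ambient temperature.

```shell
#!/bin/sh
# Hypothetical helper: print "<zone type> <temperature in °C>" for each thermal zone.
# The sysfs base directory is a parameter so that it can be pointed elsewhere;
# /sys/class/thermal is assumed as the default location.
read_thermal_zones() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        # Skip zones that are missing or cannot be read.
        [ -r "$zone/temp" ] || continue
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        zone_type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        # Convert millidegrees to degrees; LC_ALL=C keeps "." as the decimal separator.
        LC_ALL=C awk -v t="$zone_type" -v m="$milli" \
            'BEGIN { printf "%s %.1f\n", t, m / 1000 }'
    done
}
```

Calling it once per second in a loop, prefixed with `date '+%F %T'`, and redirecting the output to a file would give a log that can be lined up with the journal timestamps.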

Kind regards,
Clint

Hi,
From the log, it looks like an issue with sensor stability. Please run the pipeline below and check whether the issue still occurs:

$ gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM),width=1920,height=1080' ! fakesink nvarguscamerasrc sensor-id=1 ! 'video/x-raw(memory:NVMM),width=1920,height=1080' ! fakesink

Thanks, we’re going to test the device at low temperature now. At room temperature this pipeline runs.

But in our application we use the sensor in camera mode 0, which means a resolution of 3264x2464. If I change the resolution in your command, the pipeline doesn’t run. I’m getting the following error:

jetson@jetson:~$ gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM),width=3264,height=2464' ! fakesink nvarguscamerasrc sensor-id=1 ! 'video/x-raw(memory:NVMM),width=3264,height=2464' ! fakesink
Setting pipeline to PAUSED ...
Pipeline is live and does not need PREROLL ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
GST_ARGUS: Creating output stream
GST_ARGUS: Creating output stream
CONSUMER: Waiting until producer is connected...
CONSUMER: Waiting until producer is connected...
GST_ARGUS: Available Sensor modes :
GST_ARGUS: Available Sensor modes :
GST_ARGUS: 3264 x 2464 FR = 21,000000 fps Duration = 47619048 ; Analog Gain range min 1,000000, max 10,625000; Exposure Range min 13000, max 683709000;

GST_ARGUS: 3264 x 2464 FR = 21,000000 fps Duration = 47619048 ; Analog Gain range min 1,000000, max 10,625000; Exposure Range min 13000, max 683709000;

GST_ARGUS: 3264 x 1848 FR = 28,000001 fps Duration = 35714284 ; Analog Gain range min 1,000000, max 10,625000; Exposure Range min 13000, max 683709000;

GST_ARGUS: 1920 x 1080 FR = 29,999999 fps Duration = 33333334 ; Analog Gain range min 1,000000, max 10,625000; Exposure Range min 13000, max 683709000;

GST_ARGUS: 1280 x 720 FR = 59,999999 fps Duration = 16666667 ; Analog Gain range min 1,000000, max 10,625000; Exposure Range min 13000, max 683709000;

GST_ARGUS: 3264 x 1848 FR = 28,000001 fps Duration = 35714284 ; Analog Gain range min 1,000000, max 10,625000; Exposure Range min 13000, max 683709000;

GST_ARGUS: 1280 x 720 FR = 120,000005 fps Duration = 8333333 ; Analog Gain range min 1,000000, max 10,625000; Exposure Range min 13000, max 683709000;

GST_ARGUS: 1920 x 1080 FR = 29,999999 fps Duration = 33333334 ; Analog Gain range min 1,000000, max 10,625000; Exposure Range min 13000, max 683709000;

ARGUS_ERROR: Error generated. /dvs/git/dirty/git-master_linux/multimedia/nvgstreamer/gst-nvarguscamera/gstnvarguscamerasrc.cpp, execute: 679 Frame Rate specified is greater than supported
GST_ARGUS: Running with following settings:
   Camera index = 0 
   Camera mode  = 0 
   Output Stream W = 3264 H = 2464 
   seconds to Run    = 0 
   Frame Rate = 21,000000 
GST_ARGUS: Setup Complete, Starting captures for 0 seconds
GST_ARGUS: Starting repeat capture requests.
GST_ARGUS: 1280 x 720 FR = 59,999999 fps Duration = 16666667 ; Analog Gain range min 1,000000, max 10,625000; Exposure Range min 13000, max 683709000;

GST_ARGUS: 1280 x 720 FR = 120,000005 fps Duration = 8333333 ; Analog Gain range min 1,000000, max 10,625000; Exposure Range min 13000, max 683709000;

ARGUS_ERROR: Error generated. /dvs/git/dirty/git-master_linux/multimedia/nvgstreamer/gst-nvarguscamera/gstnvarguscamerasrc.cpp, execute: 679 Frame Rate specified is greater than supported
GST_ARGUS: Running with following settings:
   Camera index = 1 
   Camera mode  = 0 
   Output Stream W = 3264 H = 2464 
   seconds to Run    = 0 
   Frame Rate = 21,000000 
GST_ARGUS: Setup Complete, Starting captures for 0 seconds
GST_ARGUS: Starting repeat capture requests.
CONSUMER: Producer has connected; continuing.
ARGUS_ERROR: Error generated. /dvs/git/dirty/git-master_linux/multimedia/nvgstreamer/gst-nvarguscamera/gstnvarguscamerasrc.cpp, execute: 899 InvalidState.
GST_ARGUS: Cleaning up
CONSUMER: Producer has connected; continuing.
ARGUS_ERROR: Error generated. /dvs/git/dirty/git-master_linux/multimedia/nvgstreamer/gst-nvarguscamera/gstnvarguscamerasrc.cpp, execute: 899 InvalidState.
GST_ARGUS: Cleaning up
Got EOS from element "pipeline0".
Execution ended after 0:00:01.161176120
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...

I had to add the framerate to the caps, since the 3264x2464 mode only supports up to 21 fps. My bad.

gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM),width=3264,height=2464,framerate=21/1' ! fakesink nvarguscamerasrc sensor-id=1 ! 'video/x-raw(memory:NVMM),width=3264,height=2464,framerate=21/1' ! fakesink

The pipeline runs at room temperature. I will check in colder temperatures now.
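Since the width, height and framerate have to match one of the sensor modes listed by GST_ARGUS, a small helper makes it easier to switch between modes while testing. This is just a sketch; `build_pipeline` is a made-up name, and it only prints the same kind of command line as used above, with one nvarguscamerasrc branch per sensor id.

```shell
#!/bin/sh
# Hypothetical helper: print a gst-launch-1.0 command line for the given mode.
# Arguments: width height fps sensor-id [sensor-id ...]
build_pipeline() {
    width=$1
    height=$2
    fps=$3
    shift 3
    cmd="gst-launch-1.0"
    for id in "$@"; do
        # One capture branch per sensor, each ending in its own fakesink.
        cmd="$cmd nvarguscamerasrc sensor-id=$id ! 'video/x-raw(memory:NVMM),width=$width,height=$height,framerate=$fps/1' ! fakesink"
    done
    printf '%s\n' "$cmd"
}
```

For example, `build_pipeline 3264 2464 21 0 1` prints the command used above, and `eval "$(build_pipeline 3264 2464 21 0 1)"` would run it. The values 3264 2464 21 correspond to the first mode in the GST_ARGUS list; 1920 1080 30 would be the mode from the first suggested test.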