Orin fails with message nvgpu: 17000000.ga10b ga10b_pbdma_handle_intr_0_legacy:437 [ERR] semaphore acquire timeout!

Hi,
This error is rare; I have seen it only a few times over several months of testing. It happened twice in the last 2 days, both times at night.
It happened with OS 35.3.1 and earlier.
In the few cases where I managed to obtain a syslog, I found a correlation between the message

systemd-timesyncd[376]: Initial synchronization to time server 104.171.113.34:123 (1.pool.ntp.org).
which is followed by
kernel: [   46.450236] nvgpu: 17000000.ga10b  ga10b_pbdma_handle_intr_0_legacy:437  [ERR]  semaphore acquire timeout!
kernel: [   46.460468] __ga10b__ Channel Status - chip ga10b
kernel: [   46.460470] __ga10b__ ---------------------------
kernel: [   46.465329] __ga10b__ 420-ga10b, TSG: 26, pid 2879, refs: 2, deterministic: no, domain name: (default)
kernel: [   46.470176] __ga10b__ channel status:  in use idle not busy
After that, __ga10b__ errors are printed non-stop and the Orin requires a reboot.
However, the message "Initial synchronization to time server" appears in good boots as well; in the bad cases it always precedes the __ga10b__ error.
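
To check this correlation across several saved syslogs, a small script along these lines could help. It is only a rough sketch: the two patterns are copied from the log lines above, while the function name `find_sync_before_timeout` and the `window` of 50 lines are my own assumptions, not anything from nvgpu or systemd.

```python
import re

# Patterns copied from the log excerpts in this post.
SYNC_RE = re.compile(
    r"systemd-timesyncd\[\d+\]: Initial synchronization to time server"
)
ERR_RE = re.compile(r"nvgpu: .*semaphore acquire timeout")


def find_sync_before_timeout(lines, window=50):
    """Return (sync_line_no, error_line_no) pairs.

    A pair is reported when a semaphore acquire timeout appears at most
    `window` lines after the most recent timesyncd synchronization line.
    `window` is an arbitrary assumption; adjust it to the log density.
    """
    pairs = []
    last_sync = None
    for no, line in enumerate(lines, start=1):
        if SYNC_RE.search(line):
            last_sync = no
        elif ERR_RE.search(line) and last_sync is not None:
            if no - last_sync <= window:
                pairs.append((last_sync, no))
            # Only the first timeout after a sync is of interest here.
            last_sync = None
    return pairs


# Example usage (the file name is hypothetical):
# with open("syslog", errors="replace") as f:
#     print(find_sync_before_timeout(f))
```

An empty result for a bad boot would argue against the time-sync theory; a non-empty one only shows ordering, not cause.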

What can it be? Is it possible that “semaphore acquire timeout” is somehow caused by a system time change?
I see that other people reported this error after system resume, but I never performed any suspend or resume, only a reboot using the “reboot” command.

Thank you

Hi,

We may need the full error log. Next time, you could capture the log over the serial console on the micro USB port; see NVIDIA Jetson Orin - Serial Console - RidgeRun Developer Connection.

Please also try to find a way to reproduce the issue.

Reproducing it again with serial log connected will take time.
Meanwhile, I have attached the syslog from the last event.
orin2_syslog_semaphore_acquire_timeout_2023_06_16_to_send.txt (7.2 MB)

Can you provide a serial log and the steps to reproduce the issue?

I gave you the syslog with the error, which appears to contain all the information a serial log would (I do not have a serial log yet).
Maybe you can point me to the place in the kernel sources where this error comes from, so I can add more logging there?
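
In the meantime, one way to look for it myself would be to search a local copy of the sources for the function name printed in the log. This is only a sketch: the name `ga10b_pbdma_handle_intr_0_legacy` is taken from the error message, while the helper `find_symbol` and the source directory are hypothetical, since I do not know where the nvgpu driver lives in the tree. I also assume the 437 after the function name is a source line number.

```python
import os


def find_symbol(root, symbol="ga10b_pbdma_handle_intr_0_legacy",
                exts=(".c", ".h")):
    """Walk `root` and return (path, line_no, text) for each line
    in a C source or header file that mentions `symbol`."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" avoids failing on odd encodings.
            with open(path, errors="replace") as f:
                for no, line in enumerate(f, start=1):
                    if symbol in line:
                        hits.append((path, no, line.strip()))
    return hits


# Example usage (the directory is a placeholder for the extracted sources):
# for path, no, text in find_symbol("/path/to/kernel/sources"):
#     print(f"{path}:{no}: {text}")
```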