AGX Orin: CAN bus does not recover from BUS_OFF state

I know there are already a number of threads about this issue, but none of them seems to be solved, so I have to post it again.

The problem:
If the CAN bus enters the BUS_OFF state, the interface remains non-operational and writes fail with
write: No buffer space available
Hardware: AGX Orin Dev Kit + SN65HVD1050D transceiver
Software: JetPack 5.1.2

How to replicate:
The easiest way to bring the bus into the BUS_OFF state is to short the two CAN wires to each other. I know that may seem an unlikely scenario to the software team, but the short is only used for testing.
We have seen in our real products that the bus can enter this state under certain conditions that are hard to replicate, and it does not recover, so it is a problem for us.

When the mttcan driver enters this state for whatever reason and the fault condition is then removed, the driver returns to the normal state (ERROR_ACTIVE). It can receive messages, but sending messages is no longer possible.

Full CAN functionality is restored if the mttcan kernel module is reloaded with rmmod and modprobe, but that is not a valid solution for real operations.

Solutions tried so far:

As I said, we do experience this issue in our real-life operations, where we have hundreds of robots running, so this is a very hot topic for us.

Hi sergey25,

Do you mean shorting CAN-TX and CAN-RX, or CAN-H and CAN-L?

Does this patch not work for your case?

Could you share the result of the following command on your board when you are in this state?

$ ip -s -d link show can0

Also, do you use the ip link set command to reconfigure CAN after you remove the short between the two CAN wires?
Please share the following information for further check…

  1. detailed reproduce steps
  2. the block diagram of your connection
  3. full serial console log
  1. We short CAN H and CAN L, of course.
  2. The patch has not helped, otherwise I would not have created a new thread.
  3. This is the dmesg log:
[   18.578353] can: controller area network core
[   18.600218] can: raw protocol
[   18.687789] net can0: mttcan device registered (regs=00000000bdb0d1df, irq=14)
[   18.694504] net can1: mttcan device registered (regs=000000001cecf914, irq=15)
[   18.698488] mttcan c310000.mttcan can0: Bitrate set
[   18.700377] mttcan_controller_config: ctrlmode 10
[   18.700400] mttcan c310000.mttcan can0: Bitrate set
[   18.700556] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[  183.202493] mttcan c310000.mttcan can0: Bit0 Error Detected
[  183.208404] mttcan c310000.mttcan can0: IR 0x8000000 PSR 0x71d
[  183.214701] mttcan c310000.mttcan can0: entered error warning state
[  183.221283] mttcan c310000.mttcan can0: entered error passive state
[  183.227819] mttcan c310000.mttcan can0: entered bus off state
[  183.233782] mttcan c310000.mttcan can0: Bit0 Error Detected
[  183.239563] mttcan c310000.mttcan can0: IR 0xb800000 PSR 0x7e5
[  184.251275] mttcan_controller_config: ctrlmode 10
[  184.251311] mttcan c310000.mttcan can0: Bitrate set
[  184.251320] mttcan c310000.mttcan can0: wait for bus off seq
[  184.263412] mttcan c310000.mttcan can0: Bit0 Error Detected
[  184.269257] mttcan c310000.mttcan can0: IR 0xa000000 PSR 0x70d
[  184.276562] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
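
For anyone decoding the PSR values in the log above, here is a minimal sketch. It assumes that the value mttcan prints as "PSR" is the Protocol Status Register of the Bosch M_CAN/M_TTCAN core with its usual layout (LEC in bits 0-2, ACT in bits 3-4, EP/EW/BO in bits 5-7). The function name `decode_psr` is made up for illustration and is not part of the driver.

```python
# Assumed M_CAN PSR layout: LEC = bits 0-2, ACT = bits 3-4,
# EP = bit 5, EW = bit 6, BO = bit 7. Higher bits are ignored here.
LEC_NAMES = {
    0: "no error", 1: "stuff error", 2: "form error", 3: "ack error",
    4: "bit1 error", 5: "bit0 error", 6: "crc error", 7: "no change",
}
ACT_NAMES = {0: "synchronizing", 1: "idle", 2: "receiver", 3: "transmitter"}


def decode_psr(psr: int) -> dict:
    """Split a PSR value into its low status fields (hypothetical helper)."""
    return {
        "lec": LEC_NAMES[psr & 0x7],
        "activity": ACT_NAMES[(psr >> 3) & 0x3],
        "error_passive": bool(psr & (1 << 5)),
        "error_warning": bool(psr & (1 << 6)),
        "bus_off": bool(psr & (1 << 7)),
    }


# The three PSR values taken from the dmesg log in this thread.
for value in (0x71D, 0x7E5, 0x70D):
    print(hex(value), decode_psr(value))
```

Under that assumption, 0x7e5 has the bus-off, error-passive and error-warning flags set, while 0x70d (logged after the automatic restart) has no bus-off flag but still carries a Bit0 error as the last error code.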

These are the interface statistics after the BUS_OFF state:

12: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
    link/can  promiscuity 0 minmtu 0 maxmtu 0
    can <BERR-REPORTING> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 1000
          bitrate 500000 sample-point 0.870
          tq 20 prop-seg 43 phase-seg1 43 phase-seg2 13 sjw 1
          mttcan: tseg1 2..255 tseg2 0..127 sjw 1..127 brp 1..511 brp-inc 1
          mttcan: dtseg1 1..31 dtseg2 0..15 dsjw 1..15 dbrp 1..15 dbrp-inc 1
          clock 50000000
          re-started bus-errors arbit-lost error-warn error-pass bus-off
          1          3          0          1          1          1         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    146210     18572    3       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    44316      7386     0       0       0       0

The RX counter increments both before and after BUS_OFF state. The TX counter stops incrementing after BUS_OFF state.
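
The check described above (RX still counting, TX frozen) could be automated roughly as follows. This is only a sketch and assumes the `ip -s -d link show can0` output has the same layout as the one pasted in this thread; other iproute2 versions may print different columns. `parse_can_link` and `tx_stalled` are hypothetical helper names.

```python
import re


def parse_can_link(text: str) -> dict:
    """Extract the CAN state and counters from `ip -s -d link show` output."""
    info = {}
    # The CAN state is on the line that starts with "can", not the first line.
    match = re.search(r"^\s*can\s.*?\bstate\s+(\S+)", text, re.MULTILINE)
    if match:
        info["state"] = match.group(1)
    lines = text.splitlines()
    # Each counter block is a header line followed by a line of values.
    for header, values in zip(lines, lines[1:]):
        names = header.split()
        if not names:
            continue
        if names[0] == "re-started":
            prefix = ""
        elif names[0] in ("RX:", "TX:"):
            prefix = names[0][:-1].lower() + "_"
            names = names[1:]
        else:
            continue
        for name, value in zip(names, values.split()):
            if value.isdigit():
                info[prefix + name] = int(value)
    return info


def tx_stalled(before: dict, after: dict) -> bool:
    """True if RX packets advanced between two snapshots but TX did not."""
    return (after["rx_packets"] > before["rx_packets"]
            and after["tx_packets"] == before["tx_packets"])
```

The result is only meaningful if the application actually tried to send between the two snapshots, for example by capturing the output once, running the traffic for a few seconds, and capturing it again.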

  1. The data provided here comes from the simplest CAN topology: just a single node connected to the bus through a CAN transceiver.

  2. Detailed reproduction steps:

  • Setup CAN interface
modprobe can
modprobe can_raw
modprobe mttcan

ip link set can0 type can bitrate 500000 berr-reporting on restart-ms 1000
ip link set up can0
  • Start message communication through the SocketCAN API
  • Short CAN H and CAN L together and release them after 3 seconds.
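
The "start message communication" step can be sketched with Python's raw SocketCAN support as below. This is an illustration under assumptions, not the application used in our products: the helper names, the CAN ID and the payload are made up, and it only counts the sends that fail with ENOBUFS ("No buffer space available").

```python
import errno
import socket
import struct
import time

# Classic struct can_frame: 32-bit CAN ID, length byte, 3 pad bytes, 8 data bytes.
CAN_FRAME_FMT = "=IB3x8s"


def build_frame(can_id: int, data: bytes) -> bytes:
    """Pack a classic CAN frame (hypothetical helper)."""
    if len(data) > 8:
        raise ValueError("classic CAN payload is at most 8 bytes")
    return struct.pack(CAN_FRAME_FMT, can_id, len(data), data.ljust(8, b"\x00"))


def is_tx_queue_full(exc: OSError) -> bool:
    """True for the 'No buffer space available' error seen after BUS_OFF."""
    return exc.errno == errno.ENOBUFS


def send_periodically(ifname: str, can_id: int, data: bytes,
                      count: int, period_s: float) -> int:
    """Send `count` frames on `ifname`; return how many failed with ENOBUFS."""
    failures = 0
    with socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW) as sock:
        sock.bind((ifname,))
        frame = build_frame(can_id, data)
        for _ in range(count):
            try:
                sock.send(frame)
            except OSError as exc:
                if not is_tx_queue_full(exc):
                    raise
                failures += 1
            time.sleep(period_s)
    return failures
```

Calling something like `send_periodically("can0", 0x123, b"\x01\x02", 100, 0.1)` while shorting and releasing the wires should show whether the failure count keeps growing after the fault is removed.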

Is there any other operation that could reproduce a similar issue, other than shorting CAN-H and CAN-L?
This seems like an unexpected use of the CAN bus.

Of course we don’t use the bus like this. It is a test case, but it is related to real life.

As I described in my first message, we do face the BUS_OFF state in real operations. There it occurs for other reasons (not shorted wires), which are not easily reproducible in the lab. However, shorting the wires IS easily reproducible, and it can be used to test this functionality.
Also, in real-world robotics, intermittent shorts between wires can happen, and the system should recover from this state and continue operating to be truly robust.

Hello KevinFFF

Is there any progress on this issue?

Have you tried reconfiguring the CAN interface instead of reloading the mttcan module, to check whether it recovers?

sudo ip link set can0 down
sudo ip link set can0 up type can bitrate 100000

We tried this before applying the patch from "MTTCAN on Orin NX issues - #22 by KevinFFF".
At that time it did not help.
Should the behaviour change with the patch applied?

Yes, the patch should help with the BUS_OFF state and the “No buffer space available” issue.

Unfortunately, bringing the interface down and then up does not help. cansend still returns “No buffer space available”.

Could you try using cangen to send more packets and check whether it recovers?

Unfortunately, it does not recover :(

Let me check this issue internally and get back to you when there is any result.

Hello KevinFFF

Any updates on the topic?

Any update?