AGX Orin: CAN bus does not recover from BUS_OFF state

I know there are a number of threads related to this issue, but none of them seems to be solved, so I have to post it once again.

The problem:
If the CAN bus enters the BUS OFF state, the interface remains in an inoperable state and reports
write: No buffer space available
Hardware: AGX Orin Dev Kit + SN65HVD1050D transceiver
Software: JetPack 5.1.2

How to replicate:
The easiest way to bring the bus into the BUS OFF state is to short the two CAN wires to each other. I know this is not a likely scenario for a software team; the short is only used for testing.
We have seen in our real products that the bus can enter this state under certain conditions that are hard to replicate, and it does not recover, so it is a problem for us.

When the MTTCAN driver enters this state for whatever reason, and the fault condition is then removed, the driver returns to the normal state (ERROR_ACTIVE). It can receive messages, but sending messages is no longer possible.

Full CAN functionality is restored if the mttcan kernel module is reloaded with rmmod and modprobe, but that is not a valid solution for real operations.
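
For reference, the reload workaround amounts to roughly the following. This is only a sketch: the function name reload_mttcan and the RUN dry-run hook are made up for illustration, and the bitrate and restart-ms values are the ones from the reproduction steps in this thread. Unloading mttcan removes every interface the driver provides, so on a board with can0 and can1 both go away during the reload.

```shell
#!/bin/sh
# Sketch of the module-reload workaround. Must run as root.
# RUN is a hypothetical hook: set RUN=echo to print the commands
# instead of executing them.
reload_mttcan() {
    ifc="${1:-can0}"
    bitrate="${2:-500000}"
    # Bring the interface down first; ignore failure if it is already down.
    ${RUN:-} ip link set "$ifc" down
    # Reload the driver, which resets both hardware and driver state.
    ${RUN:-} rmmod mttcan || return 1
    ${RUN:-} modprobe mttcan || return 1
    # Reapply the CAN configuration and bring the interface back up.
    ${RUN:-} ip link set "$ifc" type can bitrate "$bitrate" berr-reporting on restart-ms 1000 || return 1
    ${RUN:-} ip link set "$ifc" up
}
```

Calling `RUN=echo reload_mttcan can0 500000` prints the five commands without touching the system, which is a safe way to check them before running the function for real.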

Solutions tried so far:

As I said, we do experience this issue in our real-life operations, where we have hundreds of robots running, so this is a very hot topic for us.

Hi sergey25,

Do you mean short CAN-TX and CAN-RX? or CAN-H and CAN-L?

Does this patch not work for your case?

Could you share the result of the following command on your board when you are in this state?

$ ip -s -d link show can0

Also, do you use the ip link set command to reconfigure CAN after you remove the short between the two CAN wires?
Please share the following information for further check…

  1. detailed reproduce steps
  2. the block diagram of your connection
  3. full serial console log
  1. We short CAN H and CAN L, of course
  2. The patch has not helped, otherwise I would not have created a new thread
  3. This is the dmesg log:
[   18.578353] can: controller area network core
[   18.600218] can: raw protocol
[   18.687789] net can0: mttcan device registered (regs=00000000bdb0d1df, irq=14)
[   18.694504] net can1: mttcan device registered (regs=000000001cecf914, irq=15)
[   18.698488] mttcan c310000.mttcan can0: Bitrate set
[   18.700377] mttcan_controller_config: ctrlmode 10
[   18.700400] mttcan c310000.mttcan can0: Bitrate set
[   18.700556] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[  183.202493] mttcan c310000.mttcan can0: Bit0 Error Detected
[  183.208404] mttcan c310000.mttcan can0: IR 0x8000000 PSR 0x71d
[  183.214701] mttcan c310000.mttcan can0: entered error warning state
[  183.221283] mttcan c310000.mttcan can0: entered error passive state
[  183.227819] mttcan c310000.mttcan can0: entered bus off state
[  183.233782] mttcan c310000.mttcan can0: Bit0 Error Detected
[  183.239563] mttcan c310000.mttcan can0: IR 0xb800000 PSR 0x7e5
[  184.251275] mttcan_controller_config: ctrlmode 10
[  184.251311] mttcan c310000.mttcan can0: Bitrate set
[  184.251320] mttcan c310000.mttcan can0: wait for bus off seq
[  184.263412] mttcan c310000.mttcan can0: Bit0 Error Detected
[  184.269257] mttcan c310000.mttcan can0: IR 0xa000000 PSR 0x70d
[  184.276562] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
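
The PSR values in the log can be decoded by hand. The sketch below assumes that MTTCAN follows the Bosch M_CAN register layout for the Protocol Status Register (LEC in bits 2:0, ACT in bits 4:3, EP in bit 5, EW in bit 6, BO in bit 7); that is an assumption, although LEC = 5 lines up with the "Bit0 Error Detected" messages printed next to each value. The function name decode_psr is illustrative only.

```shell
#!/bin/sh
# Decode the low byte of a PSR value as printed by the driver ("PSR 0x7e5").
# Assumed layout (Bosch M_CAN): LEC = bits 2:0, ACT = bits 4:3,
# EP = bit 5 (error passive), EW = bit 6 (error warning), BO = bit 7 (bus off).
decode_psr() {
    psr=$(( $1 ))
    # Last error code
    case $(( psr & 7 )) in
        0) lec=none ;;
        1) lec=stuff ;;
        2) lec=form ;;
        3) lec=ack ;;
        4) lec=bit1 ;;
        5) lec=bit0 ;;
        6) lec=crc ;;
        7) lec=no-change ;;
    esac
    # Current activity of the controller
    case $(( (psr >> 3) & 3 )) in
        0) act=synchronizing ;;
        1) act=idle ;;
        2) act=receiver ;;
        3) act=transmitter ;;
    esac
    echo "LEC=$lec ACT=$act EP=$(( (psr >> 5) & 1 )) EW=$(( (psr >> 6) & 1 )) BO=$(( (psr >> 7) & 1 ))"
}
```

Under that assumed layout, `decode_psr 0x71d` gives LEC=bit0 ACT=transmitter with no error flags set, `decode_psr 0x7e5` gives LEC=bit0 ACT=synchronizing EP=1 EW=1 BO=1, and `decode_psr 0x70d` (logged after the restart) gives LEC=bit0 ACT=idle with EP, EW and BO cleared.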

These are the interface statistics after the BUS_OFF state:

12: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
    link/can  promiscuity 0 minmtu 0 maxmtu 0
    can <BERR-REPORTING> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 1000
          bitrate 500000 sample-point 0.870
          tq 20 prop-seg 43 phase-seg1 43 phase-seg2 13 sjw 1
          mttcan: tseg1 2..255 tseg2 0..127 sjw 1..127 brp 1..511 brp-inc 1
          mttcan: dtseg1 1..31 dtseg2 0..15 dsjw 1..15 dbrp 1..15 dbrp-inc 1
          clock 50000000
          re-started bus-errors arbit-lost error-warn error-pass bus-off
          1          3          0          1          1          1         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped overrun mcast
    146210     18572    3       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    44316      7386     0       0       0       0

The RX counter increments both before and after BUS_OFF state. The TX counter stops incrementing after BUS_OFF state.
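
This symptom (state ERROR-ACTIVE, but a TX packet counter that no longer moves) could be checked automatically. The following is a rough sketch that parses the `ip -s -d link show` output shown above; the helper names can_state, can_tx_packets and tx_stalled are invented for this example, and the parsing assumes the two-line "TX: ... / numbers" layout printed by this iproute2 version.

```shell
#!/bin/sh
# Helpers that read `ip -s -d link show <ifc>` output on stdin.

# Print the CAN controller state (e.g. ERROR-ACTIVE, BUS-OFF),
# skipping the generic link "state UP" on the first line.
can_state() {
    awk '{ for (i = 1; i < NF; i++)
               if ($i == "state" && $(i+1) ~ /^(ERROR-|BUS-OFF|STOPPED|SLEEPING)/) {
                   print $(i+1); exit
               } }'
}

# Print the TX packet counter: second column of the line after "TX:".
can_tx_packets() {
    awk '/^ *TX:/ { getline; print $2; exit }'
}

# Succeeds if the controller reports ERROR-ACTIVE but the TX counter
# did not change during the interval. Only meaningful while an
# application is actively trying to send on the interface.
tx_stalled() {
    ifc="$1"
    interval="${2:-2}"
    before=$(ip -s -d link show "$ifc" | can_tx_packets)
    sleep "$interval"
    snapshot=$(ip -s -d link show "$ifc")
    after=$(printf '%s\n' "$snapshot" | can_tx_packets)
    state=$(printf '%s\n' "$snapshot" | can_state)
    [ "$state" = "ERROR-ACTIVE" ] && [ "$before" = "$after" ]
}
```

For example, `tx_stalled can0 5 && echo "can0 TX looks stuck"` could be run from a watchdog while the application is sending.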

  1. The data provided here has the simplest CAN topology: just a single node connected to the bus through a CAN transceiver.

  2. detailed reproduce steps:

  • Setup CAN interface
modprobe can
modprobe can_raw
modprobe mttcan

ip link set can0 type can bitrate 500000 berr-reporting on restart-ms 1000
ip link set up can0
  • Start message communication through socketcan API
  • Short CAN H and CAN L and release them after 3 seconds.
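
To check the TX path after the short is released, a single test frame can be sent and the result classified. This is a minimal sketch that assumes can-utils (cansend) is installed; the function name probe_tx and the frame 123#DEADBEEF are arbitrary example choices. Note that success only means the frame was accepted into the transmit queue, so with qlen 10 the first few frames may still be accepted before the error shows up.

```shell
#!/bin/sh
# Send one test frame and classify the result. Needs can-utils (cansend).
# The frame ID and payload are arbitrary example values.
probe_tx() {
    ifc="${1:-can0}"
    if err=$(cansend "$ifc" "123#DEADBEEF" 2>&1); then
        # The frame was queued; this does not prove it reached the bus.
        echo "ok"
    else
        case "$err" in
            *"No buffer space available"*) echo "tx-stuck" ;;
            *) echo "error: $err" ;;
        esac
        return 1
    fi
}
```

Calling `probe_tx can0` in a loop, for example 20 times, makes the "tx-stuck" result show up once the queue has filled.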

Is there any other operation that reproduces a similar issue, other than shorting CAN-H and CAN-L?
It seems an unexpected usage for CAN bus.

Of course we don’t use the bus like this. It is a test case, but it is related to real life.

As I described in my first message, we do face the BUS_OFF state in real operations. There it occurs for other reasons (not shorted wires), which are not easily reproducible in the lab. However, shorting the wires IS easily reproducible, and it can be used to test this functionality.
Also, in the real world of robotics, intermittent wire shorts can happen, and the system should recover from this state and continue operation to be really robust.

Hello KevinFFF

Is there any progress on this issue?

Have you tried reconfiguring the CAN interface instead of reloading the mttcan module, to check whether it recovers?

sudo ip link set can0 down
sudo ip link set can0 up type can bitrate 100000

We tried this before applying the patch from "MTTCAN on Orin NX issues - #22 by KevinFFF".
At that time it did not help.
Should the behaviour change with the patch applied?

Yes, the patch should help with this BUS_OFF state and the “No buffer space available” issue.

Unfortunately, bringing the interface down and then up does not help. cansend still returns “No buffer space available”.

Could you try using cangen to send more packets and check if it could get recovered?

Unfortunately, it does not recover :(

Let me check this issue with the internal team and get back to you if there’s any result.

Hello KevinFFF

Any updates on the topic?

Any update?

Hi @KevinFFF

Any updates on this issue?

Sorry, I’m still checking this issue with the internal team.
We are busy with another release at the end of Oct.

Do you have other modules, such as the MCP2515, that reproduce the same issue?

Hi @sergey25,

Could you apply the following patch and verify?

--- a/drivers/net/can/mttcan/native/m_ttcan_linux.c
+++ b/drivers/net/can/mttcan/native/m_ttcan_linux.c
@@ -1338,7 +1338,14 @@ static int mttcan_close(struct net_device *dev)
 	napi_disable(&priv->napi);
 	mttcan_stop(priv);
 	free_irq(dev->irq, dev);
+
+	/* When we do power_down, it resets the mttcan HW by setting
+	 * INIT bit. This clears the internal state of mttcan HW.
+	 * We also then need to clear the internal states of driver.
+	 */
+	priv->ttcan->tx_object = 0;
 	priv->hwts_rx_en = false;
+

I’ve verified that it fixes this use case (shorting CAN-H and CAN-L leaves CAN unable to recover) on the AGX Orin devkit.

If you still hit the issue, please add prints in this function to check whether you applied it correctly.

Hi @KevinFFF,

I’m a colleague of @sergey25 and was applying the patch for him. Unfortunately, for us the Orin did not recover from this state.

Just to verify if I applied it correctly (because I don’t really have any experience with kernel development), here are the steps I performed:

  1. Apply the patch to the file, then basically follow the Kernel Customization guide
  2. To get the kernel module, run make ARCH=arm64 O=$KERNEL_OUT modules_install INSTALL_MOD_STRIP=1 INSTALL_MOD_PATH=$MODULES_OUT
  3. Replace the system file at /lib/modules/5.10.120-tegra/kernel/drivers/net/can/mttcan/native/mttcan.ko with $MODULES_OUT/lib/modules/5.10.120-tegra/kernel/drivers/net/can/mttcan/native/mttcan.ko
  4. Reboot

Is this everything that should be necessary? Or are we missing something?

To verify that my changes actually had an effect, I slightly changed the bus off error message at m_ttcan_linux.c@714. This change I could see, so I think I applied it correctly. However, I also put a netdev_info into the mttcan_close function and that one never gets printed. So it seems like your change isn’t even being called. Any ideas what we might be doing wrong here?

Your steps seem correct to me.

Could you check if the following reproduce steps are the same as yours?

Step 1. Load kernel module
$ sudo modprobe mttcan

Step 2. Setup interfaces
$ sudo ip link set can0 up type can bitrate 100000 berr-reporting on restart-ms 1000
$ sudo ip link set can1 up type can bitrate 100000 berr-reporting on restart-ms 1000

Step 3. Short CAN-H/CAN-L of can0

Step 4. Send packets
$ cangen can0
=> failed, error messages printed

Step 5. Check can0 status
$ sudo ip -d -s link show can0
8: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
    link/can  promiscuity 0 minmtu 0 maxmtu 0 
    can <BERR-REPORTING> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 1000 
          bitrate 100000 sample-point 0.872 
          tq 80 prop-seg 54 phase-seg1 54 phase-seg2 16 sjw 1
          mttcan: tseg1 2..255 tseg2 0..127 sjw 1..127 brp 1..511 brp-inc 1
          mttcan: dtseg1 1..31 dtseg2 0..15 dsjw 1..15 dbrp 1..15 dbrp-inc 1
          clock 50000000 
          re-started bus-errors arbit-lost error-warn error-pass bus-off
          2          8          0          2          2          2         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
    RX: bytes  packets  errors  dropped overrun mcast   
    128        16       8       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    49         8        0       0       0       0 

Step 6. Recover CAN-H/CAN-L connection

Step 7. Send packets
=> write: No buffer space available

Step 8. Reconfigure CAN interface
$ sudo ip link set can0 down
$ sudo ip link set can1 down
$ sudo ip link set can0 up type can bitrate 100000 berr-reporting on restart-ms 1000
$ sudo ip link set can1 up type can bitrate 100000 berr-reporting on restart-ms 1000

Step 9. Send packets
=> write: No buffer space available

After applying the patch, sending packets in Step 9 should work.
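
Steps 8 and 9 can be combined into one check. The sketch below is only an illustration: the function name verify_recovery is made up, it must run as root, and it assumes can-utils (cangen) is installed. Since the patch touches mttcan_close, it presumably only takes effect when the interface is really brought down and up again, which is what the first two commands do.

```shell
#!/bin/sh
# Steps 8-9 as one function: cycle the interface, then try to send.
# Run as root. cangen -n 20 sends 20 random frames and exits non-zero
# on a write error such as "No buffer space available".
verify_recovery() {
    ifc="${1:-can0}"
    bitrate="${2:-100000}"
    ip link set "$ifc" down || return 1
    ip link set "$ifc" up type can bitrate "$bitrate" berr-reporting on restart-ms 1000 || return 1
    if cangen -n 20 "$ifc" 2>/dev/null; then
        echo "recovered"
    else
        echo "still failing"
        return 1
    fi
}
```

With the default cangen gap of 200 ms, `verify_recovery can0 100000` takes a few seconds. Twenty frames are used so that a stuck transmit queue of length 10 has time to fill up and produce the error.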