I know there are a number of threads related to this one, but none of them seems to be solved, so I have to post it once again.
The problem:
If the CAN bus enters the BUS OFF state, the interface remains inoperable for transmission and reports write: No buffer space available
Hardware: AGX Orin Dev Kit + SN65HVD1050D transceiver
Software: JetPack 5.1.2
How to replicate:
The easiest way to bring the bus into the BUS OFF state is to short the two CAN wires to each other. I know that is not a very likely scenario for a software team, but the short is only used for testing.
We have seen in our real products that the bus can enter this state under certain conditions that are hard to replicate, and it does not recover, so it is a problem for us.
When the mttcan driver enters this state for whatever reason and the fault condition is then removed, the driver returns to the normal state (ERROR_ACTIVE). It can receive messages, but sending messages is no longer possible.
Full CAN functionality is restored if the mttcan kernel module is reloaded with rmmod and modprobe, but that is not a valid solution for real operations.
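To make the symptom easier to detect from application code, here is a minimal Python sketch of a SocketCAN sender that reports the stuck-transmit condition instead of crashing. It is only an illustration: the helper names build_frame and try_send and the "tx-stuck" label are made up for this example, and it assumes a classic (non-FD) CAN frame on a Linux raw CAN socket.

```python
import errno
import struct

# Layout of struct can_frame as used by SocketCAN:
# 32-bit CAN ID, 8-bit data length, 3 padding bytes, 8 data bytes.
CAN_FRAME_FMT = "=IB3x8s"


def build_frame(can_id: int, data: bytes) -> bytes:
    """Pack a classic CAN frame (at most 8 data bytes) for a raw CAN socket."""
    if len(data) > 8:
        raise ValueError("classic CAN frames carry at most 8 data bytes")
    return struct.pack(CAN_FRAME_FMT, can_id, len(data), data.ljust(8, b"\x00"))


def try_send(sock, can_id: int, data: bytes) -> str:
    """Send one frame. Return 'ok', or 'tx-stuck' (hypothetical label) on ENOBUFS."""
    try:
        sock.send(build_frame(can_id, data))
        return "ok"
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            # The errno behind "write: No buffer space available".
            return "tx-stuck"
        raise  # anything else is a different problem


# Usage on the target (Linux only, interface name "can0" assumed):
#   import socket
#   s = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
#   s.bind(("can0",))
#   print(try_send(s, 0x123, b"\x01\x02"))
```

A watchdog built on this could count consecutive "tx-stuck" results before deciding to reset the interface, since a few ENOBUFS errors can also occur on a healthy but congested bus.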
Do you mean shorting CAN-TX and CAN-RX, or CAN-H and CAN-L?
Does this patch not work for your case?
Could you share the output of the following command on your board while it is in this state?
$ ip -s -d link show can0
Also, do you use the ip link set command to reconfigure CAN after you remove the short between the two CAN wires?
Please share the information above so we can check further.
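For reference, the down/up reconfiguration asked about here could be scripted roughly as follows. This is a sketch only: the function names are hypothetical, and the bitrate and restart-ms defaults are simply the values used in the reproduction steps later in this thread.

```python
import subprocess


def reconfigure_commands(iface: str, bitrate: int, restart_ms: int) -> list:
    """Return the ip(8) argument lists for a down/up cycle of one CAN interface."""
    return [
        ["ip", "link", "set", iface, "down"],
        ["ip", "link", "set", iface, "up", "type", "can",
         "bitrate", str(bitrate), "berr-reporting", "on",
         "restart-ms", str(restart_ms)],
    ]


def reconfigure(iface: str, bitrate: int = 100000, restart_ms: int = 1000) -> None:
    """Run the down/up cycle. Must be run as root; raises if a command fails."""
    for argv in reconfigure_commands(iface, bitrate, restart_ms):
        subprocess.run(argv, check=True)


# Usage (as root): reconfigure("can0")
```

Bringing the interface down is what invokes the driver's close function, so this cycle is the point at which a fix inside mttcan_close would take effect.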
Of course we don’t use the bus like this. It is a test case, but it is related to real life.
As I described in my first message, we do face the BUS_OFF state in real operations. There it occurs for other reasons (not shorted wires), but those are not easily reproducible in the lab. Shorting the wires, however, IS easily reproducible, and it can be used to test this functionality.
Also, in real-world robotics, intermittent shorts between wires can happen, and the system should recover from this state and continue operating to be truly robust.
We tried this before applying the patch from MTTCAN on Orin NX issues - #22 by KevinFFF
At that time it did not help.
Should the behaviour change with the patch applied?
--- a/drivers/net/can/mttcan/native/m_ttcan_linux.c
+++ b/drivers/net/can/mttcan/native/m_ttcan_linux.c
@@ -1338,7 +1338,14 @@ static int mttcan_close(struct net_device *dev)
 	napi_disable(&priv->napi);
 	mttcan_stop(priv);
 	free_irq(dev->irq, dev);
+
+	/* When we do power_down, it resets the mttcan HW by setting
+	 * INIT bit. This clears the internal state of mttcan HW.
+	 * We also then need to clear the internal states of driver.
+	 */
+	priv->ttcan->tx_object = 0;
 	priv->hwts_rx_en = false;
+
I’ve verified that it fixes this use case (CAN cannot recover after CAN-H and CAN-L are shorted) on the AGX Orin devkit.
If you still hit the issue, please add prints in this function to check if you apply it correctly.
I’m a colleague of @sergey25 and was applying the patch for him. Unfortunately, for us the Orin did not recover from this state.
Just to verify if I applied it correctly (because I don’t really have any experience with kernel development), here are the steps I performed:
Apply the patch to the file, then basically follow the Kernel Customization guide
To get the kernel module, run make ARCH=arm64 O=$KERNEL_OUT modules_install INSTALL_MOD_STRIP=1 INSTALL_MOD_PATH=$MODULES_OUT
Replace the system file at /lib/modules/5.10.120-tegra/kernel/drivers/net/can/mttcan/native/mttcan.ko with $MODULES_OUT/lib/modules/5.10.120-tegra/kernel/drivers/net/can/mttcan/native/mttcan.ko
Reboot
Is this everything that should be necessary? Or are we missing something?
To verify that my changes actually had an effect, I slightly changed the bus off error message at m_ttcan_linux.c@714. This change I could see, so I think I applied it correctly. However, I also put a netdev_info into the mttcan_close function and that one never gets printed. So it seems like your change isn’t even being called. Any ideas what we might be doing wrong here?
Could you check whether the following reproduction steps are the same as yours?
Step 1. Load kernel module
$ sudo modprobe mttcan
Step 2. Setup interfaces
$ sudo ip link set can0 up type can bitrate 100000 berr-reporting on restart-ms 1000
$ sudo ip link set can1 up type can bitrate 100000 berr-reporting on restart-ms 1000
Step 3. Short CAN-H/CAN-L of can0
Step 4. Send packets
$ cangen can0
=> failed, error messages printed
Step 5. Check can0 status
$ sudo ip -d -s link show can0
8: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
link/can promiscuity 0 minmtu 0 maxmtu 0
can <BERR-REPORTING> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 1000
bitrate 100000 sample-point 0.872
tq 80 prop-seg 54 phase-seg1 54 phase-seg2 16 sjw 1
mttcan: tseg1 2..255 tseg2 0..127 sjw 1..127 brp 1..511 brp-inc 1
mttcan: dtseg1 1..31 dtseg2 0..15 dsjw 1..15 dbrp 1..15 dbrp-inc 1
clock 50000000
re-started bus-errors arbit-lost error-warn error-pass bus-off
2 8 0 2 2 2 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
RX: bytes packets errors dropped overrun mcast
128 16 8 0 0 0
TX: bytes packets errors dropped carrier collsns
49 8 0 0 0 0
Step 6. Restore the CAN-H/CAN-L connection
Step 7. Send packets
=> write: No buffer space available
Step 8. Reconfigure CAN interface
$ sudo ip link set can0 down
$ sudo ip link set can1 down
$ sudo ip link set can0 up type can bitrate 100000 berr-reporting on restart-ms 1000
$ sudo ip link set can1 up type can bitrate 100000 berr-reporting on restart-ms 1000
Step 9. Send packets
=> write: No buffer space available
After applying the patch, sending packets in Step 9 should succeed.
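When comparing results, it can help to pull the controller state and error counters out of the Step 5 output automatically. The following Python sketch parses the plain-text output of ip -d -s link show. The function name parse_can_link is made up for this example, and the parsing assumes the text layout shown in Step 5, which may differ between iproute2 versions.

```python
import re

# Excerpt of the Step 5 output from this thread.
SAMPLE = """\
can <BERR-REPORTING> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 1000
re-started bus-errors arbit-lost error-warn error-pass bus-off
2 8 0 2 2 2 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
"""


def parse_can_link(text: str) -> dict:
    """Extract the CAN state and error counters from `ip -d -s link show` text."""
    result = {}
    # The CAN state follows the "can" keyword and optional <FLAGS>,
    # unlike the generic "state UP" on the first line of the output.
    match = re.search(r"\bcan\s+(?:<[^>]*>\s+)?state\s+([A-Z-]+)", text)
    if match:
        result["state"] = match.group(1)
    lines = text.splitlines()
    for i, line in enumerate(lines[:-1]):
        names = line.split()
        if names and names[0] == "re-started":
            # Counter values sit on the next line, one per header name.
            values = lines[i + 1].split()[:len(names)]
            result.update({n: int(v) for n, v in zip(names, values)})
            break
    return result


info = parse_can_link(SAMPLE)
print(info["state"], info["bus-off"], info["re-started"])  # ERROR-ACTIVE 2 2
```

In the Step 5 output the bus-off and re-started counters are both 2 while the state reads ERROR-ACTIVE, which suggests the automatic restart (restart-ms 1000) ran and that the remaining failure is on the transmit path only.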