Orin: CAN bus not recovering from ERROR-PASSIVE

I am experiencing the exact same issue as this post: CAN bus not recovering from ERROR-PASSIVE.

Orin AGX. Jetpack 5.1.2.

About 50% of the time, the CAN bus does not recover from ERROR-PASSIVE.
Running rmmod mttcan followed by modprobe mttcan resolves it.

However, this is not acceptable for our application. It requires us to completely halt our application and restart it.

Can we get some help with this?

This has been fixed in Jetpack 5.1.3.

Great! Do you have any specifics as to what was fixed or changed?

@KevinFFF Could you please tell me what was changed in 5.1.3 to resolve this problem? And, is there a way for me to simply patch the fixed driver in 5.1.2?

@KevinFFF In comparing the mttcan source between 35.4.1 and 35.5, the following is the only change. This seems like it could resolve the problem when the CAN device is closed and reopened, but it would not resolve the bad state that the driver gets into, where it cannot recover from the ERROR-PASSIVE state. Please help!

Hi tgreier,

Are you using the devkit or custom board for AGX Orin?

Would it work if you apply that CAN patch from R35.5.0 to your R35.4.1?

Is there any issue with the ERROR-PASSIVE state itself?
Do you know how it enters this state?
Continuing to send CAN packets may help it recover from this state.

@KevinFFF
We are using Forge Carrier (Connect Tech) for Orin AGX.

The patch from R35.5.0 will only allow us to not require a rmmod to resolve the bad state that the CAN interface enters. The patch WOULD NOT resolve the initial problem, being that the CAN interface enters this state to begin with.

The ERROR-PASSIVE state itself is not the problem. The problem is that the CAN interface does not flush its transmit buffers when the receiving CAN device comes on-line. The CAN interface should be able to automatically recover.

We believe it enters this state because the receiving CAN device is not yet powered-up. When the receiving CAN device powers-up, the CAN transmits should succeed and the transmit buffer should then be emptied. However, the CAN interface does not recover, and no CAN packets can be transmitted until a rmmod of the mttcan. The CAN device should be able to automatically recover.

Our transmitting CAN device is already continuously trying to transmit, but the CAN interface does not recover until we rmmod mttcan and restart our application.

@KevinFFF Are there more diagnostics that we can run to determine the source of the problem? In my opinion, the problem is that the mttcan driver is not recovering from the ERROR-PASSIVE state once the other CAN device is powered. The symptom occurs about 50% of the time.
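One low-effort diagnostic is to log the controller state that `ip -details link show can0` reports each time a transmit fails. The sketch below pulls the state name out of that output. It assumes the usual iproute2 layout, where the CAN section contains `state <NAME>` with names such as ERROR-ACTIVE, ERROR-PASSIVE, or BUS-OFF; the function name `can_state_from_ip` is made up for illustration.

```shell
# Print the CAN controller state found in `ip -details link show <iface>`
# output read from stdin. Prints nothing if no CAN state is present.
# Assumption: iproute2 prints the CAN state as "state <NAME>".
can_state_from_ip() {
    # The first line of the output also contains "state UP"/"state DOWN"
    # (the link state), so only accept CAN controller state names.
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "state" && $(i + 1) ~ /^(ERROR-|BUS-OFF|STOPPED|SLEEPING)/) {
                print $(i + 1)
                exit
            }
    }'
}
```

For example, `ip -details link show can0 | can_state_from_ip` could be run once per second alongside the sender, so the log shows whether the interface is still ERROR-PASSIVE or has gone BUS-OFF at the moment transmission stops.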

Is there any error message when you are trying to transmit CAN data at this moment?

Could you share the detailed reproduce steps on the board for us to verify it locally?
(Maybe transmit CAN data from can0 to can1, without connecting them at first?)

The error message received when attempting to transmit CAN data (when the other CAN device is not yet powered) is ‘transmit buffer full’.

  1. CAN device 0 is powered and configured, external CAN device 1 is not powered and not configured.
  2. CAN device 0 sends 10 packets per second. After a few seconds, CAN device 0 enters ERROR-PASSIVE state. CAN device 0 responds with ‘transmit buffer full’ error.
  3. After 5-10 minutes, external CAN device 1 is powered and configured.
  4. 50% of the time, CAN device 0 recovers and begins transmitting. 50% of the time, CAN device 0 does not recover and never transmits. The only way to recover is to either power cycle or rmmod mttcan and then reload it.
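
To tell the two outcomes in step 4 apart without watching the bus, one option is to check whether the kernel's transmit counter is still moving. This is a rough sketch: it assumes the driver only increments `tx_packets` for frames that actually completed on the bus (common for SocketCAN drivers, but not verified for mttcan here), and `SYSFS_ROOT`, `tx_packets`, and `tx_progressed` are hypothetical names introduced for this example.

```shell
# Read the kernel's transmitted-frame counter for a network interface.
# SYSFS_ROOT is a hypothetical override that exists only so the function
# can be exercised against a fake directory tree; it defaults to /sys.
tx_packets() {
    cat "${SYSFS_ROOT:-/sys}/class/net/$1/statistics/tx_packets"
}

# Return 0 if the interface transmitted at least one frame during the
# given interval (seconds), 1 if the counter did not move, and 2 if the
# counter could not be read.
tx_progressed() {
    iface=$1
    interval=$2
    before=$(tx_packets "$iface") || return 2
    sleep "$interval"
    after=$(tx_packets "$iface") || return 2
    [ "$after" -gt "$before" ]
}
```

Run after external CAN device 1 has been powered, something like `tx_progressed can0 5 || echo "can0 stalled"` would flag the runs in which CAN device 0 never resumes transmitting.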

Can these steps be run on a single AGX Orin using can0 and can1, so that we can verify?

May I also know how you configure and set up the CAN interface?

@KevinFFF I will assemble a script to hopefully replicate the problem.

Okay, please share a script so we can verify it locally, and also let us know how you connect the devices.

The solution presented in the previous discussion about this topic does not properly address the issue, which is for the driver to automatically recover from a bus-off state.

The patch only clears the tx_object when the interface is brought down.

I have tested a patch that adds the clearing of the tx_object which holds the bitmap status of the messages in the tx_mailboxes.

diff --git a/nvidia/drivers/net/can/mttcan/native/m_ttcan_linux.c b/nvidia/drivers/net/can/mttcan/native/m_ttcan_linux.c
index 18132a7..d506f99 100644
--- a/nvidia/drivers/net/can/mttcan/native/m_ttcan_linux.c
+++ b/nvidia/drivers/net/can/mttcan/native/m_ttcan_linux.c
@@ -1096,6 +1096,7 @@ static void mttcan_bus_off_restart(struct work_struct *work)
 restart:
 	netdev_dbg(dev, "restarted\n");
 	priv->can.can_stats.restarts++;
+	priv->ttcan->tx_object = 0;
 
 	mttcan_start(dev);
 	netif_carrier_on(dev);

The problem with this patch is that any messages that were not yet transmitted will be lost when the driver restarts.
I’m working on a fix that improves this, but it’s not fully fleshed out, as some messages are still lost.

diff --git a/nvidia/drivers/net/can/mttcan/native/m_ttcan_linux.c b/nvidia/drivers/net/can/mttcan/native/m_ttcan_linux.c
index 18132a7..43d8113 100644
--- a/nvidia/drivers/net/can/mttcan/native/m_ttcan_linux.c
+++ b/nvidia/drivers/net/can/mttcan/native/m_ttcan_linux.c
@@ -1079,6 +1079,8 @@ static void mttcan_bus_off_restart(struct work_struct *work)
 	struct net_device_stats *stats = &dev->stats;
 	struct sk_buff *skb;
 	struct can_frame *cf;
+	u32 msg_no;
+	u32 unsent_tx;
 
 	/* send restart message upstream */
 	skb = alloc_can_err_skb(dev, &cf);
@@ -1099,6 +1101,13 @@ restart:
 
 	mttcan_start(dev);
 	netif_carrier_on(dev);
+	/* attempt to retransmit any messages still pending in tx_object */
+	unsent_tx = priv->ttcan->tx_object;
+	while (unsent_tx) {
+		msg_no = ffs(unsent_tx) - 1;
+		ttcan_tx_trigger_msg_transmit(priv->ttcan, msg_no);
+		unsent_tx &= ~(1U << msg_no);
+	}
 }
 
 static void mttcan_start(struct net_device *dev)

I'm open to any suggestions on how to ensure that all messages that were not sent during the bus-off condition can be sent. I suspect that these fail at the netif layer, but I'm not 100% sure.

This is what I was using to check that messages are sent and how many are lost:

counter=0; while true; do payload=$(printf "1%014x" $counter); cansend can0 1F334454##$payload; ((counter++)); sleep 0.01; done

And this is how I bring up the CAN interface:

sudo ip link set can0 type can bitrate 1000000 dbitrate 4000000 fd on sample-point .80 dsample-point .80 restart-ms 1 berr-reporting on

We are working on fixing this CAN-related issue.

Please share the detailed steps you use to reproduce the issue.

What additional steps do you require?

Just set up a CAN network with a node that has slightly different bit timing settings, or a bus that is under-terminated.

Then have the Orin repeatedly send the message. Without the change I posted above to “mttcan_bus_off_restart”, the device will go into a bus-off state and not recover until the interface is brought down and back up.

With my change to “mttcan_bus_off_restart”, it will recover and keep sending until it goes into a bus-off state again.

But as I mentioned, some messages will be lost during the bus-off and restart.
If you have candump running on another device, you will see that some messages are missing. (Note that the payload is incremented by 1 with each message.)
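
Rather than eyeballing the capture, the missing counter values can be listed with a small script. This sketch assumes the capture was taken with `candump -L` (log format, `(timestamp) iface ID##<flags><data>` for CAN FD frames) and that the payload is the 7-byte counter produced by the sender loop above; `find_counter_gaps` is a hypothetical helper name.

```shell
# Read `candump -L` lines on stdin and report gaps in the hex counter
# carried by CAN FD frames with the given ID (default: 1F334454).
# Assumption: each matching frame looks like "<ID>##<flags nibble><hex data>".
find_counter_gaps() {
    want_id=${1:-1F334454}
    prev=
    missing=0
    while read -r _ts _iface frame; do
        case $frame in
            "$want_id"'##'*) ;;      # CAN FD frame with the ID under test
            *) continue ;;           # ignore everything else on the bus
        esac
        data=${frame#*'##'}          # flags nibble followed by data bytes
        data=${data#?}               # drop the CAN FD flags nibble
        cur=$((0x$data))             # payload is a big-endian hex counter
        if [ -n "$prev" ] && [ "$cur" -gt $((prev + 1)) ]; then
            echo "gap: $((prev + 1))..$((cur - 1))"
            missing=$((missing + cur - prev - 1))
        fi
        prev=$cur
    done
    echo "missing: $missing"
}
```

On the third node this could be used as `candump -L can0 | find_counter_gaps`, or on a saved log with `find_counter_gaps < capture.log`. Frames lost before the first received frame or after the last one are not counted, since the script only sees what arrived.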

So in summary, have a CAN bus with 3 nodes: the Orin under test, an incorrectly configured node to produce the bus-off condition in the “Orin under test”, and a third node that is correctly configured, to run candump during the test.

This seems like the expected result to me, since the configuration (such as timing) on the two sides does not match.
Please confirm that you used the correct CAN configuration to set up the interface on both sides.

I am still working on building a script to reproduce the issue. Please do not close this issue.

Thanks, please provide the steps/setup to reproduce the issue.