AGX Orin: CAN bus does not automatically recover from BUS_OFF state

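We are testing mttcan on Linux R35.4.1. When we short CAN_H to CAN_L, the CAN state machine enters BUS-OFF and then returns to ACTIVE, but no data can be sent: writes fail with "write: No buffer space available" and the interface can only receive. After adding the line priv->ttcan->tx_object = 0 to the mttcan_bus_off_restart function, we can send data again. Is this change correct, and does it have any side effects?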

Are you using the devkit or a custom board?

Did you refer to the fix in mttcan_close() from the other topic?

Custom board. Yes, we referred to the mttcan_close fix. We also made one more change: we removed the netif_stop_queue call from mttcan_start_xmit.

static netdev_tx_t mttcan_start_xmit(struct sk_buff *skb,
				     struct net_device *dev)
{
	int msg_no = -1;
	struct mttcan_priv *priv = netdev_priv(dev);
	struct canfd_frame *frame = (struct canfd_frame *)skb->data;

	if (can_dropped_invalid_skb(dev, skb))
		return NETDEV_TX_OK;

	if (can_is_canfd_skb(skb))
		frame->flags |= CAN_FD_FLAG;

	spin_lock_bh(&priv->tx_lock);

	/* Write Tx message to controller */
	msg_no = ttcan_tx_msg_buffer_write(priv->ttcan,
			(struct ttcanfd_frame *)frame);
	if (msg_no < 0)
		msg_no = ttcan_tx_fifo_queue_msg(priv->ttcan,
				(struct ttcanfd_frame *)frame);
	if (msg_no < 0) {
		//netif_stop_queue(dev);
		spin_unlock_bh(&priv->tx_lock);
		return NETDEV_TX_BUSY;
	}

Where do you add this line? Could you share the change?

Could you send and receive the expected data at this moment w/o any error messages?
Please also share the full dmesg for further check.

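Sorry, when we test shorting CAN_H to CAN_L, data still occasionally cannot be sent even with the code changes. By adding print statements we traced the problem to ttcan_tx_fifo_queue_msg, which keeps taking the branch if (ttcan->tx_object & (1 << put_idx)). Even though ttcan->tx_object is cleared on recovery from bus-off, this branch is still occasionally taken, so data cannot be sent.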
int ttcan_tx_fifo_queue_msg(struct ttcan_controller *ttcan,
			    struct ttcanfd_frame *ttcanfd)
{
	u32 txfqs_reg;
	u32 put_idx;

	txfqs_reg = ttcan_read32(ttcan, ADR_MTTCAN_TXFQS);

	/* Test for Tx FIFO/Queue full */
	if (txfqs_reg & MTT_TXFQS_TFQF_MASK)
		return -ENOMEM;

	/* Test if Tx index is previously reserved in SW */
	put_idx = (txfqs_reg & MTT_TXFQS_TFQPI_MASK) >> MTT_TXFQS_TFQPI_SHIFT;
	if (ttcan->tx_object & (1 << put_idx)) {
		//printk("%s %d\n",__func__,__LINE__);
		return -ENOMEM;
	}

	/* Write to CAN controller message RAM */
	ttcan_tx_ded_msg_write(ttcan, ttcanfd, put_idx);

	return put_idx;
}
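
To illustrate why a stale bit in tx_object would block transmission indefinitely, here is a minimal user-space sketch. It is a toy model, not the driver: struct toy_ttcan, toy_queue_msg and toy_hw_reset are hypothetical names, and the assumptions that the slot is reserved at queue time and that a controller re-init returns the put index to 0 are simplifications made for illustration only.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_NUM_SLOTS 16u

/* Toy model (NOT the mttcan driver): software bookkeeping for Tx slots. */
struct toy_ttcan {
	uint32_t tx_object;  /* bit n set => slot n is reserved by software */
	uint32_t hw_put_idx; /* stand-in for the TFQPI field of TXFQS */
};

/*
 * Same reservation test as in the quoted function: refuse the slot if
 * software still believes it is in use. Reserving the bit here is an
 * assumption of the model.
 */
static int toy_queue_msg(struct toy_ttcan *t)
{
	uint32_t put_idx = t->hw_put_idx;

	if (t->tx_object & (1u << put_idx))
		return -ENOMEM;

	t->tx_object |= 1u << put_idx;
	t->hw_put_idx = (put_idx + 1u) % TOY_NUM_SLOTS;
	return (int)put_idx;
}

/*
 * Assumed effect of a controller re-init after bus-off: the hardware put
 * index starts over, while the software mask changes only if the caller
 * clears it explicitly.
 */
static void toy_hw_reset(struct toy_ttcan *t, int clear_sw_mask)
{
	t->hw_put_idx = 0;
	if (clear_sw_mask)
		t->tx_object = 0;
}
```

In this model, a reset that leaves tx_object untouched makes every later toy_queue_msg call return -ENOMEM, because no Tx-complete event ever arrives to clear the stale bit. Clearing the mask during the reset lets the next call succeed.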

Below is the dmesg from bus-off through recovery to normal transmission:
[ 524.737699] mttcan c310000.mttcan can0: entered error passive state
[ 525.084584] mttcan c310000.mttcan can0: entered error passive state
[ 527.466724] mttcan c310000.mttcan can0: entered error warning state
[ 527.467986] mttcan c310000.mttcan can0: entered error passive state
[ 527.695419] mttcan c310000.mttcan can0: entered bus off state
[ 528.701229] Message RAM Configuration
| base addr |0x0c312000|
| sidfc_flssa |0x00000000|
| xidfc_flesa |0x00000040|
| rxf0c_f0sa |0x000000c0|
| rxf1c_f1sa |0x000009c0|
| rxbc_rbsa |0x000009c0|
| txefc_efsa |0x000009c0|
| txbc_tbsa |0x00000a40|
| tmc_tmsa |0x00000ec0|
| mram size |0x00001000|
[ 528.702612] Release 3.2.3 from 09.06.2018
[ 528.702624] mttcan_controller_config: ctrlmode 20
[ 528.702653] mttcan c310000.mttcan can0: Bitrate set
[ 528.702666] mttcan c310000.mttcan can0: wait for bus off seq
[ 528.714775] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[ 528.724354] mttcan c310000.mttcan can0: can_put_echo_skb: BUG! echo_skb 0 is occupied!
[ 528.727103] mttcan c310000.mttcan can0: entered error warning state
[ 528.728314] mttcan c310000.mttcan can0: entered error passive state
[ 528.748892] mttcan c310000.mttcan can0: can_put_echo_skb: BUG! echo_skb 1 is occupied!
[ 528.772795] mttcan c310000.mttcan can0: can_put_echo_skb: BUG! echo_skb 2 is occupied!
[ 528.796848] mttcan c310000.mttcan can0: can_put_echo_skb: BUG! echo_skb 3 is occupied!
[ 528.821801] mttcan c310000.mttcan can0: can_put_echo_skb: BUG! echo_skb 4 is occupied!
[ 528.845501] mttcan c310000.mttcan can0: can_put_echo_skb: BUG! echo_skb 5 is occupied!
[ 528.869049] mttcan c310000.mttcan can0: can_put_echo_skb: BUG! echo_skb 6 is occupied!
[ 528.894328] mttcan c310000.mttcan can0: can_put_echo_skb: BUG! echo_skb 7 is occupied!
[ 528.918615] mttcan c310000.mttcan can0: can_put_echo_skb: BUG! echo_skb 8 is occupied

Yes, I've just verified your changes and I get this error as well, which is why I asked whether you were seeing any errors.

Could you check whether bringing the CAN interface down and up helps in your case?

$ sudo ip link set can0 down
$ sudo ip link set can0 up type can bitrate 100000 berr-reporting on restart-ms 1000

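That is right. Although dmesg reports "BUG! echo_skb 1 is occupied!", the interface still recovers on its own and sends data normally, and the application no longer reports "write: No buffer space available". Occasionally, though, it fails to recover. The screenshot is from a test in which transmission stopped recovering after 33 short-circuit cycles. At that point, running sudo ip link set can0 down followed by sudo ip link set can0 up type can bitrate 100000 berr-reporting on restart-ms 1000 made transmission work again. Combined with the earlier tracing, it appears that between bus-off and ttcan_tx_fifo_queue_msg a bit in ttcan->tx_object stays set for some message number and is never released, so a valid msg_no can never be obtained.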

Please just apply the following patch, it should help for your use case.

--- a/drivers/net/can/mttcan/native/m_ttcan_linux.c
+++ b/drivers/net/can/mttcan/native/m_ttcan_linux.c
@@ -491,6 +491,7 @@ static int mttcan_state_change(struct net_device *dev,
                priv->can.can_stats.bus_off++;
 
                netif_carrier_off(dev);
+               priv->ttcan->tx_object = 0;
 
                if (priv->can.restart_ms)
                        schedule_delayed_work(&priv->drv_restart_work,

Changing only this is not enough. The netif_stop_queue call in mttcan_start_xmit also has to be removed.

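Which part does not work?
In my verification, the behavior is as expected once this line is added: when CAN-H/CAN-L are connected back to their normal state, the interface recovers on its own and continues sending and receiving packets, without the echo_skb error messages above.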

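Our test method is as follows:
1. Run cansend can0 123#112233 in a loop every 10 ms.
2. Short CAN_H to CAN_L, then separate them and reconnect CAN_H and CAN_L.
If only tx_object in mttcan_state_change is changed, cansend keeps reporting "write: No buffer space available" after reconnection. ip -details -statistics link show can0 shows that the CAN state has recovered from bus-off, but the interface can only receive, not send.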

How did you set up can0?

sudo ip link set can0 up type can bitrate 500000 sample-point 0.8 dbitrate 5000000 dsample-point 0.8 fd on restart-ms 1000

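It looks like the issue is the restart-ms setting. You must restore the connection within the restart-ms interval so that mttcan_bus_off_restart is triggered to clear tx_object. Please try setting restart-ms 5000, then short CAN-H/CAN-L and restore the connection within 5 s, and check whether CAN packets can be sent normally.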

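That is not the case. mttcan does not wait for CAN_H/CAN_L to be reconnected properly before leaving bus-off; it recovers from bus-off to the active state as soon as the short is removed. In our test we only touch CAN_H and CAN_L together briefly, so mttcan recovers from bus-off very quickly. The problem is that our application keeps sending while mttcan is in bus-off. Once netif_stop_queue is called in mttcan_start_xmit, the network layer can no longer hand data to the mttcan driver, and nothing restarts the network queue.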
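
The hypothesis above can be sketched as a small user-space state model. This is not kernel code and does not claim to describe what the mttcan driver actually does on restart: struct toy_netdev, toy_stack_send and toy_restart are hypothetical names, and the wake_queue flag stands in for a call such as netif_wake_queue, the kernel counterpart of netif_stop_queue.

```c
#include <stdbool.h>

/* Toy model (NOT kernel code) of a net device Tx queue gate. */
struct toy_netdev {
	bool queue_stopped; /* set by the driver, honored by the stack */
	bool hw_can_tx;     /* false while the controller is in bus-off */
	unsigned int sent;  /* frames actually handed to the hardware */
};

enum toy_tx { TOY_TX_OK, TOY_TX_BUSY, TOY_TX_NOT_CALLED };

/*
 * One send attempt from the network stack. If the queue is stopped, the
 * driver's xmit routine is never invoked. Otherwise the driver either
 * sends the frame or reports busy, optionally stopping the queue first.
 */
static enum toy_tx toy_stack_send(struct toy_netdev *d, bool stop_on_busy)
{
	if (d->queue_stopped)
		return TOY_TX_NOT_CALLED;

	if (!d->hw_can_tx) {
		if (stop_on_busy)
			d->queue_stopped = true;
		return TOY_TX_BUSY;
	}

	d->sent++;
	return TOY_TX_OK;
}

/* Recovery from bus-off; wake_queue models restarting the Tx queue. */
static void toy_restart(struct toy_netdev *d, bool wake_queue)
{
	d->hw_can_tx = true;
	if (wake_queue)
		d->queue_stopped = false;
}
```

In this model, a send attempted during bus-off with stop_on_busy set leaves the queue stopped. If the restart then does not wake the queue, every later attempt returns TOY_TX_NOT_CALLED even though the hardware is able to transmit again, which matches the "can receive but not send" symptom described in the post.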

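What I mean is that after the short you must restore the connection within restart-ms: not just unshort CAN-H/CAN-L, but also reconnect CAN-H to CAN-H and CAN-L to CAN-L.
Bus Off is caused by the CAN-H/CAN-L short.
The CAN driver triggers mttcan_bus_off_restart() restart-ms after Bus Off to clear tx_object.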

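OK, I see what you mean. You are saying that as long as the connection is restored within 5 s, netif_stop_queue will not be called, and after reconnection mttcan_start_xmit can send data normally without ever reaching the netif_stop_queue branch?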

I tested with 5 s, 10 s, and 20 s, and data still cannot be sent.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Is this still an issue that needs support? Are there any results you can share?