We are seeing some intermittent CAN bus-off issues on our TX2 device.
It occurs anywhere from 1 in 3 boots to 1 in 50 boots.
We are able to reproduce this issue on a partial CAN bus containing only the TX2 and one STM32F0 device connected over CAN.
The TX2 and STM32F0 each have their own CAN transceiver.
Our power-on sequence guarantees the STM32F0 will be powered on at least 5 seconds before the TX2 is powered.
Here is a snippet of some logging captured while in the bus-off state.
sys_can is an alias we set up for the can0 peripheral.
[root@SRH901160966 ~]$ ip -details -statistics link show sys_can
6: sys_can: <NO-CARRIER,NOARP,UP,ECHO> mtu 16 qdisc prio state DOWN mode DEFAULT group default qlen 10
link/can promiscuity 0
can state BUS-OFF (berr-counter tx 248 rx 127) restart-ms 0
bitrate 1000000 sample-point 0.750
tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
mttcan: tseg1 2..255 tseg2 0..127 sjw 1..127 brp 1..511 brp-inc 1
clock 40000000
re-started bus-errors arbit-lost error-warn error-pass bus-off
0 0 0 1 1 0
RX: bytes packets errors dropped overrun mcast
24 3 0 2 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
[root@SRH901160966 ~]$ dmesg | grep mttcan
[ 9.517664] net can0: mttcan device registered (regs=ffffff8006ea2000, irq=422)
[ 9.519608] net can1: mttcan device registered (regs=ffffff8006eac000, irq=423)
[ 9.537301] mttcan c320000.mttcan pld_can: renamed from can1
[ 9.631234] mttcan c310000.mttcan sys_can: renamed from can0
[ 10.395848] mttcan c310000.mttcan sys_can: Bitrate set
[ 10.403019] mttcan c320000.mttcan pld_can: Bitrate set
[ 10.526789] mttcan_controller_config: ctrlmode 0
[ 10.531510] mttcan c320000.mttcan pld_can: Bitrate set
[ 10.537476] mttcan_controller_config: ctrlmode 0
[ 10.542850] mttcan c310000.mttcan sys_can: Bitrate set
[ 10.603679] mttcan c310000.mttcan sys_can: entered error warning state
[ 10.610864] mttcan c310000.mttcan sys_can: entered error passive state
[ 14.102163] mttcan c310000.mttcan sys_can: entered bus off state
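Because the failure only shows up on some boots, a small helper along these lines could record the controller state on every boot. This is only a sketch: `parse_can_state` is a hypothetical name, and it assumes the driver reports the state and berr-counter in the same format as the output above.

```shell
#!/bin/sh
# Print "<state> <tx-errors> <rx-errors>" from `ip -details link show` output on stdin.
# Assumption: the CAN state line looks like the one in the capture above.
parse_can_state() {
    sed -n -E 's/.*state ([A-Z-]+) \(berr-counter tx ([0-9]+) rx ([0-9]+)\).*/\1 \2 \3/p'
}

# Example using the state line from the capture above (no hardware needed).
sample='can state BUS-OFF (berr-counter tx 248 rx 127) restart-ms 0'
parsed=$(printf '%s\n' "$sample" | parse_can_state)
echo "$parsed"
```

On the device itself this would be fed from `ip -details link show sys_can | parse_can_state`, for example from a boot-time script that appends the result to a log.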
Once in the bus-off state, manually bringing down the sys_can interface and reloading the mttcan driver with modprobe does not help much.
It clears the bus-off error, but the interface quickly re-enters the bus-off state due to further errors.
We have some oscilloscope plots of CAN_TX/CAN_RX for both good and bad boots, triggered on CAN frame errors.
The TX2 CAN_TX/CAN_RX are CH1 (yellow) and CH2 (green) respectively.
The STM32F0 CAN_TX/CAN_RX are CH3 (blue) and CH4 (purple) respectively.
We have noticed that the TX2 incorrectly (?) asserts CAN_TX low (bit 0) during the data portion of the CAN frame.
This happens on both good and bad boots but happens more on bad boots.
Bad-FlyerBott-seg-1.bmp (1.4 MB)
Any suggestions to debug or fix this issue would be appreciated.
Right now, we suspect one of two things:
- The TX2 is not using the configured bitrate of 1 Mbit for some reason, possibly due to internal clocks being in a bad state?
- In the mttcan driver, is it possible that the receiver is enabled before the driver is done configuring the clock rates etc.? Will it start generating and counting error frames toward its bus-off total before it is configured?
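As a sanity check on the first suspicion, the bit-timing values reported by `ip -details` can be cross-checked with simple arithmetic. The sketch below only uses numbers copied from the output above. It shows that the configured values are self-consistent, but it cannot show whether the hardware clock is really running at the reported 40 MHz.

```shell
#!/bin/sh
# Values copied from the `ip -details` output in this post.
clock_hz=40000000
tq_ns=25
prop_seg=14
phase_seg1=15
phase_seg2=10
sync_seg=1   # the sync segment is always 1 time quantum in CAN bit timing

# Time quanta per bit and the resulting bit time and bitrate.
tq_per_bit=$((sync_seg + prop_seg + phase_seg1 + phase_seg2))
bit_time_ns=$((tq_per_bit * tq_ns))
bitrate=$((1000000000 / bit_time_ns))

# Sample point in permille: quanta before phase-seg2 divided by quanta per bit.
sample_point_permille=$(((sync_seg + prop_seg + phase_seg1) * 1000 / tq_per_bit))

# Prescaler implied by the reported clock and time quantum.
brp=$((clock_hz / 1000000 * tq_ns / 1000))

echo "tq/bit=$tq_per_bit bitrate=$bitrate sample-point=$sample_point_permille/1000 brp=$brp"
```

The results match the reported bitrate of 1000000 and sample-point of 0.750, so if the bitrate on the wire is wrong, the clock feeding the controller would be the more likely culprit than the timing parameters themselves.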
Our system is currently based on an L4T-28 release.