TX2 - Intermittent CAN Bus-Off After Boot

We are seeing intermittent CAN bus-off errors on our TX2 device.
The failure rate varies from about 1 in 3 boots to about 1 in 50 boots.

We are able to reproduce this issue on a partial CAN bus containing only the TX2 and one STM32F0 device connected over CAN.
The TX2 and STM32F0 each have their own CAN transceiver.
Our power-on sequence guarantees the STM32F0 will be powered on at least 5 seconds before the TX2 is powered.

Here is a snippet of some logging captured while in the bus-off state.
sys_can is an alias we set up for the can0 peripheral.

[root@SRH901160966 ~]$ ip -details -statistics link show sys_can
6: sys_can: <NO-CARRIER,NOARP,UP,ECHO> mtu 16 qdisc prio state DOWN mode DEFAULT group default qlen 10
    link/can  promiscuity 0
    can state BUS-OFF (berr-counter tx 248 rx 127) restart-ms 0
    bitrate 1000000 sample-point 0.750
    tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1
    mttcan: tseg1 2..255 tseg2 0..127 sjw 1..127 brp 1..511 brp-inc 1
    clock 40000000
    re-started bus-errors arbit-lost error-warn error-pass bus-off
    0          0          0          1          1          0
    RX: bytes  packets  errors  dropped overrun mcast
    24         3        0       2       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
[root@SRH901160966 ~]$ dmesg | grep mttcan
[    9.517664] net can0: mttcan device registered (regs=ffffff8006ea2000, irq=422)
[    9.519608] net can1: mttcan device registered (regs=ffffff8006eac000, irq=423)
[    9.537301] mttcan c320000.mttcan pld_can: renamed from can1
[    9.631234] mttcan c310000.mttcan sys_can: renamed from can0
[   10.395848] mttcan c310000.mttcan sys_can: Bitrate set
[   10.403019] mttcan c320000.mttcan pld_can: Bitrate set
[   10.526789] mttcan_controller_config: ctrlmode 0
[   10.531510] mttcan c320000.mttcan pld_can: Bitrate set
[   10.537476] mttcan_controller_config: ctrlmode 0
[   10.542850] mttcan c310000.mttcan sys_can: Bitrate set
[   10.603679] mttcan c310000.mttcan sys_can: entered error warning state
[   10.610864] mttcan c310000.mttcan sys_can: entered error passive state
[   14.102163] mttcan c310000.mttcan sys_can: entered bus off state
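
As a sanity check, the bit timing that ip reports above can be recomputed by hand. This is only a minimal sketch: the function name is ours (it is not part of the mttcan driver), and the inputs are copied from the log. It assumes a classic CAN bit made of one sync time quantum plus the three reported segments.

```python
# Recompute bitrate and sample point from the values shown by `ip -details`.
# can_bit_timing is an illustrative helper, not a driver or iproute2 API.

def can_bit_timing(tq_ns, prop_seg, phase_seg1, phase_seg2):
    """Return (bitrate in bit/s, sample point) for one CAN bit."""
    sync_seg = 1  # the sync segment is always one time quantum
    tq_per_bit = sync_seg + prop_seg + phase_seg1 + phase_seg2
    bit_time_ns = tq_per_bit * tq_ns
    bitrate = round(1_000_000_000 / bit_time_ns)
    # The sample point sits at the end of phase-seg1.
    sample_point = (sync_seg + prop_seg + phase_seg1) / tq_per_bit
    return bitrate, sample_point


# Values from the log: tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10
bitrate, sample_point = can_bit_timing(25, 14, 15, 10)
print(bitrate, sample_point)
```

With these inputs the configured timing works out to the requested 1 Mbit/s and a 0.750 sample point, so the timing parameters themselves look self-consistent.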

Once in the bus-off state, manually bringing down the sys_can interface and reloading the mttcan driver with modprobe does not help for long.
It clears the bus-off state, but the interface quickly enters bus-off again as more errors accumulate.
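
To track how often a boot ends up in bus-off, the state and error counters can be pulled out of the ip output with a small script. The following is a rough sketch, not something from our production system: parse_can_state is a hypothetical helper, and the regular expression assumes the "can state ... (berr-counter tx N rx N)" layout shown in the log above, which may differ between iproute2 versions.

```python
import re

# Matches e.g. "can state BUS-OFF (berr-counter tx 248 rx 127) restart-ms 0".
# The optional "<...>" group allows for control-mode flags printed before "state".
CAN_STATE_RE = re.compile(
    r"can (?:<[^>]*> )?state (\S+) \(berr-counter tx (\d+) rx (\d+)\)"
)


def parse_can_state(ip_output):
    """Return the CAN state and error counters, or None if not found."""
    match = CAN_STATE_RE.search(ip_output)
    if match is None:
        return None
    state, tx_err, rx_err = match.groups()
    return {"state": state, "tx_errors": int(tx_err), "rx_errors": int(rx_err)}


# Line copied from the log above.
sample = "    can state BUS-OFF (berr-counter tx 248 rx 127) restart-ms 0"
result = parse_can_state(sample)
print(result)
```

In practice the input text would come from running the same ip command shown in the log and capturing its output once per boot.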

We have some oscilloscope plots of CAN_TX/CAN_RX for both good and bad boots, triggered on CAN frame errors.

The TX2 CAN_TX/CAN_RX are CH1 (yellow) and CH2 (green) respectively.
The STM32F0 CAN_TX/CAN_RX are CH3 (blue) and CH4 (purple) respectively.

We have noticed that the TX2 incorrectly (?) drives CAN_TX low (a dominant bit, logic 0) during the data portion of the CAN frame.
This happens on both good and bad boots, but more often on bad boots.

Bad-FlyerBott-seg-1.bmp (1.4 MB)

Any suggestions to debug or fix this issue would be appreciated.
Right now, we suspect one of two things:

  1. The TX2 is not using the configured bitrate of 1 Mbit/s for some reason, possibly due to internal clocks being in a bad state?

  2. In the mttcan driver, is it possible that the receiver is enabled before the driver is done configuring the clock rates etc.?
    Will it start generating and counting error frames toward its bus-off total before it is configured?

Our system is currently based on an L4T-28 release.

Yes, there is an issue with the pllaon clock, which is used as the CAN clock parent on TX2. It has been fixed in our releases from R32.3 onward.
Can you move to the latest release?

It is difficult for us to move to R32 right now, but it is something we are planning.
In the interim, is it possible to manually patch this issue?

We have access to the L4T-32.5 source. Is there a commit hash with the clock fix that I could cherry-pick onto L4T-28?
If not, do you know the relevant parts of the source code that have changed for the fix?

Hi john,
The fix is in the MB1 binary, which is a signed binary, so you cannot just cherry-pick a commit.
It is also tricky to replace the MB1 binary in an old release with the one from the latest release. Thus, we recommend switching to the latest L4T release.

Thanks! This is good information.
I have manually flashed to L4T-32.5 and it seems to have resolved the issue.

One final question, hopefully: do you know if this is a regression introduced in a particular version of L4T?
Or has this been latent in all versions prior to L4T-32.3?

Sorry, I am not sure which version it started in, but I know it is fixed from R32.3 onward.

Some more observations:

  • On bad boots where CAN enters bus-off, the bitrate is 900 kbps as measured on a scope when set to 1 Mbps. Setting the bitrate to 1.1 Mbps and reloading mttcan seems to recover.

  • This may be a regression introduced in the R28.4 mb1_prod.bin. We are able to use the R28.1 mb1_prod.bin with the R28.4 flashing tools and it seems to resolve the issue. We can no longer reproduce the intermittent bus-off error.
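
The first observation is consistent with the CAN clock simply running slow. The arithmetic below is a sketch under assumptions that we have not verified: that the error is a pure scaling of the 40 MHz clock reported by ip, that the dividers are unchanged, and that the driver programs exactly the requested bitrate (in reality it may quantize 1.1 Mbit/s to the nearest achievable timing). The variable and function names are illustrative only.

```python
from fractions import Fraction

NOMINAL_CLOCK_HZ = 40_000_000  # "clock 40000000" from the ip output
TARGET_BPS = 1_000_000         # bitrate the rest of the bus runs at


def clock_scale(measured_bps, configured_bps):
    """Ratio of the real clock to the nominal clock, assuming pure scaling."""
    return Fraction(measured_bps, configured_bps)


# Scope measurement on a bad boot: 900 kbit/s when configured for 1 Mbit/s.
scale = clock_scale(900_000, 1_000_000)

# Clock the controller would actually be seeing under this assumption.
effective_clock_hz = int(scale * NOMINAL_CLOCK_HZ)

# Requesting 1.1 Mbit/s on the slow clock lands close to the target again.
compensated_bps = int(scale * 1_100_000)
relative_error = float(abs(compensated_bps - TARGET_BPS) / Fraction(TARGET_BPS))

print(float(scale), effective_clock_hz, compensated_bps, relative_error)
```

Under these assumptions the effective clock would be 36 MHz instead of 40 MHz, and a requested 1.1 Mbit/s would come out at about 990 kbit/s, roughly 1% off the bus bitrate, which may explain why that setting appears to recover.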

We have not seen any issues so far using 28.1 mb1_prod.bin with the 28.4 flashing tools.
To be sure though, is there any risk in what we are doing?

Hello john.chen1,

It’s okay to use the R28.1 MB1 binary for your development.
There are known boot-related issues that have workarounds; please refer to the R28.4 release notes for more details.
