CAN bus stops working after a while

When I run the CAN bus for a few hours, transmit stops working, and the statistics show no errors. I was able to reproduce the issue using canfdtest, which is part of can-utils, so this rules out our application code. I also tested with a PEAK USB-CAN X6, and the same problem occurs.

I’m using AutoChauffeur running 5.0.5.0b. I have CAN_5 looped to CAN_6 with 120 ohm termination between them.

Repro sequence

Tegra-A
sudo apt-get install can-utils
sudo ip link set can0 type can bitrate 500000
sudo ip link set up can0
canfdtest -v can0

Tegra-B
sudo apt-get install can-utils
sudo ip link set can1 type can bitrate 500000
sudo ip link set up can1
canfdtest -v -g can1 -l 999999999

After a random amount of time (I’ve seen anywhere from 10 minutes to 5 hours), the test gets stuck and the ifconfig counters stop changing. Typically the CAN interface configured as the “listener” (the one run without -g) is the one that gets stuck, and its process shows 100% CPU utilization. I can’t kill it with Ctrl-C, only with kill -9.
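
Since the stall shows up as counters that stop moving, it can be caught automatically instead of by watching ifconfig. The following is a minimal sketch, not part of can-utils: the helper names read_counters and is_stalled are hypothetical, and it assumes the standard Linux per-interface counters under /sys/class/net/<iface>/statistics.

```shell
# Hypothetical helpers (not part of can-utils): flag a stall when neither
# packet counter of an interface changes between two samples.

# Print "<rx_packets> <tx_packets>" read from a statistics directory.
# On a live system that directory is /sys/class/net/<iface>/statistics.
read_counters() {
    echo "$(cat "$1/rx_packets") $(cat "$1/tx_packets")"
}

# Return 0 (stalled) if both counters are unchanged after the interval.
# $1 = statistics directory, $2 = interval in seconds (default 10).
is_stalled() (
    before=$(read_counters "$1")
    sleep "${2:-10}"
    after=$(read_counters "$1")
    [ "$before" = "$after" ]
)

# Usage sketch: sample can0 once a minute and log when it looks stuck.
# while true; do
#     is_stalled /sys/class/net/can0/statistics 60 && echo "$(date): can0 stalled"
# done
```

Running this next to canfdtest gives a timestamp for the stall, which helps when the failure takes hours to appear.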

Dear gordon1zrra,

Could you file a bug for this topic with a detailed description? We will look into it.
Please log in to https://developer.nvidia.com/drive with your credentials, then go to MyAccount->MyBugs->Submit a new bug to file the bug.
Please share the bug ID here so we can follow up. Thanks.

OK, I created ticket #2217751. Is it common procedure to close all open tickets after a while? I had a few tickets open, but they were recently closed without any comments.

I saw another problem when running the same test: CAN messages are received out of order. I created ticket #2220832.

hello gordon1zrra,

It seems we can only reproduce the “CAN bus stops working” issue here:
the receiver side gets stuck waiting for the expected packet.

Here are the test failure messages from our side:

Tegra-A
...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Tegra-B
...Databyte 0 mismatch !
expected: 0078: [8] 2f 30 31 32 33 34 35 36
received: 0078: [8] 39 3a 3b 3c 3d 3e 3f 40
Databyte 1 mismatch !
expected: 0078: [8] 2f 30 31 32 33 34 35 36
received: 0078: [8] 39 3a 3b 3c 3d 3e 3f 40
Databyte 2 mismatch !
expected: 0078: [8] 2f 30 31 32 33 34 35 36
received: 0078: [8] 39 3a 3b 3c 3d 3e 3f 40
Databyte 3 mismatch !
expected: 0078: [8] 2f 30 31 32 33 34 35 36
received: 0078: [8] 39 3a 3b 3c 3d 3e 3f 40
...

Could you help confirm whether this is the out-of-order CAN message issue you saw?
Please also share the steps to reproduce that issue.
Thanks

Hi,
The error message “Databyte X mismatch” is the out-of-order issue. When the stuck error occurs, there are NO error messages on either terminal.
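
One way to check that a mismatch is a reordering rather than corruption is to compare the expected and received payloads byte by byte. In the log above, every received byte is 0x0a higher than the expected one. Assuming canfdtest fills the payload with an incrementing pattern that advances by one per frame (an assumption about the tool, not confirmed in this thread), a constant offset suggests a valid frame from a different position in the sequence. A minimal sketch with hypothetical helper names:

```shell
# Hypothetical helpers for reading "Databyte X mismatch" output.

# byte_offset EXPECTED_HEX RECEIVED_HEX
# Print (received - expected) modulo 256 as a decimal number.
byte_offset() {
    echo $(( (0x$2 - 0x$1 + 256) % 256 ))
}

# payload_offset "EXPECTED BYTES" "RECEIVED BYTES"
# Print the common offset if every byte is shifted by the same amount,
# otherwise print "inconsistent" and return 1.
payload_offset() (
    expected=$1 received=$2 common=
    for e in $expected; do
        # Pair this expected byte with the next received byte.
        set -- $received
        [ $# -gt 0 ] || { echo inconsistent; exit 1; }
        r=$1; shift; received=$*
        cur=$(byte_offset "$e" "$r")
        [ -z "$common" ] && common=$cur
        [ "$cur" -eq "$common" ] || { echo inconsistent; exit 1; }
    done
    # Both payloads must be non-empty and of equal length.
    [ -z "$received" ] && [ -n "$common" ] || { echo inconsistent; exit 1; }
    echo "$common"
)

# Payloads copied from the mismatch log above; prints 10.
payload_offset "2f 30 31 32 33 34 35 36" "39 3a 3b 3c 3d 3e 3f 40"
```

An "inconsistent" result would instead point at corrupted data, which is a different failure from the one described here.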

From my testing, the out-of-order issue seems to happen about twice as often as the stuck issue. canfdtest quits as soon as it detects any error, so the stuck issue is harder to catch.

I made a modification to canfdtest (Google Drive link below) that continues the test after detecting an out-of-order error, to try to reach the stuck issue.

https://drive.google.com/file/d/10eoxgCyH3t-oDKKFg8Zqj2ljlUfc-Yk4/view?usp=sharing

To run it, replace each canfdtest call with canfdtest_v2:

Tegra-A
sudo apt-get install can-utils
sudo ip link set can0 type can bitrate 500000
sudo ip link set up can0
canfdtest_v2 -v can0

Tegra-B
sudo apt-get install can-utils
sudo ip link set can1 type can bitrate 500000
sudo ip link set up can1
canfdtest_v2 -v -g can1 -l 999999999

hello gordon1zrra,

Could you please try to reproduce this issue again with the attached patch?
It changes the configuration to use only buffer mode for Tx and only FIFO mode for Rx.
Thanks

0001-p2379-mttcan-use-only-Rx-fifo-and-Tx-buffer.patch.tar.gz (495 Bytes)

Thanks for the patch. I’ve updated my board and will run over the weekend.

To verify the patch was applied, I ran

hexdump /sys/devices/c310000.mttcan/of_node/mram-params
0000000 0000 0000 0000 1000 0000 1000 0000 1000
0000010 0000 0000 0000 0000 0000 1000 0000 1000
0000020 0000 1000

and it looks like the changes took effect.
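
The default hexdump output groups bytes into 16-bit words in host byte order, which makes the values hard to read. Device tree properties are normally stored as 32-bit big-endian cells, so a sketch like the one below (decode_cells is a hypothetical helper name) prints one decimal value per cell. If that reading is right, the dump above decodes to nine cells: 0 16 16 16 0 0 16 16 16. What each cell means is defined by the mttcan device tree binding and is not covered here.

```shell
# Hypothetical helper: print each 4-byte big-endian cell of a file as a
# decimal number, one per line. Trailing bytes that do not fill a whole
# cell are ignored.
decode_cells() {
    od -An -v -tx1 "$1" | tr -d ' \n' | fold -w 8 |
    while read -r cell || [ -n "$cell" ]; do
        if [ "${#cell}" -eq 8 ]; then
            printf '%d\n' "0x$cell"
        fi
    done
}

# Usage sketch on the target:
# decode_cells /sys/devices/c310000.mttcan/of_node/mram-params
```

Comparing the decoded cells before and after applying a patch is a quick way to confirm which values actually changed.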

Hi JerryChang,

After running over the weekend (about 60 hrs), it looks like the stuck issue is fixed. I still see the out-of-order message issue, but this isn’t critical for us since our application can deal with out-of-sequence messages.

Thanks again for the help!

hello gordon1zrra,

We are planning to move to FIFO-only mode for both Tx and Rx.
Could you please do us another favor and test with the attached patch (Aug10_Topic1036894_patch.tar.gz)?
Thanks

Aug10_Topic1036894_patch.tar.gz (500 Bytes)

I’ll try it out, thanks.

Hi JerryChang,

I ran the CAN test for over 140 hrs and didn’t see any problems. The change also fixed the out-of-order issue.