CAN errors with a specific CAN frame

Hi,
We have a 1 Mbps CAN bus that has been working fine until now. Several nodes on this bus send messages without problems. However, after connecting the Nvidia system to it, we get a continuous CAN error whenever one of the nodes sends a specific CAN frame.

Here are some images showing the problem:
image
image

The images above show the frame that the Nvidia system rejects:
• ID: 0x200.
• DLC: 8 bytes
• Data: 05 00 00 FD 7F FF 7F 01

If we change the ID to 0x199 there are no errors:
image
image

In addition, if we keep the ID but change the byte just before the error bit position, there are no errors either:
image
Notice that the data bytes sent are 05 00 00 FD 7F 55 7F 01 instead of 05 00 00 FD 7F FF 7F 01.
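
To make the difference between the failing and the passing payload easier to see at the bit level, here is a minimal bash sketch that expands both data fields into bits and reports the longest run of consecutive 1 bits. It only looks at the eight data bytes quoted above. It ignores the arbitration field, the CRC and the stuff bits that the controller inserts on the wire, and nothing in this thread establishes that run length is the cause. The function names hex_to_bits and longest_run are made up for this illustration.

```shell
#!/usr/bin/env bash
# Expand a hex payload (spaces allowed) into a string of '0'/'1' characters.
hex_to_bits() {
    local hex=${1// /} bits="" i nibble
    for ((i = 0; i < ${#hex}; i++)); do
        nibble=$((16#${hex:i:1}))
        # Emit the four bits of this nibble, most significant bit first.
        bits+=$(( (nibble >> 3) & 1 ))$(( (nibble >> 2) & 1 ))$(( (nibble >> 1) & 1 ))$(( nibble & 1 ))
    done
    printf '%s\n' "$bits"
}

# Length of the longest run of the given bit ('0' or '1') in a bit string.
longest_run() {
    local bits=$1 bit=$2 best=0 cur=0 i
    for ((i = 0; i < ${#bits}; i++)); do
        if [[ ${bits:i:1} == "$bit" ]]; then
            cur=$((cur + 1))
            if ((cur > best)); then best=$cur; fi
        else
            cur=0
        fi
    done
    printf '%s\n' "$best"
}

failing="05 00 00 FD 7F FF 7F 01"   # payload that triggers the error
passing="05 00 00 FD 7F 55 7F 01"   # payload that does not
for payload in "$failing" "$passing"; do
    bits=$(hex_to_bits "$payload")
    echo "$payload -> $bits (longest run of 1s: $(longest_run "$bits" 1))"
done
```

For the two payloads above this reports a longest run of 15 one-bits for the failing frame (7F FF) and 7 for the passing one (7F 55).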

image

Here are the details about our Nvidia AGX setup:
• The DB9 CAN connector 1&2 is being used through pins 1 and 8.
• We are using the CAN socket interface, configured for a 1 Mbps bit rate.
• The Nvidia does not send any CAN data messages.

Here are the observations from our tests:

  • No matter which node sends the problematic CAN message, the result is the same.
  • The error is still present if only the problematic CAN message is sent on the bus.
  • If we disconnect the Nvidia from the CAN bus the errors disappear.
  • It does not matter which bus port the Nvidia is connected to.
  • We have tried different CAN bit timing configurations on both the Nvidia and the node that sends the message, with no results.
  • We have tried varying the ‘sjw’ parameter on the Nvidia according to this topic (CAN bus has error after connecting Jetson - Jetson & Embedded Systems / Jetson TX2 - NVIDIA Developer Forums), with no results.

Do you know why this is happening and how it can be solved?
Thanks.

Hi @jaime.santiagolopez ,

Have you seen any suspicious messages on Xavier for the specific CAN frame?
Could you help to simplify the reproducing steps? Is it possible to reproduce by sending the specific CAN frame from the host system to the target system?

Hi,

After noticing that the CAN error was caused by a specific CAN frame, we disconnected every node from the bus except the NVIDIA system, and then connected a PC with a Vector CAN tool to send the problematic CAN message from there. So the only devices on the bus were the Vector CAN tool and the NVIDIA.

By sending the CAN message from the Vector CANalyzer software we got the same result. In fact, the images from the original post were taken when doing this test.

Regarding the last question, about sending the specific CAN frame from the host system to the target system: do you mean sending it from one Nvidia CAN connector to another Nvidia CAN connector? If so, we have not tried it with this specific message yet. We will do it tomorrow.

Did you run any command on DRIVE AGX target system? Did you see any suspicious kernel messages on target system?
Please see if information in the following helps.

@jaime.santiagolopez ,

I had a similar problem; you can find more information in the posts tagged here.

I am not entirely sure which step fixed it, but these are the two things I did:

  1. Downloaded kernel source using latest SDK manager
  2. Cross compiled the kernel source and flashed the AGX board.

Please try this and see if it helps. Also, do you mind sharing how you changed the bit timing for CAN on XavierA? Thanks.

Rishit

Hello again,

@VickNV, so far we have not run any command to look for suspicious messages on Xavier. We will check the kernel log using dmesg next.

@rborad, we used the following command to change the CAN bit timing on Xavier:

ip link set canX type can tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1

The only catch is that the bit timing parameters revert to their default values when you reboot the Nvidia.
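
As a sanity check on hand-written bit timing values, the resulting bit rate and sample point can be worked out from the segments. The sketch below uses the standard CAN relation (one bit is a 1 tq sync segment plus prop-seg, phase-seg1 and phase-seg2) with the numbers copied from the command above; the variable names are only illustrative, and whether the controller can realise exactly tq 125 depends on its clock, which this thread does not state.

```shell
#!/usr/bin/env bash
# Values copied from the ip link command above; tq is given in nanoseconds.
tq_ns=125
prop_seg=6
phase_seg1=7
phase_seg2=2

# One bit = 1 tq sync segment + prop-seg + phase-seg1 + phase-seg2.
tq_per_bit=$((1 + prop_seg + phase_seg1 + phase_seg2))
bit_time_ns=$((tq_ns * tq_per_bit))
bitrate=$((1000000000 / bit_time_ns))
# The sample point sits at the end of phase-seg1 (in tenths of a percent).
sample_point_permille=$(( (1 + prop_seg + phase_seg1) * 1000 / tq_per_bit ))

echo "time quanta per bit: $tq_per_bit"
echo "bit time: ${bit_time_ns} ns"
echo "bitrate: ${bitrate} bit/s"
echo "sample point: $((sample_point_permille / 10)).$((sample_point_permille % 10)) %"
```

With the values exactly as quoted, this gives 16 time quanta per bit, a 2000 ns bit time (500 kbit/s) and a sample point of 87.5 %. A 1 Mbit/s bus needs a 1000 ns bit time, so the values used in the 1 Mbit/s tests were presumably different from the ones in this example command.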

Here is what we did in our last test session:

We have tried the following set up to simplify the reproduction of the error:
image

-Loopback connection between the CAN2 and CAN6 ports (can0 and can1 in Linux), with a short cable and termination resistors (we have tried with one 120 ohm resistor and with two 120 ohm resistors in parallel, as in the picture).
-We set both CAN buses to 1 Mbit/s: (sudo ip link set canX type can bitrate 1000000)
-candump can1 (shows the messages received on can1)
-cangen can0 (sends random messages on can0)
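
The setup steps above can be collected into a small script so the test is easy to repeat. This is only a sketch under a few assumptions: it is run as root (so no sudo inside), the interfaces are named can0 and can1 as in the list, and the function names and the log file name candump_can1.log are made up here. The interface is taken down first because the bitrate can only be changed while it is down.

```shell
#!/usr/bin/env bash
# Configure one SocketCAN interface for classic CAN at the given bitrate.
# Needs root privileges; the interface must be down while it is reconfigured.
setup_can() {
    local ifname=$1 bitrate=$2
    ip link set "$ifname" down
    ip link set "$ifname" type can bitrate "$bitrate"
    ip link set "$ifname" up
}

# Bring up both ends of the loopback, log can1, and generate traffic on can0.
run_loopback_test() {
    local bitrate=${1:-1000000}
    setup_can can0 "$bitrate"
    setup_can can1 "$bitrate"
    candump can1 > candump_can1.log &    # hypothetical log file name
    local dump_pid=$!
    cangen can0                          # runs until interrupted with Ctrl-C
    kill "$dump_pid" 2>/dev/null || true
}
```

Calling `run_loopback_test 1000000` corresponds to the failing case described here, and `run_loopback_test 500000` to the 500 kbit/s comparison.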

Results:
-After a short time sending random messages, can0 goes to BUS-OFF. The errors are not totally random: once you find a specific frame that causes the failure (like the one mentioned in the first post), sending it causes the failure almost every time, while if you continuously send a message that is okay, it “never” breaks.

-The same tests with a bus bitrate of 500000 work perfectly well.
image

Any idea on what can be the problem and how to solve it?

Thank you in advance!

Hi @jaime.santiagolopez ,

Please help to simplify the reproduction steps by modifying How to Test CAN.

According to your last post, it looks like you only changed “bitrate” from 500000 to 1000000 and changed “cansend can0 220##150” to send a specific frame (as you mentioned in your first post), right? If you can list your steps in the same form as the documentation page, we can easily discuss them with the team. Thanks.

Hi @VickNV ,

We have followed the steps of the Test CAN link you shared but with 1000000 bitrate instead of 500000 and it runs ok.

Then we tried the same test with the problematic CAN frame and it failed. Here is what we got from the dmesg command:

Thanks.

Please share your cansend command for sending the problematic CAN frame. Thanks.

Hi, this is the cansend command:
cansend can0 200#050000FD7FFF7F01

After sending 20 or 30 messages it failed.

We have also repeated the same test sending a random 8-byte message (220#5273659182563419) for an hour and it did not fail.
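
To count how many frames it takes before the failure shows up, the cansend command can be wrapped in a loop that checks the controller state after each frame. The sketch below is one way to do that, not the procedure used in this thread: can_state and send_until_error are hypothetical helper names, and it assumes the state is read from the "can state ..." line of `ip -details link show`. Because transmission is asynchronous, the reported count may lag the actual failing frame by a frame or so.

```shell
#!/usr/bin/env bash
# Print the CAN controller state (ERROR-ACTIVE, ERROR-PASSIVE, BUS-OFF, ...)
# as reported on the "can ..." line of `ip -details link show`.
can_state() {
    ip -details link show "$1" |
        awk '$1 == "can" { for (i = 1; i < NF; i++) if ($i == "state") { print $(i + 1); exit } }'
}

# Send the same frame up to max_count times, stopping as soon as the
# interface is no longer ERROR-ACTIVE. Prints how many frames were sent.
send_until_error() {
    local ifname=$1 frame=$2 max_count=$3 sent=0 state
    while ((sent < max_count)); do
        if ! cansend "$ifname" "$frame"; then
            echo "cansend failed after $sent frames"
            return 2
        fi
        sent=$((sent + 1))
        state=$(can_state "$ifname")
        if [[ $state != "ERROR-ACTIVE" ]]; then
            echo "state $state after $sent frames"
            return 1
        fi
    done
    echo "sent $sent frames without leaving ERROR-ACTIVE"
}
```

For example, `send_until_error can0 '200#050000FD7FFF7F01' 1000` would exercise the problematic frame, and `send_until_error can0 '220#5273659182563419' 1000` the frame that did not fail.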

We can reproduce the issue on our side now.
We will discuss internally and update you. Thanks.

Hi,

Do you have any update on the problem?

Our team has just started on this issue. We’ll let you know once we have any update. Thanks.