MTTCAN on Orin NX issues

Hello,

We are using the CANBUS on Orin NX, where the mttcan module is loaded for its operation. This system operates seamlessly on an error-free network, and has been tested successfully on a network with another Orin NX.

However, we have encountered an issue when there are errors introduced either by a third-party device or by inadequate network cabling. These errors persist on the MTTCAN even after the CAN BUS cables have been entirely disconnected from the Orin NX.

We have observed these errors manifesting in two distinct scenarios:

  1. When the command watch -d -n 0.1 ip -d -s link show is used, it does not show any updates on statistics even when cangen can1 is running.
  2. The system detects a “Stuff Error”. This type of error is typically reported after a real issue occurs on the CAN, but it persists even when the cables are completely disconnected.

In summary, we are experiencing persistent errors on the MTTCAN that are triggered by faulty network conditions or third-party device interference. These errors continue to affect our CAN bus operation on Orin NX even after the CAN bus cables are completely disconnected. They show up as link show statistics that no longer update and as persistent “Stuff Errors”. We are seeking solutions or suggestions to rectify these issues.

EDIT1:

EDIT2:
Output of:

dtc -I fs /sys/firmware/devicetree/base 2>/dev/null | grep -C 10 mtt

  			gpios = <0x05 0x30 0x01>;
		};
	};

	spe-pmu {
		interrupts = <0x01 0x05 0x04>;
		compatible = "arm,statistical-profiling-extension-v1";
		status = "disabled";
	};

	mttcan@c310000 {
		bittimes = <0x7d 0x00 0x0f 0x13 0x03 0x00 0x03 0xfa 0x00 0x00 0xae 0x17 0x00 0x01 0xfa 0x00 0x07 0x13 0x03 0x00 0x02 0x1f4 0x00 0x03 0x13 0x03 0x00 0x03 0x3e8 0x00 0x01 0x10 0x06 0x00 0x03 0x7d0 0x00 0x00 0x10 0x06 0x00 0x02>;
		rx-config = <0x40 0x40 0x40>;
		tx-config = <0x00 0x10 0x00 0x40>;
		clock-names = "can_core\0can_host\0can\0pllaon";
		reg-names = "can-regs\0glue-regs\0msg-ram";
		resets = <0x02 0x04>;
		interrupts = <0x00 0x28 0x04>;
		clocks = <0x02 0x11c 0x02 0x0a 0x02 0x09 0x02 0x5e>;
		mram-params = <0x00 0x10 0x10 0x20 0x00 0x00 0x10 0x10 0x10>;
		compatible = "nvidia,tegra194-mttcan";
		status = "okay";
		reg = <0x00 0xc310000 0x00 0x144 0x00 0xc311000 0x00 0x32 0x00 0xc312000 0x00 0x1000>;
		phandle = <0x110>;
		reset-names = "can";
		pll_source = "pllaon";
		bitrates = <0x1f4 0x7d0>;
	};

	dce@d800000 {
		iommus = <0x06 0x08>;
--
		nvidia,sku = "699-13767-0000-300 G.3\0\0\0\0\0\0\0\0";
		linux,uefi-system-table = <0x04 0x645c0018>;
	};

	mods_test {
		compatible = "nvidia,mods_test";
		status = "disabled";
		phandle = <0x166>;
	};

	mttcan@c320000 {
		bittimes = <0x7d 0x00 0x0f 0x13 0x03 0x00 0x03 0xfa 0x00 0x00 0xae 0x17 0x00 0x01 0xfa 0x00 0x07 0x13 0x03 0x00 0x02 0x1f4 0x00 0x03 0x13 0x03 0x00 0x03 0x3e8 0x00 0x01 0x10 0x06 0x00 0x03 0x7d0 0x00 0x00 0x10 0x06 0x00 0x02>;
		rx-config = <0x40 0x40 0x40>;
		tx-config = <0x00 0x10 0x00 0x40>;
		clock-names = "can_core\0can_host\0can\0pllaon";
		reg-names = "can-regs\0glue-regs\0msg-ram";
		resets = <0x02 0x05>;
		interrupts = <0x00 0x2a 0x04>;
		clocks = <0x02 0x11d 0x02 0x0c 0x02 0x0b 0x02 0x5e>;
		mram-params = <0x00 0x10 0x10 0x20 0x00 0x00 0x10 0x10 0x10>;
		compatible = "nvidia,tegra194-mttcan";
		status = "disabled";
		reg = <0x00 0xc320000 0x00 0x144 0x00 0xc321000 0x00 0x32 0x00 0xc322000 0x00 0x1000>;
		phandle = <0x111>;
		reset-names = "can";
		pll_source = "pllaon";
		bitrates = <0x1f4 0x7d0>;
	};

	l2-cache20 {
		cache-size = <0x40000>;
--
		tegra_spkprot = "/aconnect@2a41000/ahub/spkprot@2908c00";
		tegra_asrc = "/aconnect@2a41000/ahub/asrc@2910000";
		funnel_ccplex1_out_port0 = "/funnel_ccplex1@26040000/out-ports/port@0/endpoint";
		cv2_hot_surface = "/thermal-zones/CV2-therm/trips/cv2-hot-surface";
		tegra_i2s1 = "/aconnect@2a41000/ahub/i2s@2901000";
		sdmmc1_1v8 = "/pmc@c360000/sdmmc1-1v8";
		tegra_agic_1 = "/aconnect@2a41000/agic-controller@2a51000";
		tegra_xhci_vf3 = "/xhci@3710000";
		nvdla1 = "/host1x@13e00000/nvdla1@158c0000";
		tegra_pwm7 = "/pwm@32e0000";
		mttcan1 = "/mttcan@c320000";
		pex_rst_c5_in_state = "/pinmux@2430000/pex_rst_c5_in";
		tegra_aon_gpio = "/gpio@c2f0000";
		funnel_major_in_port2 = "/funnel_major@24040000/in-ports/port@2/endpoint";
		cpu5_etm_out_port0 = "/cpu5_etm@27540000/out-ports/port/endpoint";
		cv0_hot_surface = "/thermal-zones/CV0-therm/trips/cv0-hot-surface";
		hsp_rce = "/tegra-hsp@b950000";
		p2u_nvhs_3 = "/cbb/p2u@03ec0000";
		p3767_vdd_3v3_pcie_isolate = "/fixed-regulators/regulator@12";
		tegra_dmic1 = "/aconnect@2a41000/ahub/dmic@2904000";
		pva0_ctx0n1 = "/host1x@13e00000/pva0/pva0_niso1_ctx0";
--
		spi1 = "/spi@c260000";
		p2u_gbe_0 = "/cbb/p2u@03f20000";
		funnel_ccplex1_in_port3 = "/funnel_ccplex1@26040000/in-ports/port@3/endpoint";
		dp_aux_ch0_i2c = "/i2c@31b0000";
		tegra_rce = "/rtcpu@bc00000";
		tegra_xhci_vf2 = "/xhci@36c0000";
		nvdla0 = "/host1x@13e00000/nvdla0@15880000";
		tegra_pwm6 = "/pwm@32d0000";
		host1x_ctx5n1 = "/host1x@13e00000/niso1_ctx5";
		tegra_usb_cd = "/usb_cd";
		mttcan0 = "/mttcan@c310000";
		funnel_major_in_port1 = "/funnel_major@24040000/in-ports/port@1/endpoint";
		generic_reserved = "/reserved-memory/generic_carveout";
		soc0_alert = "/soc0-throttle-alert";
		bpmp = "/bpmp";
		tegra_afc6 = "/aconnect@2a41000/ahub/afc@2907500";
		cam_i2c = "/i2c@3180000";
		cl1_3 = "/cpus/cpu@7";
		adma = "/aconnect@2a41000/adma@2930000";
		Tdiode_zone = "/thermal-zones/Tdiode_tegra";
		cpu4_etm_out_port0 = "/cpu4_etm@27440000/out-ports/port/endpoint";

EDIT3

EDIT4
Scope images:

We get good communication between two Orin NX devices.

But after connecting external devices with a very long stub, communication broke. It seems that the Orin NX devices cannot ACK the messages, as can be seen.

Now the problem: after disconnecting the long stub, i.e. returning to two Orin NX devices with two terminations, the High and Low lines stay fixed and nothing happens on the bus, even though cangen can1 -v is running (on both Orin NX devices). In this case the error counter in watch -d -n 0.1 ip -d -s link show rises continuously.

Now, when trying to reset the CAN driver by running

rmmod mttcan
modprobe mttcan

the statistics are zeroed, but the interface is now disconnected from the world,
i.e. no RX/TX message counting and no RX/TX error counting.

Hi alon2,

What’s your carrier board for Orin NX?
What’s your JetPack version in use?

What device did you connect?
Are there any messages when you connect this device?
Could you share the full serial console log from when the issue occurs?

Could it recover after you run these commands to reset the CAN driver?

Custom board

latest, 5.1.1

VBOX;
When connecting a Microchip CAN analyzer I can see the messages. Moreover, the Pico analyzer shows the message (as in the original post) but as “valid”, because the ACK bit is set. I.e. the Microchip CAN analyzer succeeds in parsing the CAN messages and ACKs the VBOX, whereas the Orin NX is not able to ACK and claims the frame has a bad format.

I’ll try later. But mainly, the messages continue with “Format Error” even when the cable is disconnected, so it is something in the MTTCAN driver/controller that stays in its state without restarting.

It is not able to recover. The controller seems to be disconnected and does not produce any transmission on the CAN bus.

My feeling is that it relates to our device tree configuration. Can you spot any errors there?

We would need the full serial console log or dmesg when you re-load can driver.

Can this issue only be reproduced with an external device?
I’m not clear about how you hit the issue.
Could you provide detailed reproduction steps on the devkit so we can verify locally?

  1. I’ll be able to provide the dmesg log/serial console log on Sunday.
  2. Yes. Our setup includes 2 Orin NX devices. They can communicate with each other, but as soon as an external/non-Jetson device joins the network, the communication breaks. The Orin NX complains about “Format Error” although, as can be seen on the scope, the messages are fine. This is what makes me think that the issue is with the internal clock definition of the mttcan.
  3. The Orin devkit does not have CAN transceivers, right? If it has, I can try to reproduce it there.

edit1 - clocks information :


/sys/kernel/debug/bpmp/debug/clk# cat pll_aon/rate
400000000

/sys/kernel/debug/bpmp/debug/clk# cat pll_aon/children
can2 can1
/sys/kernel/debug/bpmp/debug/clk# cat can1/rate
200000000
/sys/kernel/debug/bpmp/debug/clk# cat can1_core/rate
50000000
/sys/kernel/debug/bpmp/debug/clk# cat can1_host/rate
200000000
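As a sanity check on the clock suspicion, here is a minimal sketch that recomputes the nominal bit rate from the 50 MHz can1_core rate above and the bittimes rows in the device tree dump. The field layout is an assumption inferred from the numbers, not taken from NVIDIA documentation: each 7-cell row is read as bitrate in kbps, one unknown cell, then prescaler, tseg1 and tseg2 each stored as "value minus one", then two more unknown cells. The function name bit_rate is illustrative.

```shell
#!/bin/sh
# Nominal CAN bit rate from the controller clock and one bittimes row.
# ASSUMPTION: brp, tseg1 and tseg2 are stored as "value minus one";
# this layout is inferred from the numbers, not documented in the thread.
bit_rate() {
    clk_hz=$1; brp=$2; tseg1=$3; tseg2=$4
    # time quanta per bit = 1 sync quantum + (tseg1 + 1) + (tseg2 + 1)
    tq_per_bit=$(( 1 + (tseg1 + 1) + (tseg2 + 1) ))
    echo $(( clk_hz / ((brp + 1) * tq_per_bit) ))
}

CLK=50000000   # can1_core rate reported by the BPMP debugfs

# Rows copied from the bittimes property (hex converted to decimal):
# table_kbps brp tseg1 tseg2
while read -r kbps brp tseg1 tseg2; do
    rate=$(bit_rate "$CLK" "$brp" "$tseg1" "$tseg2")
    echo "table ${kbps} kbps -> computed ${rate} bit/s"
done <<EOF
125 15 19 3
250 0 174 23
250 7 19 3
500 3 19 3
1000 1 16 6
2000 0 16 6
EOF
```

Under that assumption each of the six rows reproduces the bit rate in its own first cell at 50 MHz, which would suggest that the timing table and the core clock are at least consistent with each other.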

Yes, please help to provide the serial console log to check if there’s any errors.

How do you make another Jetson join the network? Are there any steps for this that we could follow to try to reproduce it?

Yes, there’s no internal CAN transceiver in the devkit, you would need to get CAN transceivers to verify CAN transmission.

Regarding joining the network: we just plug the device's high and low lines into the network, with the same CAN settings (baud rate etc.). Tested at 250 kbps.

Could you provide the detailed steps for this?
Which I/O pin are you controlling for high/low?

HIGH & LOW come from our CAN transceiver, as the SOM (Orin NX) does not have one.

Any other ideas?

After removing mttcan from the kernel and reloading it (note that we also have an MCP2515 on board)

Is this your current issue?
Could you help to clarify how to reproduce this?

The dmesg seems as expected when you re-load the mttcan.

Hey!

To sum up the issues: We have three issues

1st: After a short time, possibly related to the number of messages per second, the CAN controller reports “Format error” although the scope shows that the message is valid. This happens when an external device communicates on the CAN network with the Orin NX via MTTCAN.
This happens to us even with an external device that is running cangen at high frequency, i.e.

cangen can1 -v -g 1

2nd:
After any Format error message on the CAN bus, the MTTCAN controller is stuck in this state. It pollutes the syslog/dmesg with the same error. It stays that way even when the external device is disconnected from the network.

3rd:

The MTTCAN controller fails to reset its state, even when all CAN modules are removed from the kernel and loaded back. Even a soft reboot does not bring back the MTTCAN controller. The scope shows nothing on the output of the controller in this state.

We tried reading the MTTCAN registers, but memdev failed; dmesg shows “unprivileged access”.

I think we should start from the first issue.
Please share the full serial console log including the “Format error” you mentioned.

Could you describe in more detail how to reproduce this?
Maybe a block diagram of your connections and the exact commands you used would help me understand.

Can you share the full serial console log including the “Format error”?
Any update to move this issue forward, or have you fixed it?

Thanks

Hey
We have not solved it yet. We think the main issue is that the controller cannot recover; only disconnecting the power solves it at the moment.
Here are a lot of logs from different runs on the Orin.
Our hypothesis: the kernel does not restart the controller correctly.

MobaXterm terminal output ORINNX - can error.zip (578.6 KB)

EDIT 1

Try to short-circuit the high and low lines for a millisecond, or short high/low to ground. In our setup the controller is not able to recover.

EDIT 2

A trickier way is:

  1. Set device A to 250 kbps and device B to 250 kbps; test that everything is working.
  2. Change device B to 500 kbps; you will get a lot of format errors.
  3. Change device B back to 250 kbps; now device B can send messages,
  4. but device A cannot send or device B cannot receive.

EDIT 3

(Suggestion: use tmux.)
On both devices:

  1. cangen can1 -g 2 -v
  2. watch -d -n 0.1 ip -d -s link show
  3. candump can1

On device B:

  1. sudo ip link set can1 down
  2. sudo ip link set can1 type can bitrate 500000 berr-reporting on restart-ms 100
  3. sudo ip link set can1 up
  4. This will break cangen and candump, so re-run (1,2,3) from the first step.
  5. This will print format errors and bit0/bit1 errors.
  6. Reset the bitrate back to 250000:
  7. sudo ip link set can1 down
  8. sudo ip link set can1 type can bitrate 250000 berr-reporting on restart-ms 100
  9. sudo ip link set can1 up
  10. This will break cangen and candump, so re-run (1,2,3) from the first step.

now verify if the two devices can send and receive
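The bitrate switching on device B can be collected into one script. This is a minimal sketch, assuming the interface is can1 and that the script is run as root on the target. The helper names run and set_bitrate, the DRY_RUN and IFACE variables, and the 10 second dwell time are illustrative and not part of iproute2 or can-utils. By default it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Reproduce the bitrate-mismatch sequence on device B.
# ASSUMPTIONS: interface name can1, run as root, 10 s spent at the wrong rate.
IFACE=${IFACE:-can1}
DRY_RUN=${DRY_RUN:-1}    # set DRY_RUN=0 to really execute the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Bring the link down, apply the bitrate with the thread's options, bring it up.
set_bitrate() {
    run ip link set "$IFACE" down
    run ip link set "$IFACE" type can bitrate "$1" berr-reporting on restart-ms 100
    run ip link set "$IFACE" up
}

set_bitrate 500000   # mismatch against device A at 250000
run sleep 10         # let format and bit0/bit1 errors accumulate
set_bitrate 250000   # back to the common bitrate
```

cangen and candump still have to be restarted by hand after each link down/up, as noted in the steps.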

Do you mean the issue occurs only when you configure it to 500 Kbps?
Please use the same bitrate for both devices.

Are you testing with 2 Orin NX?
Could you share the block diagram of your connections?

Could you test it by reloading the module and share the logs?

$ sudo rmmod mttcan
$ sudo modprobe mttcan
$ sudo ip link set can1 down
$ sudo ip link set can1 type can bitrate 500000 berr-reporting on restart-ms 100
$ sudo ip link set can1 up

Do you mean the issue occurs only when you configure it to 500 Kbps?
Please use the same bitrate for both devices.

No, I meant that if the two devices are not set to the same bitrate, then the MTTCAN controller cannot recover after the same bitrate is restored on both devices.

Are you testing with 2 Orin NX?
Could you share the block diagram of your connections?
Yes.

Could you test it by reloading the module and share the logs?
We have two cases:

  1. If the error is a bitrate mismatch, then reloading the mttcan module restores the CAN functionality.
  2. If the error comes from the physical layer and results in a Format error, then even after completely disconnecting the entire network and restoring a clean version of it, reloading does not restore the CAN.
  3. The output in the logs is the same for the two cases.


We found the thread “Orin CAN_H CAN_L short-circuit bus off and cannot recover”, which seems very similar to our problem, but there is no solution there. Do you know of any solution for it?

How do you know the error comes from the physical layer? How do you trigger this kind of error?

Is your issue about entering the error passive state?

How do you know the error comes from the physical layer? How do you trigger this kind of error?

There can be several causes: a noisy cable, a new device plugged into a working CAN network, or adding/removing termination while the bus is running.

Is your issue about entering the error passive state?

Our issue is that the MTTCAN controller fails to recover from a failure. A CAN controller passes through a few states before it cuts itself off from the network, but we set it to restart itself. Moreover, when we get the MTTCAN controller failures, the OS reports that the MTTCAN interface is up but nothing happens there. This can be seen even with a scope: the CAN output of the device is dead.
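To make the "reported up but dead" condition easier to capture in logs, a small helper can pull the controller state out of the ip -d link show output. This is a minimal sketch, assuming the usual iproute2 detail line that starts with "can" and contains "state <STATE>"; the function name can_state is illustrative.

```shell
#!/bin/sh
# Print the CAN controller state (e.g. ERROR-ACTIVE, ERROR-PASSIVE, BUS-OFF)
# found in `ip -d link show <iface>` output supplied on stdin.
# ASSUMPTION: the detail line starts with the word "can" and carries the
# controller state right after the word "state".
can_state() {
    awk '$1 == "can" {
             for (i = 2; i < NF; i++)
                 if ($i == "state") { print $(i + 1); exit }
         }'
}

# Usage on the target (not executed here):
#   ip -d link show can1 | can_state
```

If the state reads BUS-OFF while automatic restart is disabled (restart-ms 0), the kernel SocketCAN documentation describes ip link set can1 type can restart as the manual restart command. Logging the state before and after a failure may help show whether the controller ever reaches bus-off or stays stuck in another state.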

EDIT 1

We saw that JetPack 5.1.2 was released a few days ago and tested it; the problem also exists in 5.1.2.

EDIT 2

Even after modprobe -r mttcan / modprobe mttcan, sometimes only the receiving side works, and we need to reload the module again to get the sender to work as well.