CAN drops messages

EvilPictureBook · January 11, 2018, 8:05pm

Hi,

I was wondering if someone could help me. We’re trying to use the TX2 for control. We’re using both of the CAN interfaces, and each is controlling 7 devices reporting messages at 100Hz.

Everything runs fine for about 60 minutes, but at some point we start missing messages. Below is the output of dmesg covering the operation period.
Note: the system did not fail at the first set of error messages (timestamps around 6082), only at the second set (timestamps around 6290):

[ 6082.561654] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
[ 6082.569699] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[ 6082.574989] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575007] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575025] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575041] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575058] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575073] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575089] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575105] mttcan c310000.mttcan can0: No Tx space left
[ 6082.582590] mttcan c320000.mttcan can1: No Tx space left
[ 6082.582608] mttcan c320000.mttcan can1: No Tx space left
[ 6082.630619] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
[ 6082.640166] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[ 6082.648519] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.128231] mmc1: sdhci_cmd_irq 2634 SDHCI_INT_CRC intmask: 60001 Interface clock = 204000000Hz
[ 6290.136947] sdhci: =========== REGISTER DUMP (mmc1)===========
[ 6290.142337] mttcan_start_xmit: 142 callbacks suppressed
[ 6290.142357] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142390] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142421] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142448] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142478] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142504] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142532] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142558] mttcan c320000.mttcan can1: No Tx space left
[ 6290.145258] mttcan c310000.mttcan can0: No Tx space left
[ 6290.145292] mttcan c310000.mttcan can0: No Tx space left
[ 6290.201058] sdhci: Sys addr: 0x00000000 | Version:  0x00000404
[ 6290.206891] sdhci: Blk size: 0x00007080 | Blk cnt:  0x00000000
[ 6290.212723] sdhci: Argument: 0x12003e00 | Trn mode: 0x00000013
[ 6290.218555] sdhci: Present:  0x01fb0000 | Host ctl: 0x00000016
[ 6290.224385] sdhci: Power:    0x00000001 | Blk gap:  0x00000000
[ 6290.230217] sdhci: Wake-up:  0x00000000 | Clock:    0x00000007
[ 6290.236047] sdhci: Timeout:  0x0000000e | Int stat: 0x00000000
[ 6290.241880] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02fc000b
[ 6290.247709] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[ 6290.253538] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[ 6290.259369] sdhci: Cmd:      0x0000341a | Max curr: 0x00000000
[ 6290.265198] sdhci: Host ctl2: 0x0000300b
[ 6290.269124] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x00000000fc200010
[ 6290.275646] sdhci: ===========================================
[ 6290.281780] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.289673] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.290522] sdhci-tegra 3440000.sdhci: Tuning already done, restoring the best tap value : 60
[ 6290.309753] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.317743] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.326153] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.333868] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0

What has caused this error, and how can we avoid it?
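To compare the two bursts in a log like the one above, it can help to tally the mttcan messages per interface. The following is only a rough sketch: the regular expression is written against the line format shown in this post, and the labels `tx_full` and `rx_lost` are made-up names, not anything the driver reports.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 6082.574989] mttcan c310000.mttcan can0: No Tx space left
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mttcan\s+\S+\s+(?P<iface>can\d+):\s+(?P<msg>.*)"
)

def summarize_mttcan(lines):
    """Count mttcan kernel messages per (interface, kind)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not an mttcan device line (e.g. the sdhci register dump)
        msg = m.group("msg")
        if "No Tx space left" in msg:
            kind = "tx_full"   # hypothetical label for a full Tx queue
        elif "some msgs lost" in msg:
            kind = "rx_lost"   # hypothetical label for lost Rx messages
        else:
            kind = "other"
        counts[(m.group("iface"), kind)] += 1
    return counts

sample = [
    "[ 6082.561654] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0",
    "[ 6082.574989] mttcan c310000.mttcan can0: No Tx space left",
    "[ 6082.575007] mttcan c310000.mttcan can0: No Tx space left",
    "[ 6290.136947] sdhci: =========== REGISTER DUMP (mmc1)===========",
]
summary = summarize_mttcan(sample)
for (iface, kind), n in sorted(summary.items()):
    print(iface, kind, n)
```

The kernel rate-limits these messages (see the "142 callbacks suppressed" line), so the counts are a lower bound rather than an exact number of events.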

spatra · January 16, 2018, 3:21am

Hi,

We are making the setup ready and trying to reproduce the issue locally.
Will keep updated regarding the information/fix.

Also please let us know what transceivers are being used at your setup. We are using Transceiver - TI SN65HVD230 at our local setup.

Thanks & Regards,
Sandipan

EvilPictureBook · January 16, 2018, 3:40pm

Hi,

Thank you for looking into this. We’re using the TI TCAN337 transceivers.

Please let me know if there’s any other information I could get you.

EvilPictureBook · January 22, 2018, 2:41pm

Hi,

I was wondering if you had any luck with your testing?

Thanks.

Bibek · January 26, 2018, 11:12pm

Hi,

Can you share the complete log?
The system does not seem to be in a good state; sdhci register dump logging also appears in between the CAN messages.

Thanks
Bibek

spatra · January 29, 2018, 8:33am

Hi,

Also, can you provide details about your connection topology?
As you said, there are 7 devices connected to each of the Tegra controllers, so we would like to know exactly how they are connected.

So far we have not seen any transfer errors, although our test has run for a little over two hours.
However, we are using each controller with only one device in our setup (2 nodes overall).

We would like to get these details from you:

Complete log
Connection topology and the complete connection and data transfer scenario
What types of devices are being used

These details will help us debug further.

Thanks & Regards,
Sandipan

EvilPictureBook · January 30, 2018, 3:15pm

Hi,

I’ll have access to that specific hardware later in the week, so I can provide more details then.

In the meantime, we think that this is an sdhci issue and not a CAN issue (as suggested by @bbasu) . . .

Thanks.

EvilPictureBook · February 1, 2018, 8:27pm

Hi,

So I had a whole day with the robot, and we managed to trigger the same failure. I have dmesg logs and journald logs for the entire session.

I’m attaching the dmesg and journald logs (events start around 2018-02-01T19:32:52,193008+0000). The candump logs are too big to attach here, but I can email a Google Drive link if you’d like.

Also, please note that we do not have an external MMC card installed.

A bit more detail about the system: we’re running two chains of 7 devices each. Six of them are motor controllers, and one is a robot gripper controller.

Motor controllers (for each robot arm)

6 x Ingenia pluto motor controller: http://ingeniamc.com/products/pluto-digital-servo-drive
→ 3 of: PLU-8/48-C
→ 3 of: PLU-5/48-C
1 of: TI TM4C129 microcontroller using the same TCAN337 transceivers.

Please let me know if there’s more info you need.

Thanks.
full_log.dmesg.txt (83.5 KB)
full_log.journal.txt (301 KB)

spatra · February 12, 2018, 9:22am

Hi,

We suspect that one of the connected devices misbehaves after some time and becomes unreachable. This would move the state of this node to ERROR PASSIVE and then to BUS-OFF.

Can you please check the node status after you get the failure dmesg log?

As per the CAN protocol:

The state of a node is decided by its error counters, TEC (transmit error counter) and REC (receive error counter). If both TEC and REC are less than or equal to 127, the node is in the error-active state, which is normal operation. If either counter exceeds 127 while TEC is still at most 255, the node is in the error-passive state. If TEC exceeds 255, the node goes bus-off.
Here, bus-off means that the node whose counter has exceeded 255 is no longer able to transmit or receive messages. This is indicated by its error counters, and the node is automatically withdrawn from the network. (Bus-off describes the state of a node, not of the CAN bus.) Nodes with legitimate counter values can continue to communicate normally.
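The state rules can be written down compactly. This is a minimal sketch of the standard CAN fault-confinement thresholds as I understand them, in which bus-off depends on TEC alone; the function name is made up for illustration and is not part of any driver API.

```python
def can_node_state(tec: int, rec: int) -> str:
    """Classify a CAN node from its transmit/receive error counters."""
    if tec > 255:
        return "bus-off"          # node stops taking part in bus traffic
    if tec > 127 or rec > 127:
        return "error-passive"    # node may still communicate, with restrictions
    return "error-active"         # normal operation

for tec, rec in [(0, 0), (127, 127), (128, 0), (0, 128), (256, 0)]:
    print(tec, rec, can_node_state(tec, rec))
```

On Linux, the current state and counters of an interface are usually visible in the output of `ip -details link show can0`, which is one way to answer the node status question above.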

Thanks & Regards,
Sandipan

EvilPictureBook · February 19, 2018, 10:47pm

Hi,

Thanks for the response. However, I don’t believe the issue is what you suggested.

All nodes on the CAN bus seem to stay in the operational state, and no bus errors are reported. The system starts working again if the code on the Tegra is restarted (without resetting the Tegra’s CAN controller).

We are still suspecting the sdhci driver. When sdhci does a register dump and clock tap tuning, do user space processes still execute? If not, how long does the sdhci driver tie up the system?
Given that the interrupt was thrown by mmc1 (the external memory card according to our reading), is it possible to bring down the mmc1 interface, or to mask its interrupts?

Thanks.

Bibek · February 20, 2018, 5:38am

Wi-Fi is connected on mmc1. You can disable the controller by setting the status to “disabled” for the node 3440000.sdhci in the device tree.
Alternatively, a quick hack (which may not work) is to prevent the brcm driver from loading:
rename the folder /lib/firmware/brcm and reboot.

Meanwhile, I will check why the dump is coming.

Charles_B · March 22, 2018, 3:02pm

Hi,

Was a solution found for this?

We are seeing a similar error message:

Feb 11 16:45:32 tegra-ubuntu kernel: mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0

This is on a very simple setup: the TX2 (with SN65HVD233MDREP transceivers) connected over a short length (20 cm) of wiring to a Vector VN1610 (with appropriate termination). The bus speed is 1 Mbps. We are running CANalyzer to send out various messages to test the setup.

Everything seems fine with 1000 messages / second on one bus (at least over the timescale tested), but increasing this rate brings on the error message shown above - the higher the rate of messages sent, the sooner the error is seen, but even at 2000 messages / second we see the error within a few minutes. No error is seen on the bus, and it shows up as an overrun on the CAN0 statistics. CPU% used is low, as is the system load.

Is there an expected message throughput capability on the CAN interfaces?
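For a rough sense of what the wire itself allows at 1 Mbps, the frame length can be estimated. This is a back-of-the-envelope sketch that assumes classic CAN frames with 11-bit identifiers and 8 data bytes (neither is stated above) and worst-case bit stuffing; the function names are made up for illustration.

```python
def frame_bits(data_bytes: int = 8, worst_case_stuffing: bool = True) -> int:
    """Bits on the wire for one classic CAN data frame with an 11-bit identifier."""
    # 44 fixed bits (SOF through EOF) + 3 bits interframe space + payload
    bits = 47 + 8 * data_bytes
    if worst_case_stuffing:
        # Stuffing applies to the 34 + 8n bits from SOF through the CRC sequence
        bits += (34 + 8 * data_bytes - 1) // 4
    return bits

def bus_load(frames_per_s: float, bitrate: int = 1_000_000, data_bytes: int = 8) -> float:
    """Fraction of the bus time occupied at the given frame rate."""
    return frames_per_s * frame_bits(data_bytes) / bitrate

print(frame_bits(8))                 # worst-case bits per 8-byte frame
print(bus_load(1000), bus_load(2000))
print(1_000_000 // frame_bits(8))    # upper bound on frames per second
```

Under these assumptions, 2000 frames per second occupies well under half of the bus, so the overruns reported here would point at the receive path on the TX2 rather than at the bus being saturated. That is an inference from the arithmetic, not a confirmed diagnosis.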

Thanks,

Charles

On re-reading the original comment, apologies if this is a totally different problem to the one initially brought up!

EvilPictureBook · March 22, 2018, 4:25pm

Are you guys also losing the ability to receive messages until the socket is reopened?
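Since reopening the socket appears to restore reception in our case, one possible stopgap is a receive loop that reopens the socket after a period of silence. This is a sketch using Python's SocketCAN support on Linux, not a fix for the underlying problem; the interface name, the timeout, and the function names are assumptions for illustration.

```python
import socket
import struct

# Layout of a classic struct can_frame: can_id, dlc, 3 pad bytes, 8 data bytes
CAN_FRAME_FMT = "=IB3x8s"
CAN_FRAME_SIZE = struct.calcsize(CAN_FRAME_FMT)

def unpack_frame(raw: bytes):
    """Return (identifier, payload) from one raw SocketCAN frame."""
    can_id, dlc, data = struct.unpack(CAN_FRAME_FMT, raw)
    return can_id & 0x1FFFFFFF, data[:dlc]  # strip the EFF/RTR/ERR flag bits

def open_can(iface: str, timeout_s: float) -> socket.socket:
    """Open a raw CAN socket bound to iface (Linux only)."""
    sock = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
    sock.settimeout(timeout_s)
    sock.bind((iface,))
    return sock

def receive_forever(iface: str = "can0", timeout_s: float = 0.5, handle=print):
    """Receive frames; reopen the socket whenever nothing arrives in time."""
    sock = open_can(iface, timeout_s)
    while True:
        try:
            raw = sock.recv(CAN_FRAME_SIZE)
        except socket.timeout:
            # With devices reporting at 100 Hz, a long silence is suspicious
            sock.close()
            sock = open_can(iface, timeout_s)
            continue
        handle(*unpack_frame(raw))
```

Calling `receive_forever("can0")` would start the loop. The timeout should be much longer than the expected gap between frames so that normal jitter does not trigger a reopen.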

Charles_B · March 22, 2018, 4:45pm

No, it’s still working, just dropping the occasional message, so I think this is a separate issue to the one you were encountering (sorry for the inadvertent thread hijack!).

I’m guessing you’re not too bothered about the dropped messages you are seeing prior to the register dump, which is the part of your comment that I’m having problems with!

EvilPictureBook · March 22, 2018, 5:16pm

The issues could be related . . . But yeah, I’m mostly concerned with the system not being able to receive at all.

Bibek · April 10, 2018, 10:49am

This error message is seen for the SDIO device, and it seems to be a known bug with the Broadcom Wi-Fi chip.
Did you try the steps mentioned to disable Wi-Fi?

EvilPictureBook · April 10, 2018, 2:25pm

I haven’t tried it yet. The system is in use (even if it requires restarting the controllers) so I was waiting for some confirmation.

I will try it as soon as I have access to the robot again.

Could I please get some more information on disabling the node in the device tree? I’ve never worked with it before. . . specifically, what is it, where do I find it, and what do I need to do after making the changes . . . . yeah . . . this one is new to me so I will do some googling . . .

Would editing the device tree require recompiling anything on the system? If so, is there an alternative (you mentioned one)?

I will try removing the /lib/firmware/brcm folder.

Would blacklisting the kernel module (is it a module or is it built into the kernel?) work as well?

Also, after trying any of these, how can I check that the device/module is disabled properly?
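One way to check the device tree side is to read the node's status property from the live tree, which Linux exposes as files under /proc/device-tree. The sketch below assumes the node appears as a directory named sdhci@3440000 directly under that root, which I have not verified on the TX2; the helper name is made up, and the demo runs against a throwaway directory instead of the real tree.

```python
import tempfile
from pathlib import Path

def dt_node_status(node_dir) -> str:
    """Return the 'status' property of a device tree node directory."""
    status_file = Path(node_dir) / "status"
    if not status_file.exists():
        return "okay"  # a node without a status property is treated as enabled
    # Property values are stored as NUL-terminated strings
    return status_file.read_bytes().rstrip(b"\x00").decode("ascii")

# Demo against a temporary directory that mimics a node under /proc/device-tree
with tempfile.TemporaryDirectory() as tmp:
    node = Path(tmp) / "sdhci@3440000"
    node.mkdir()
    missing = dt_node_status(node)                 # no status property yet
    (node / "status").write_bytes(b"disabled\x00")
    disabled = dt_node_status(node)
print(missing, disabled)
```

On the target, the call would be something like `dt_node_status("/proc/device-tree/sdhci@3440000")`, after confirming the actual path with `ls /proc/device-tree`.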

Thank you!

---- would rfkill work? ----
https://devtalk.nvidia.com/default/topic/1010244/jetson-tx2/disable-wireless-and-bluetooth-on-jetson-tx2

albertr · April 10, 2018, 3:49pm

Hi bbasu,

So we need to change status = “okay” to status = “disabled” in the device tree snippet below?
Do we need to change anything in the power regulators to power the Broadcom radio off completely so that it won’t draw any current?

-albertr

sdhci@3440000 {
                compatible = "nvidia,tegra186-sdhci";
                reg = <0x0 0x3440000 0x0 0x210>;
                interrupts = <0x0 0x40 0x4>;
                max-clk-limit = <0xc28cb00>;
                ddr-clk-limit = <0x2dc6c00>;
                tap-delay = <0xb>;
                trim-delay = <0x5>;
                nvidia,ddr-tap-delay = <0xb>;
                ddr-trim-delay = <0x5>;
                bus-width = <0x4>;
                ignore-pm-notify;
                mmc-ocr-mask = <0x0>;
                keep-power-in-suspend;
                non-removable;
                cap-mmc-highspeed;
                cap-sd-highspeed;
                pwrdet-support;
                compad-vref-3v3 = <0x1>;
                compad-vref-1v8 = <0x2>;
                uhs-mask = <0x8>;
                pll_source = "pll_p";
                resets = <0xd 0x23>;
                reset-names = "sdmmc";
                clocks = <0xd 0x4c 0xd 0x10d>;
                clock-names = "sdmmc", "pll_p";
                #stream-id-cells = <0x1>;
                pad-controllers = <0x10 0x27>;
                pad-names = "sdmmc";
                nvidia,en-periodic-calib;
                force-non-removable-rescan;
                status = "okay";
                vqmmc-supply = <0x11>;
                vmmc-supply = <0xe>;
                linux,phandle = <0x109>;
                phandle = <0x109>;
...

EvilPictureBook · April 11, 2018, 11:03pm

Where did you find that file? Did you decompile it from /proc/device-tree ?

I’m kinda really lost on how to go about doing this.

albertr · April 11, 2018, 11:47pm

You can decompile your binary device tree, change it there, compile it and flash it back.
Or, if you are building your own device tree from Nvidia’s sources, I think it’s defined in the following file:

hardware/nvidia/platform/t18x/common/kernel-dts/t18x-common-platforms/tegra186-quill-common-p3310-1000-a00.dtsi

-albertr
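The decompile, edit, recompile route can be sketched as follows. The `dtc` invocations in the comments use its standard `-I`/`-O`/`-o` options, but the file names are placeholders, and the editing helper is a naive text scan written for illustration (it does not handle braces inside strings or a node that lacks a status property).

```python
import re

# Typical round trip (file names are placeholders):
#   dtc -I dtb -O dts -o tegra.dts tegra.dtb     # decompile
#   ... edit tegra.dts, e.g. with disable_node() below ...
#   dtc -I dts -O dtb -o tegra-new.dtb tegra.dts # recompile, then flash

STATUS_RE = re.compile(r'status\s*=\s*"[^"]*"\s*;')

def disable_node(dts_text: str, node_name: str) -> str:
    """Set status = "disabled" on the first node called node_name."""
    start = dts_text.find(node_name + " {")
    if start < 0:
        raise ValueError(f"node {node_name!r} not found")
    i = dts_text.index("{", start) + 1
    depth = 1  # brace depth relative to the node body
    while i < len(dts_text) and depth > 0:
        ch = dts_text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif depth == 1 and dts_text[i - 1].isspace():
            # Only consider properties of the node itself, not of child nodes
            m = STATUS_RE.match(dts_text, i)
            if m:
                return dts_text[:i] + 'status = "disabled";' + dts_text[m.end():]
        i += 1
    raise ValueError(f"no status property in node {node_name!r}")

# Made-up sample; "other-node" is not a real TX2 node
sample = '''/ {
    sdhci@3440000 {
        bus-width = <0x4>;
        status = "okay";
    };
    other-node {
        status = "okay";
    };
};
'''
patched = disable_node(sample, "sdhci@3440000")
print(patched)
```

Only the status line inside the named node changes; everything else in the text is passed through untouched.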