I was wondering if someone could help me. We’re trying to use the TX2 for control. We’re using both of the CAN interfaces, and each is controlling 7 devices reporting messages at 100Hz.
Everything runs fine for about 60 minutes, but at some point we start missing messages. This is the output of dmesg covering the operation period:
Note: the system did not fail at the first set of error messages (timestamps around 6082), only at the second set (timestamps around 6290).
[ 6082.561654] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
[ 6082.569699] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[ 6082.574989] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575007] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575025] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575041] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575058] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575073] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575089] mttcan c310000.mttcan can0: No Tx space left
[ 6082.575105] mttcan c310000.mttcan can0: No Tx space left
[ 6082.582590] mttcan c320000.mttcan can1: No Tx space left
[ 6082.582608] mttcan c320000.mttcan can1: No Tx space left
[ 6082.630619] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
[ 6082.640166] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[ 6082.648519] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.128231] mmc1: sdhci_cmd_irq 2634 SDHCI_INT_CRC intmask: 60001 Interface clock = 204000000Hz
[ 6290.136947] sdhci: =========== REGISTER DUMP (mmc1)===========
[ 6290.142337] mttcan_start_xmit: 142 callbacks suppressed
[ 6290.142357] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142390] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142421] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142448] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142478] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142504] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142532] mttcan c320000.mttcan can1: No Tx space left
[ 6290.142558] mttcan c320000.mttcan can1: No Tx space left
[ 6290.145258] mttcan c310000.mttcan can0: No Tx space left
[ 6290.145292] mttcan c310000.mttcan can0: No Tx space left
[ 6290.201058] sdhci: Sys addr: 0x00000000 | Version: 0x00000404
[ 6290.206891] sdhci: Blk size: 0x00007080 | Blk cnt: 0x00000000
[ 6290.212723] sdhci: Argument: 0x12003e00 | Trn mode: 0x00000013
[ 6290.218555] sdhci: Present: 0x01fb0000 | Host ctl: 0x00000016
[ 6290.224385] sdhci: Power: 0x00000001 | Blk gap: 0x00000000
[ 6290.230217] sdhci: Wake-up: 0x00000000 | Clock: 0x00000007
[ 6290.236047] sdhci: Timeout: 0x0000000e | Int stat: 0x00000000
[ 6290.241880] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02fc000b
[ 6290.247709] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[ 6290.253538] sdhci: Caps: 0x3f6cd08c | Caps_1: 0x18006f73
[ 6290.259369] sdhci: Cmd: 0x0000341a | Max curr: 0x00000000
[ 6290.265198] sdhci: Host ctl2: 0x0000300b
[ 6290.269124] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x00000000fc200010
[ 6290.275646] sdhci: ===========================================
[ 6290.281780] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.289673] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.290522] sdhci-tegra 3440000.sdhci: Tuning already done, restoring the best tap value : 60
[ 6290.309753] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.317743] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.326153] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[ 6290.333868] mttcan c320000.mttcan can1: mttcan_poll_ir: some msgs lost on in Q0
What has caused this error, and how can we avoid it?
Also, can you provide details about your connection topology?
Since you said there are 7 devices connected to each of the Tegra controllers, we would like to know exactly how they are connected.
So far we have not seen any transfer errors, though our test has run for a little over two hours.
However, we are using each controller with only one device in our setup (2 nodes overall).
We would like to get these details from you:
Complete log
Connection topology and the complete connection and data transfer scenario.
So I had a whole day with the robot, and we managed to trigger the same failure. I have dmesg logs and journald logs for the entire session.
I’m attaching the dmesg logs and the journald logs (events start around 2018-02-01T19:32:52,193008+0000). The candump logs are too big to attach here, but I can email a Google Drive link if you’d like.
Also, please note that we do not have an external MMC card installed.
A bit more detail about the system: we’re running two chains of 7 devices each. In each chain, 6 devices are motor controllers and one is a robot gripper controller.
We suspect that one of the connected devices starts misbehaving after some time and becomes unreachable. This would move that node to the ERROR PASSIVE state and then to BUS-OFF.
Can you please check the node status after you get the failure dmesg log?
As per CAN protocol:
The state of a node is decided by its error counters, TEC (transmit error counter) and REC (receive error counter). If both TEC and REC are less than or equal to 127, the node is in the error active state, which is normal operation. If either counter exceeds 127 while TEC is still at most 255, the node is in the error passive state. If TEC exceeds 255, the node goes bus off.
Here, bus off means that the node whose TEC has exceeded 255 can no longer transmit or receive messages; it is automatically withdrawn from the network. (Bus off describes the state of one node, not of the whole CAN bus.) Nodes whose counters are still within limits continue to communicate normally.
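The counter thresholds described above can be sketched as a small lookup. This is only an illustration of the standard CAN fault-confinement rules; the names CanNodeState and classify_node_state are made up for this example and do not come from the mttcan driver.

```python
from enum import Enum


class CanNodeState(Enum):
    """Fault-confinement states of a single CAN node."""
    ERROR_ACTIVE = "error-active"
    ERROR_PASSIVE = "error-passive"
    BUS_OFF = "bus-off"


def classify_node_state(tec: int, rec: int) -> CanNodeState:
    """Map the transmit/receive error counters to a node state.

    tec and rec are the TEC and REC values read from the controller
    (hypothetical inputs here; how you read them depends on your tools).
    """
    if tec < 0 or rec < 0:
        raise ValueError("error counters cannot be negative")
    if tec > 255:
        # Only the transmit counter can take a node off the bus.
        return CanNodeState.BUS_OFF
    if tec > 127 or rec > 127:
        return CanNodeState.ERROR_PASSIVE
    return CanNodeState.ERROR_ACTIVE


if __name__ == "__main__":
    for tec, rec in [(0, 0), (128, 0), (0, 200), (256, 0)]:
        print(tec, rec, classify_node_state(tec, rec).value)
```

If a node really had dropped to bus off, its counters would put it in the last branch, which is what the status check requested above is meant to confirm or rule out.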
Thanks for the response. However, I don’t believe the issue is what you suggested.
All nodes on the CAN bus seem to stay in operational state, no bus errors are reported. The system starts working again if the code on the tegra is restarted (without resetting the tegra’s CAN controller).
We are still suspecting the sdhci driver. When sdhci does a register dump and clock tap tuning, do user space processes still execute? If not, how long does the sdhci driver tie up the system?
Given that the interrupt was thrown by mmc1 (the external memory card according to our reading), is it possible to bring down the mmc1 interface, or to mask its interrupts?
WiFi is connected on mmc1. You can disable the controller by setting the status to disabled for the node 3440000.sdhci in the device tree.
A quick hack (which may not work) is to not load the brcm driver:
rename the folder /lib/firmware/brcm and reboot.
Feb 11 16:45:32 tegra-ubuntu kernel: mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
This is on a very simple setup: the TX2 (with SN65HVD233MDREP transceivers) connected over a short length (20cm) of wiring to a Vector VN1610 (with appropriate termination). The bus speed is 1 Mbps. We are running CANalyzer to send out various test messages.
Everything seems fine with 1000 messages / second on one bus (at least over the timescale tested), but increasing this rate brings on the error message shown above - the higher the rate of messages sent, the sooner the error is seen, but even at 2000 messages / second we see the error within a few minutes. No error is seen on the bus, and it shows up as an overrun on the CAN0 statistics. CPU% used is low, as is the system load.
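One way to see how the overrun count grows with message rate is to read the interface counters from sysfs before and after a test run and compare them. This is a minimal sketch, assuming the overrun shown in the CAN0 statistics corresponds to the kernel's standard rx_over_errors counter under /sys/class/net; the helper names are invented for the example.

```python
from pathlib import Path

# Standard per-interface counters exposed by the kernel under
# /sys/class/net/<iface>/statistics/.
COUNTERS = (
    "rx_packets", "rx_errors", "rx_dropped", "rx_over_errors",
    "tx_packets", "tx_errors", "tx_dropped",
)


def read_counters(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Return the counters that exist for iface as {name: value}."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    counters = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():
            counters[name] = int(path.read_text().strip())
    return counters


def counter_deltas(before: dict, after: dict) -> dict:
    """Difference between two snapshots, for counters present in both."""
    return {name: after[name] - before[name] for name in after if name in before}


if __name__ == "__main__":
    import time

    first = read_counters("can0")
    time.sleep(10)  # let the test traffic run for a while
    second = read_counters("can0")
    print(counter_deltas(first, second))
```

Comparing the rx_over_errors delta against the rx_packets delta gives a rough loss rate for each send rate tried.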
Is there an expected message throughput capability on the CAN interfaces?
Thanks,
Charles
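For context on the throughput question, the rates above can be compared against what a 1 Mbps bus can physically carry. The sketch below is rough arithmetic only, assuming classic CAN frames with 11-bit identifiers and 8 data bytes (the post does not state the frame format). It gives the bus-level ceiling and says nothing about what the mttcan driver itself can sustain.

```python
def frame_bits(data_bytes: int = 8, worst_case_stuffing: bool = True) -> int:
    """Bits on the wire for one classic CAN data frame with an 11-bit ID.

    Includes the 3-bit interframe space. With worst_case_stuffing the
    maximum possible number of stuff bits is added.
    """
    if not 0 <= data_bytes <= 8:
        raise ValueError("classic CAN carries 0 to 8 data bytes")
    bits = 47 + 8 * data_bytes  # fixed overhead plus payload
    if worst_case_stuffing:
        # 34 + 8n bits are subject to stuffing, one stuff bit per 4 at worst.
        bits += (34 + 8 * data_bytes - 1) // 4
    return bits


def max_frames_per_second(bitrate: int, data_bytes: int = 8,
                          worst_case_stuffing: bool = True) -> float:
    """Upper bound on frames per second at the given bit rate."""
    return bitrate / frame_bits(data_bytes, worst_case_stuffing)


def bus_load(frames_per_second: float, bitrate: int, data_bytes: int = 8,
             worst_case_stuffing: bool = True) -> float:
    """Fraction of bus time used at the given frame rate."""
    return frames_per_second * frame_bits(data_bytes, worst_case_stuffing) / bitrate


if __name__ == "__main__":
    for rate in (1000, 2000):
        print(rate, "frames/s ->", round(bus_load(rate, 1_000_000) * 100, 1), "% load")
    print("ceiling:", int(max_frames_per_second(1_000_000)), "frames/s")
```

Under these assumptions, 2000 messages per second is well below the bus ceiling, which is consistent with no errors being seen on the bus itself while the receiver reports overruns.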
On re-reading the original comment, apologies if this is a totally different problem to the one initially brought up!
No, it’s still working, just dropping the occasional message, so I think this is a separate issue to the one you were encountering (sorry for the inadvertent thread hijack!).
I’m guessing you’re not too bothered about the dropped messages you are seeing prior to the register dump, which is the part of your comment that I’m having problems with!
I haven’t tried it yet. The system is in use (even if it requires restarting the controllers) so I was waiting for some confirmation.
I will try it as soon as I have access to the robot again.
Could I please get some more information on disabling the node in the device tree? I’ve never worked with it before. Specifically, what is it, where do I find it, and what do I need to do after making the changes? This one is new to me, so I will do some googling.
Would editing the device tree require recompiling anything on the system? If so, is there an alternative (you mentioned one)?
I will try removing the /lib/firmware/brcm folder.
Would blacklisting the kernel module (is it a module or is it built into the kernel?) work as well?
Also, after trying any of these, how can I check that the device/module is disabled properly?
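As a rough way to answer the last question, one could check whether the driver still shows up in /proc/modules and whether the wireless interface still exists. This is a sketch under assumptions: the module name bcmdhd and the interface name wlan0 are guesses for illustration and should be replaced with whatever the system actually uses. A driver built into the kernel will not be listed in /proc/modules at all.

```python
import os


def loaded_modules(proc_modules_text: str) -> set:
    """Parse the contents of /proc/modules into a set of module names."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_module_loaded(name: str, proc_modules_path: str = "/proc/modules") -> bool:
    """True if a loadable module with this name is currently loaded."""
    with open(proc_modules_path) as handle:
        text = handle.read()
    # The kernel lists module names with underscores, never dashes.
    return name.replace("-", "_") in loaded_modules(text)


def interface_present(iface: str, sysfs_root: str = "/sys/class/net") -> bool:
    """True if the kernel currently exposes a network interface by this name."""
    return os.path.isdir(os.path.join(sysfs_root, iface))


if __name__ == "__main__":
    # "bcmdhd" and "wlan0" are assumed names, not confirmed in this thread.
    print("module loaded:", is_module_loaded("bcmdhd"))
    print("interface present:", interface_present("wlan0"))
```

If both checks come back False after the change, the radio driver is at least not active; whether the mmc1 controller itself is still probed would show up separately in dmesg.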
So we need to change status = “okay” to status = “disabled” in the device tree snippet below?
Do we need to change anything in the power regulators to power the Broadcom radio completely off so it won’t draw any current?
You can decompile your binary device tree, change it there, compile it and flash it back.
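The edit step in the middle of that round trip can be sketched as follows, working on the decompiled text (the usual tool for the decompile and compile steps is dtc, for example dtc -I dtb -O dts and dtc -I dts -O dtb). This is a minimal sketch, not a full device tree parser: the function name set_node_status is invented here, and the node name sdhci@3440000 is an assumption about how the 3440000.sdhci device appears in the decompiled source.

```python
import re


def set_node_status(dts_text: str, node_name: str, status: str = "disabled") -> str:
    """Rewrite the status property of one node in decompiled DTS text.

    Only a status line directly inside the named node is changed; child
    nodes and other nodes are left alone. If the node has no status
    property, nothing is added.
    """
    node_open = re.compile(rf"(\w+:\s*)?{re.escape(node_name)}\s*\{{")
    out = []
    depth = 0            # current brace nesting depth
    target_depth = None  # depth of the target node's body while inside it
    for line in dts_text.splitlines():
        stripped = line.strip()
        if target_depth is None and node_open.match(stripped):
            target_depth = depth + 1
        elif (target_depth is not None and depth == target_depth
              and re.match(r"status\s*=", stripped)):
            line = re.sub(r'"[^"]*"', f'"{status}"', line, count=1)
        depth += line.count("{") - line.count("}")
        if target_depth is not None and depth < target_depth:
            target_depth = None  # left the target node
        out.append(line)
    result = "\n".join(out)
    return result + "\n" if dts_text.endswith("\n") else result


if __name__ == "__main__":
    example = '/ {\n\tsdhci@3440000 {\n\t\tstatus = "okay";\n\t};\n};\n'
    print(set_node_status(example, "sdhci@3440000"))
```

The result still has to be compiled back to a binary device tree and flashed, which this sketch does not cover.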
Or, if you are building your own device tree from Nvidia’s sources, I think it’s defined in the following file: