CAN messages dropped

Hi,

We are seeing message loss on the CAN port with a few different configurations. Message loss is indicated by the “error” counter displayed in ifconfig. The number of error packets corresponds (1:1) with warning messages from the mttcan driver: “mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0”.

Sample ifconfig and log output:

ifconfig can0
can0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          UP RUNNING NOARP  MTU:16  Metric:1
          RX packets:263101607 errors:5606 dropped:20831 overruns:0 frame:5606
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:10
          RX bytes:2104812856 (2.1 GB)  TX bytes:32 (32.0 B)
          Interrupt:171

dmesg -w
[145515.707334] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[150837.688186] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[150837.695844] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[154987.494396] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[156084.524358] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0
[156084.532039] mttcan c310000.mttcan can0: mttcan_poll_ir: some msgs lost on in Q0

The different configurations include:

  • 2 devices connected to 1 CAN port, each device sending ~1400 messages/sec
  • 1 device connected to 1 CAN port sending ~1400 messages/sec, plus 1 USB camera streaming 640x380 color and depth frames at 15 frames/sec

The messages are small, so we are not hitting the 1 Mbps limit.

Is there a maximum message rate per port that the CAN controller can handle?
Section 35.4 of the NVIDIA Parker Series SoC TRM states that the CAN controller can “Sustain average interrupt of around 2000 messages/s on average of 500 μs/message, with a peak of 125 μs/message (8000 messages/s @ 1Mbps)”. Is this correct? If so, I’d like to confirm whether this maximum message rate is per port. Can you also provide insight into what is capping the maximum message rate and whether there’s a way to increase it?

One more related question: are all hardware interrupts serviced by CPU0? I’m seeing an interrupt per CAN message (from /proc/interrupts). This translates to 1400 interrupts/sec for each radar, which puts additional load on CPU0. When I enable streaming on more USB cameras, I see CPU0 intermittently spike to 75% utilization servicing kernel processes in htop.

You might want to test in performance mode:

sudo nvpmodel -m 0
sudo ~nvidia/jetson_clocks.sh

Yes, most of the external I/O hardware must go through CPU0. If you “cat /proc/interrupts” you’ll see the current distribution of where hardware IRQs are serviced. In the case of multiple controllers of a given type, the right-hand column names each one by its base address; for i2c, for example, you might find “3160000.i2c”. If you browse this column and find a specific base address or description, you can “watch” it during operation and see the IRQ counts go up. An example for seeing all i2c:

watch -n 1 "grep i2c /proc/interrupts"
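
The same idea works for the CAN controllers (assuming the interfaces show up as can0/can1, as in your ifconfig output):

watch -n 1 "grep can /proc/interrupts"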

@linuxdev - thanks for the info regarding the interrupts. I confirmed that performance mode is enabled and do see the CAN-related IRQ in /proc/interrupts. The number of interrupts from the CAN bus closely matches the number of messages reported in ifconfig.

➜  nvidia@MYTX2 ~ sudo nvpmodel -q
NV Power Mode: MAXN
0

➜  nvidia@MYTX2 ~ sudo ./jetson_clocks.sh --show
SOC family:tegra186  Machine:quill
Online CPUs: 0-5
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=345600
cpu1: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=345600
cpu2: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=345600
cpu3: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=345600
cpu4: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=345600
cpu5: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=345600
GPU MinFreq=140250000 MaxFreq=1300500000 CurrentFreq=140250000
EMC MinFreq=40800000 MaxFreq=1866000000 CurrentFreq=665600000 FreqOverride=0
Fan: speed=0

➜  nvidia@MYTX2 ~ ifconfig can0; grep can /proc/interrupts; sleep 10; ifconfig can0; grep can /proc/interrupts; 
can0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          UP RUNNING NOARP  MTU:16  Metric:1
          RX packets:2268800 errors:1133 dropped:2122671 overruns:0 frame:1133
          TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:10 
          RX bytes:18150400 (18.1 MB)  TX bytes:8 (8.0 B)
          Interrupt:171 

427:      27754    2232944          0          0          0          0     GICv2  72 Level     can0
428:          0          0          0          0          0          0     GICv2  74 Level     can1


can0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          UP RUNNING NOARP  MTU:16  Metric:1
          RX packets:2283372 errors:1133 dropped:2122671 overruns:0 frame:1133
          TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:10 
          RX bytes:18266976 (18.2 MB)  TX bytes:8 (8.0 B)
          Interrupt:171 

427:      27754    2247500          0          0          0          0     GICv2  72 Level     can0
428:          0          0          0          0          0          0     GICv2  74 Level     can1

RX packets over 10sec is 14572.
CAN0 interrupts over 10sec is 14556.

An interrupt per message seems like a lot of overhead, and I’m curious whether there are any improvements that could be made to the mttcan driver.

I’m still looking for insight on the CAN controller performance.

I don’t know about CAN performance tweaks. IP protocols have MTU/MRU adjustments, but aggregation would increase latency. Perhaps CAN has some control over the equivalent of MRU/MTU that you could tweak to accommodate a larger amount of data per transfer. Beware that if you do this, a packet may not send until a second packet is ready or a timeout occurs.

Btw, I think showing more IRQs than packets sent might be a demonstration of more than one packet being available at the moment the bus actually sends, but I’m just guessing (it’d be a case of two packets being ready before the IRQ is serviced…the above would be a case of not sending even if there is an IRQ unless enough data is ready to send).

I do see “errors:1133”. Perhaps that is due to not having it connected to something at the other end acknowledging…don’t know.

Just noticed: Your end only sent one packet, all the other packets were received, so it tends to mean something else was on the bus.

Hi, the TX2 has dual CAN bus controllers, and 1 Mbit/s is the limit of each controller; it is also the limit defined by ISO 11898. The achievable speed depends very much on the length of the bus and the components on the bus.

@Trumany, I verified that we are well under the 1Mbps limit. I have 1 device on CAN0 which sends messages using the base frame format (not extended / CAN_FD).

The device sends ~1400 messages/sec with an 8-byte data payload. This translates to a data rate (not including CAN ID + length) of ~92 kbps. I’ve included stats below to confirm the rate.

bits per sec calculation: (20823768 - 20132632) * 8 / 60 = 92151 bps
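
As a rough upper bound including frame overhead (a standard frame with an 8-byte payload is on the order of 111 bits before bit stuffing):

1400 messages/sec * 111 bits/message ≈ 155 kbps, still well under 1 Mbps even allowing for worst-case stuffing.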

➜  nvidia@MYTX2 ~ ifconfig can0; sleep 60; ifconfig can0
can0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          UP RUNNING NOARP  MTU:16  Metric:1
          RX packets:2516579 errors:749 dropped:400090 overruns:0 frame:749
          TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:10
          RX bytes:20132632 (20.1 MB)  TX bytes:8 (8.0 B)
          Interrupt:171

can0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          UP RUNNING NOARP  MTU:16  Metric:1
          RX packets:2602971 errors:779 dropped:400090 overruns:0 frame:779
          TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:10
          RX bytes:20823768 (20.8 MB)  TX bytes:8 (8.0 B)
          Interrupt:171

To add to my original post, I only see CAN messages lost when I enable other devices (i.e. USB cameras). If I only have the CAN device enabled, I do not see any messages lost.

From the block diagram in section 35.1 of the Parker SoC technical reference manual, I see there’s an 8K RAM that the messages are dumped into. The “msgs lost” warnings come from the mttcan driver’s rx_handler code, which makes me think we’re unable to pull messages out of the RX FIFO (in that 8K RAM) fast enough when I have the USB cameras enabled.

Each controller has a separate message RAM (4K) to store incoming/outgoing messages, TX message timestamps, and the filters to be applied to incoming messages.
Hence the statement “Sustain average interrupt of around 2000 messages/s on average of 500 μs/message, with a peak of 125 μs/message (8000 messages/s @ 1Mbps)” applies to each of the two controllers.

Regarding hardware interrupts being serviced by CPU0 alone: the load can be reduced by setting CPU affinity to distribute interrupts across the cores for load balancing.

Cable length might also be a factor in throughput. Did you try different cables?

Can you try moving the CAN interrupt to another CPU and then run the test?
For example:
echo 2 > /proc/irq/<irq_number>/smp_affinity
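
To pin the can0 interrupt specifically, using the IRQ number from the /proc/interrupts output earlier in the thread (427 on that system; verify on yours, since the number can differ):

grep can0 /proc/interrupts                      # confirm the IRQ number for can0
echo 2 | sudo tee /proc/irq/427/smp_affinity    # mask 0x2 = CPU1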

@Trumany / bbasu - The suggestion on setting the CPU affinity is helpful. When I pin CAN0 to a different CPU than the USB interrupts, I don’t see lost CAN messages anymore. Thanks!

While this appears to mitigate the issue, I want a better understanding of the data flow and how the CAN messages are serviced.
I’m missing how the “sustain average interrupt of ~2k messages” figure is derived given that the controller has a 4K RAM. Can you provide more details on when/how messages are moved by the Rx handler block into the 4K RAM? Is there a way to see how full that memory is?

Hi,

The answer below should help with understanding “when/how messages are moved by the Rx handler block into the 4K RAM”.

All functions concerning the handling of messages are implemented by the Rx Handler and the Tx Handler.
In your case, you are mainly interested in the functionality of the Rx Handler.

The Rx Handler manages:
• Message acceptance filtering
• Transfer of received messages from the CAN Core to the Message RAM
• Providing receive message status information.

M_TTCAN offers the possibility to configure two sets of acceptance filters, one for standard identifiers and one for extended identifiers. These filters can be assigned to an Rx Buffer or to Rx FIFO 0,1. Acceptance filtering stops at the first matching element.

Depending on the configuration of the filter element a match triggers one of the following actions:
• Store received frame in FIFO 0 or FIFO 1
• Store received frame in Rx Buffer
• Store received frame in Rx Buffer and generate pulse at filter event pin
• Reject received frame
• Set High Priority Message interrupt flag
• Set High Priority Message interrupt flag and store received frame in FIFO 0 or FIFO 1

Acceptance filtering is started after the complete identifier has been received. After acceptance filtering has completed, and if a matching Rx Buffer or Rx FIFO has been found, the Message Handler starts writing the received message data in portions of 32 bit to the matching Rx Buffer or Rx FIFO.
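
For a quick userspace view of the controller state and the RX error/overrun counters, standard iproute2 can be used:

ip -details -statistics link show can0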

Thanks & Regards,
Sandipan

Hi Sandipan,

Thanks for the detailed response.
Is the “Rx Buffer” mentioned in your response the same as the 4K RAM? If so, does that mean that when the driver is configured to store received frames in the FIFOs, the RAM is unused? And if I switch to using the 4K RAM, would the message handling rate per controller still be 2000 messages/sec?

Thanks again!
-lisa

Hi Lisa,

The Message RAM (4K in size) connected to the M_TTCAN module is configured for storage of Rx/Tx messages and for storage of the filter configuration.
The Rx Buffers and FIFO 0/1 you are asking about are all sections of that Message RAM.

A few details about the Message RAM:

  1. The Message RAM has a width of 32 bits.
  2. The M_TTCAN module can be configured to allocate up to 4480 words in the Message RAM.
  3. Sections of the Message RAM:
     • 11-bit Filter (0-128 elements / 0-128 words)
     • 29-bit Filter (0-64 elements / 0-128 words)
     • Rx FIFO 0 (0-64 elements / 0-1152 words)
     • Rx FIFO 1 (0-64 elements / 0-1152 words)
     • Rx Buffer (0-64 elements / 0-1152 words)
     • Tx Event FIFO (0-32 elements / 0-64 words)
     • Tx Buffers (0-32 elements / 0-576 words)
     • Trigger Memory (0-64 elements / 0-128 words)
  4. It is not necessary to configure each of the sections listed above, nor is there any restriction with respect to the sequence of the sections.
  5. When operated in CAN FD mode, the required Message RAM size strongly depends on the element size configured for Rx FIFO 0, Rx FIFO 1, Rx Buffers, and Tx Buffers.
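
If you want to see how the mttcan node is described on your particular board, you can inspect the live device tree (the exact node path and property names depend on the L4T release):

find /proc/device-tree -type d -name '*mttcan*'   # locate the mttcan node(s)
# then list the properties of the node the command prints, e.g.
# ls /proc/device-tree/<node_path_from_above>/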

Thanks & Regards,
Sandipan