Jetson TX1 Ethernet Speed Issues - Stuck at 10Mbps

We are having a problem with the Jetson TX1 Ethernet where the interface starts at 1000Mbps at boot and the speed drops to 10Mbps after we start running some load on the Jetson.

Are there any known issues with the Jetson TX1 Ethernet or are there any debug tools that we can run to resolve this issue?

We are running ethtool and/or dmesg to catch changes in eth speeds.
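For illustration, a minimal polling loop along these lines (assuming the interface is eth0; this is just our ad-hoc check, not a standard tool) lets us timestamp the moment the speed drops:

# log the negotiated speed once per second so the drop can be correlated with the workload
while true; do
    date +%T
    ethtool eth0 | grep -E 'Speed|Duplex|Link detected'
    sleep 1
done

# in a second terminal, follow the kernel log for link messages (if your dmesg supports --follow)
dmesg -w | grep -i eth0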

Hi 0xd3v4,

What kind of load do you run? Could you paste the result from ethtool of your interface?

@WayneWWW: We have a camera that gets input from image sensors, sends it to the Jetson over PCIe, and runs CUDA-based processing on it. Here is the ethtool dump on the camera compared with the ethtool dump on the devkit. On the camera, as you can see, 1000Mbps is not even advertised. After running the CUDA workload, the eth speed drops to 10Mbps.

Also, a small correction to my previous post: I meant to say it starts at 100Mbps at boot and drops to 10Mbps.

Camera:

Settings for eth0:
	Supported ports: [ MII ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Supported pause frame use: No
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	Advertised pause frame use: Symmetric Receive-only
	Advertised auto-negotiation: Yes
	Link partner advertised link modes:  10baseT/Half 10baseT/Full 
	                                     100baseT/Half 100baseT/Full 
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: Yes
	Speed: 100Mb/s
	Duplex: Full
	Port: MII
	PHYAD: 32
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: pumbg
	Wake-on: g
	Current message level: 0x00007fff (32767)
			       drv probe link timer ifdown ifup rx_err tx_err tx_queued intr tx_done rx_status pktdata hw wol
	Link detected: yes

Dev-kit:

Settings for eth0:
	Supported ports: [ MII ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Supported pause frame use: No
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Advertised pause frame use: Symmetric Receive-only
	Advertised auto-negotiation: Yes
	Link partner advertised link modes:  10baseT/Half 10baseT/Full 
	                                     100baseT/Half 100baseT/Full 
	                                     1000baseT/Half 1000baseT/Full 
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: Yes
	Speed: 1000Mb/s
	Duplex: Full
	Port: MII
	PHYAD: 32
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: pumbg
	Wake-on: g
	Current message level: 0x00007fff (32767)
			       drv probe link timer ifdown ifup rx_err tx_err tx_queued intr tx_done rx_status pktdata hw wol
	Link detected: yes

Sorry, I still don’t quite understand the details here. You connect the camera to the devkit through PCIe, and the devkit is connected to a hub. Thus, the result of ethtool from the devkit is the bandwidth between the Jetson and the hub.

Then, what is the result of ethtool from the camera? Is it the bandwidth of PCIe? Or are you using an ethernet camera?

Sorry, no. The camera uses just the TX1 module, and a custom board we built connects to the Jetson TX1 module over PCIe. So the ethtool dump labeled “camera” is the dump from the Jetson module inside our camera.

The “devkit” dump is with the Jetson TX1 module removed from the camera, placed on the dev-kit, and running ethtool there.

Hope this makes sense.

Is there an ethernet bridge on the custom board (is it camera->ethernet->PCIe)? Is this what ethtool sees? If not, which hardware uses ethernet?

PCIe is just used for interfacing the camera sensors and is unrelated to Ethernet. The Jetson module has Ethernet pins that we have routed out to an Ethernet port.

Ok, so other than the custom board being the source of loading on the system, the ethernet is unrelated? It sounds like you discovered a side-effect of the custom board competing for time on CPU0 (this is the core where hardware IRQs must start…if one driver starves another driver on this core for time you have discovered IRQ starvation).

I’m not sure what you’d need to do, but basically it is a case of profiling the custom board drivers, and seeing what parts might be moved to a different core (separation of hardware I/O driver function from any software processing which a non-CPU0 core could handle).
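As an illustration only (the binary name and PID below are placeholders, not anything from your setup), user-space work can be kept off CPU0 with taskset; note this only moves the software processing, the hardware IRQ itself still has to be serviced on CPU0:

# launch a workload restricted to cores 1-3, leaving CPU0 free for hardware IRQ servicing
taskset -c 1-3 ./my_workload

# or move an already-running process (PID 1234 is hypothetical) off CPU0
sudo taskset -cp 1-3 1234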

I think the communication here has been a bit unlucky. The PCIe side is completely unrelated to our problem, so please forget everything that has been said about PCIe.
We designed a PCB that connects the Ethernet pins of the Jetson to an Ethernet jack with built-in magnetics. The traces on the PCB run over a solid ground plane, the differential impedance is almost exactly 100 Ohm, and the differential pairs are length matched and no longer than 3-4 cm. But for some reason we never achieve Gigabit speed, only 100Mb/s and sometimes even only 10Mb/s. I presume this is a signal integrity or noise problem, although the circuitry is so simple that this is hard to imagine.
The question is how we can solve this problem. Are there more advanced software debugging tools than ethtool that give more detail about the results of the auto-negotiation process? Maybe some information about noise or signal quality, similar to what WiFi devices can typically deliver?

I would guess that if your traces are that short and that well balanced, the impedance is well matched, and there is no hardware failure, then you will still need to profile. There is only one CPU core which can handle hardware IRQs…CPU0. Competition from other drivers for time on CPU0 means your driver cannot run (or the other driver cannot run) and speeds drop. And of course vice versa. It seems you’ve only looked for hardware reasons, but software can do what you are seeing when drivers are not allowed to run without delay.

If there were noise, then I could imagine the speed changing as the noise source changes…but those are short traces and it sounds like they are matched. Such extreme swings, and never reaching full speed, imply either that the noise is severe and never goes completely away, or that noise is not the problem. Noise as the problem seems less likely considering what you’ve said about the layout of the traces and the impedance matching.

Note that it is possible to set the Jetson to performance mode to avoid clocks throttling back; this would be one way to get CPU0 to max performance. See:
http://elinux.org/Jetson/TX1_Controlling_Performance
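One common way to do this (not necessarily the exact steps on that page; the sysfs paths below are the standard cpufreq interface, so verify them on your L4T release before relying on them):

sudo -s
# force the performance governor so CPU0 stays at its maximum clock
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# confirm the current frequency
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
exit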

I can’t tell you how, but it seems like your issue is IRQ starvation, not physical/electrical layout. You could invent something to artificially change the load on CPU0 (perhaps a file transfer to “/dev/null”) and see if things get even worse. Example:

# install htop, monitor "htop -uroot"...
sudo -s
dd if=/dev/mmcblk0 of=/dev/null bs=512
exit

Keep in mind that you can run the same dd test case multiple times simultaneously. See if networking slows down without your custom board as load goes up on CPU0, or see if the effects of your card are made worse with CPU0 artificially loaded down even more. You can look at “/proc/interrupts” and verify whether the CPU0 interrupt rate goes up overall (indicating hardware IRQ use). You can also run the load with “nice” to increase its priority (I wouldn’t use more than a -2 increase for testing). Example:

sudo -s
nice -n -2 dd if=/dev/mmcblk0 of=/dev/null bs=512
exit

You might get a feel for how ethernet speeds change if you change the priority of processes you think are related to the issue (htop can renice to higher priority like -2 or lower priority like +2 fairly easily…see the hot key menu at the bottom after you move the cursor up or down to the process you are interested in…you have to run htop itself with sudo to increase priority).
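The same renice can be done from the command line instead of from inside htop (PID 1234 is just a placeholder):

sudo renice -n -2 -p 1234   # raise priority slightly (needs root)
renice -n 2 -p 1234         # lower the priority of one of your own processes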

For reference, if you check “/proc/interrupts”, notice that some listed interrupt sources occur on any CPU, but others occur in large number only on CPU0. If you run this command it’ll list just the interrupt sources with activity on CPU0:

cat /proc/interrupts | egrep -v '[:][ ]+0 '

I’m not sure if reading eMMC is actually the best way to produce hardware interrupt load, but it is an example. Perhaps reading an SD card (“/dev/mmcblk1”) instead would be the best test, or using a dd block size of 1 instead of 512. The goal is to cause use of CPU0 while more or less leaving the other CPUs alone. Should there be networking changes following CPU0 load, then you have probably shown that the issue is not trace routing or hardware layout.
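A rough sketch of checking that (the count is arbitrary, just enough data to keep the run short):

# snapshot interrupt counts, generate some eMMC read load, snapshot again
cat /proc/interrupts > /tmp/irq_before.txt
sudo dd if=/dev/mmcblk0 of=/dev/null bs=512 count=200000
cat /proc/interrupts > /tmp/irq_after.txt
# compare the CPU0 column between the two snapshots
diff /tmp/irq_before.txt /tmp/irq_after.txt | less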

The most embarrassing mistake turned out to be mine: the Ethernet connector with the built-in magnetics has pin 6 moved two positions back, and I hadn’t noticed it.

@linuxdev: Thanks a lot for your answer; it looks like we have figured out the issue on our board. Thanks!

I suppose a missing wire might slow things down :P

You might want to post in a forum for the Shield; the Jetsons share hardware with it, but run Ubuntu instead of Android.

That said, NetworkManager (at least under Ubuntu) tends to try to manage what to do with connections when one connection is added or removed…motivated by WiFi since one may have a preferred connection of wired or wireless when both are available. My thought is that you have a NetworkManager configuration which thinks you don’t want wired when WiFi is up and it isn’t really a bug…it’s a configuration issue. Sorry, I wouldn’t know where to start, but you might check to see if your Shield has NetworkManager running, and if so, how access and configuration can be adjusted.

PS: On Jetsons I often bypass NetworkManager for wired connections and hand edit the files. Editing on Android would of course be quite different.
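As a rough illustration of what that hand editing can look like on an Ubuntu-based Jetson (the addresses are placeholders, not anything from this thread), a static wired setup in “/etc/network/interfaces” might be:

# /etc/network/interfaces (addresses are examples only)
auto eth0
iface eth0 inet static
    address 192.168.1.50
    netmask 255.255.255.0
    gateway 192.168.1.1
    dns-nameservers 8.8.8.8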

You shouldn’t have to disable wifi in the first place.

The Shield should switch to Ethernet as soon as a cable (with internet access, that is) gets plugged in.

Did you try the same cable with another device?