Isolate a Core on Jetson (TX2)

Using Jetson TX2:

  1. Isolated core 1 on a Jetson TX2 by editing /boot/extlinux/extlinux.conf. The content of my file is as follows:

TIMEOUT 30
DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
MENU LABEL primary kernel
LINUX /boot/Image
INITRD /boot/initrd
APPEND ${cbootargs} quiet root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 isolcpus=1

  2. Rebooted the system. The output of ‘cat /proc/cmdline’ includes ‘isolcpus=1,2’. I ran the following two lines:
    taskset -c 1 some_application
    taskset -c 2 some_application

Both commands return the error message:
taskset: failed to set pid ???'s affinity: invalid argument

Hi,
Please refer to the example in

Jetson/L4T/r32.4.x patches - eLinux.org
[TX2] Denver cores not working on TX2

I will also ask whether your power model has those cores online. Take a look at this directory listing:
/sys/devices/system/cpu/cpu*/

Each should have an “online” file within it. Make sure the cores are online. It has been a long time since I looked at this, but so far as I know, a CPU which is isolated (in this case the two Denver cores) is still accessible; the scheduler just does not put anything there by default (basically, I think that is what you are looking for).
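
As a rough sketch of that check (assuming the standard sysfs layout; on the TX2, cpu1 and cpu2 are the Denver cores):

    # Which CPUs does the kernel currently consider online/offline?
    cat /sys/devices/system/cpu/online
    cat /sys/devices/system/cpu/offline
    # Per-core check; a 0 means the core is offline, and writing 1 brings it up
    cat /sys/devices/system/cpu/cpu1/online
    echo 1 | sudo tee /sys/devices/system/cpu/cpu1/online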

Also, at least for testing, I suggest removing the “quiet” from extlinux.conf.

Thank you for the detailed and thorough response. Are there additional ways to optimize running an application on the TX2? I am conducting data collection on CPU1 at the moment. I have done the following:

  1. CPU1 is isolated and I use ‘taskset -c 1 <my_app>’ to run my application.
  2. Reduced the number of IRQs on CPU1 to a minimum of three using ‘/proc/irq/<IRQ_NUMBER>/smp_affinity’ (sketched just after this list).
  3. Installed the Jetson RT kernel and set the timer frequency to 1000 Hz.
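
For step 2, the affinity adjustment looks roughly like this (a sketch; <IRQ_NUMBER> is a placeholder, and the real numbers come from /proc/interrupts):

    # See which IRQs are currently firing on CPU1 (the per-CPU count columns)
    cat /proc/interrupts
    # Keep a given IRQ on CPU0 only (CPU bitmask 0x1)
    echo 1 | sudo tee /proc/irq/<IRQ_NUMBER>/smp_affinity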

Additional information is appreciated.

This depends on a lot of things. Note that the RT kernel also likely has different instructions than the stock kernel. It is impossible to answer without knowing a lot more details. Much depends on which hardware is involved. For example, see:
https://forums.developer.nvidia.com/t/network-packet-loss-for-xavier-nx-vs-orin-nx/253466/2

I will emphasize that many hardware components can be told to schedule on any core, but if the wiring cannot reach that core, then the scheduler will have to reschedule when told to service a hardware IRQ on some ineligible core; it will look like it ran on that core, but in reality the “run” is just the process rescheduling to CPU0. Software interrupts don’t have that problem. Some hardware can have drivers sent to another core, but there are a lot of limitations for this on Jetsons.

We know nothing about which parts of the Jetson are required for your data collection, how it is stored, and so on. An example might be that if the data comes from GPIO the answer will differ from a case where the data comes from a PCIe add-on card or Ethernet, since hardware IRQ handling might differ among them.

Incidentally, many people have an incorrect idea about what a real time kernel does. In the case of a desktop PC or a Jetson this is merely “soft” real time, not “hard” real time. The hardware itself is incapable of hard real time. Once you get past this, consider that all this does is give priority of one process over another, and if there is insufficient time, then the lower priority process starts losing CPU time. In a non-real time kernel the scheduler tries to be a bit more fair; there are still priorities, but there tends to be no way to keep some other process from eventually getting time on the CPU core.
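
As an illustration of that priority point, a soft-RT setup really just lets you ask for something like this (a sketch; the priority value and program name are placeholders):

    # Pin to the isolated core and request the SCHED_FIFO real time policy at priority 80
    sudo taskset -c 1 chrt -f 80 ./my_app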

So far as I know the default soft IRQ rate has been 1000 Hz for many years.
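
If you want to confirm the tick rate your kernel was actually built with (assuming your kernel exposes /proc/config.gz via IKCONFIG):

    zcat /proc/config.gz | grep 'CONFIG_HZ'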

Thank you for the detailed comments. I have a Nucleo board running FreeRTOS transmitting data on UART and the Jetson TX2 receiving the data on its UART. I borrowed some C/C++ code from a man page to read the data and print it to the screen using Ubuntu’s serial-port libraries. FreeRTOS on the Nucleo board uses a periodic thread (task). The TX2 is reading the same data and processing it on the Denver core (CPU1). I have the TX2’s kernel compiled and linked with the RT kernel option at 1000 Hz. I used CPU1 isolation and disabled some kernel and niceness features to optimize CPU1 performance. Not sure what additional steps could be used to optimize the code.

What speed is the UART running at? As long as it is no more than 115200, something unusual would have to occur before the Jetson would fall behind. The clock for the UART in the Jetson is slightly off, though, compared to what it should be, if we are talking about speeds above 115200. There are other reasons, unrelated to the UART itself, why speeds above that might be a bit spotty. If you use CTS/RTS flow control, then you tend to have success at higher speeds because the Jetson can halt the flow, process data, and then restart it when under load.
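
As an example of turning that flow control on from the shell (a sketch; the device node and baud rate are placeholders for whatever you actually use):

    # Raw mode, 115200 baud, hardware (CTS/RTS) flow control enabled
    stty -F /dev/ttyTHS2 115200 raw -echo crtscts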

The UART itself I think can only run on CPU0, but I could be wrong about that. The UART is available during boot prior to Linux ever loading, and only CPU0 runs during boot. There is a lot of hardware on the Jetson which cannot migrate to another core, and despite not being certain of this, I think the UART is one such device. If you were to tell it to run on CPU1 (the first of the two Denver cores), then under “/proc/interrupts” I think every invocation of that UART would increment the reschedule counter as the scheduler puts this on CPU0. If that is the case, then all that putting this on CPU1 would do is increase latency.
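
You can check this directly from the per-CPU counters in “/proc/interrupts” (a rough check; the exact name of the UART entry depends on the device tree):

    # The count columns after the IRQ number are per-CPU (CPU0, CPU1, ...)
    egrep -i 'serial|uart' /proc/interrupts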

Note that all software interrupts can migrate to any core. Those IRQs do not involve hardware for their own code (a software driver can of course call a hardware driver, e.g., a software process might indirectly need disk reads). Many drivers which work with hardware have multiple functions, but not all of those functions require hardware access. As a contrived example, imagine you have an Ethernet driver, and that there is a checksum going on. In that case it would be considered good design to separate the work into a driver for the hardware IRQ, which is atomic and locks the CPU core only for the least amount of time required to transfer network data; following that, the exit of the hardware IRQ could invoke a software IRQ to run the checksum.

In that example case most often you will find the hardware and software IRQ actually get scheduled on a single core. The scheduler is aware of the cache, along with the possibility of getting a cache hit or miss. On a true hard RT system you won’t have cache. I don’t know how the RT kernel changes things, but you can be certain that cache still gets in the way of hard real time, and that much of the scheduling is simply so that you can guarantee higher priority processes run within a certain time at the cost of a lower priority process. This does not mean latency can be predicted on a Jetson, but there is a second reason I mention this: if the hardware must run on CPU0, but there is then a chain of software interrupts, there might be an advantage to scheduling the software IRQs to go to the Denver core. Beware though that sometimes performance for this can end up not being very good because you are more or less guaranteed to get a lot of cache misses. When the hardware and software IRQs run on the same core, and the two occur in sequence, then most likely you will get a lot of cache hits.

If it turns out that you are seeing a lot of reschedules in “/proc/interrupts”, and any of this is related to the Denver core, then you might consider that part of the work is being rescheduled to CPU0 when it requires that physical wiring to CPU0. If that is the case, and if you can create a situation of two threads working together whereby the hardware access is purposely bound to CPU0 (it would go there anyway; this prevents wasted time migrating) and the other part is purposely bound to CPU1, then you probably do have an advantage. If the mechanism of working between the two cores does not cause too much cache thrashing, then your timing will be significantly improved at the cost of somewhat less “average” throughput.

Can you profile (even if it is just guessing) which parts of your program mandate hardware access (the UART) and separate those from the software processing? Then use interprocess communication, or threading with something that shares memory, to run the rest of it? This would reduce CPU0 load, and probably the biggest timing issue on Jetsons is IRQ starvation on CPU0 due to the load which must be on that specific core.
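
A minimal sketch of that split, assuming two hypothetical programs, uart_reader (does nothing but the UART I/O) and data_processor (does the heavy work), connected by a plain pipe as the interprocess communication:

    # Hardware-facing half stays where the IRQ is serviced; processing moves to the Denver core
    taskset -c 0 ./uart_reader | taskset -c 1 ./data_processor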

Incidentally, just as an experiment, watch the software IRQ handling load (ksoftirqd):
watch -n 0.5 "ps aux | egrep '(ksoftirq|CPU)' | egrep -v '(grep|watch|ps)'"

Then run your application after observing soft IRQ load. Does the soft IRQ load change when you run your application under its heaviest load? You could do something similar watching reschedules and any hardware involved in your processing. See how that changes when going from your program not running to when your software does run. Then decide if there are parts of your software which could run in a separate process, and move that software part to CPU1 while leaving the rest on CPU0.
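
For the reschedule side of that, something similar works as a crude monitor (adjust the pattern to match whatever hardware shows up in your /proc/interrupts):

    watch -n 0.5 "egrep -i '(resched|serial|uart)' /proc/interrupts"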

Thank you for the detailed feedback.

I have already utilized the J21 and J17 UARTs. Looking for a third UART that behaves like J17.

I have code to utilize these two:
J21 UART: serial@3100000: HSUART : /dev/ttyTHS1
J17 UART: serial@c280000: HSUART : /dev/ttyTHS2

The command ‘ls -l /dev/ttyTHS*’ shows a ‘ttyTHS3’. Looking for the pin layout to test ‘ttyTHS3’:
crw-rw---- 1 root 10:48 /dev/ttyTHS1
crw-rw---- 1 root 10:48 /dev/ttyTHS2
crw-rw---- 1 root 10:48 /dev/ttyTHS3

-Regards,

I don’t know the specific UARTs which are available, you’d have to check the reference docs for that. I do see something I wonder about though: The ownership of the device special files. I see “root”, but that is just owner, not group. If you run “ls -l /dev/ttyTHS*”, what do you see? Those with group “dialout” are available for your use. If the group is “tty”, then this is a serial console and already in use.

Do note that some ttyTHS* are the same hardware as some of the ttyS*. The THS* use the “Tegra High Speed” driver (which uses DMA), while the S* use the old-style legacy driver. Both can run on a single UART, but you wouldn’t want to use both drivers at the same time. I see you are only using the THS*, which is good, but do beware that users of the S* might interfere with a THS* if they work on the same physical UART.
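
A quick way to check whether anything is already sitting on a given UART (a sketch; the nvgetty service name is what recent L4T releases use for the serial console, so treat that part as an assumption for your release):

    # Is a kernel console assigned to a serial device?
    grep -o 'console=[^ ]*' /proc/cmdline
    # Is any getty holding a serial port open?
    ps aux | grep -i getty | grep -v grep
    # On L4T the serial console getty is usually this service
    systemctl status nvgetty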

How much distance do you need for cabling between the Jetson and the other device? If there is any significant length, then you might consider adding a USB serial UART which would work over longer distances and possibly be a better choice when using higher speeds above 115200.
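
If you do try a USB serial UART, finding which device node it was assigned is straightforward (assuming a common FTDI/CP210x style adapter that the kernel already has a driver for):

    # After plugging the adapter in, check the kernel log and device nodes
    dmesg | grep -i tty | tail -n 5
    ls -l /dev/ttyUSB* /dev/ttyACM*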

I see the following:
$ ls -l /dev/ttyTHS*
crw-rw---- 1 root dialout 238, 1 Jun 25 14:40 /dev/ttyTHS1
crw-rw---- 1 root dialout 238, 2 Jun 25 14:48 /dev/ttyTHS2
crw-rw---- 1 root dialout 238, 3 Jun 24 17:00 /dev/ttyTHS3

Which means ‘ttyTHS3’ is available. I am reading through the documentation to find the pin assignment for ttyTHS3. For example, the pin assignments for J21 and J17 are as follows:

The connector has the following pin assignments:
J17_1: Ground (GND) → Black Cable is (GND)
J17_2: Request To Send (RTS) → ignore
J17_3: No Connect (NC) → ignore
J17_4: Receive Data (RXD) → Green Cable is (TX)
J17_5: Transmit Data (TXD) → White cable is (RX)
J17_6: Clear To Send (CTS_L) → ignore

White cable is (RX) → Jetson TX2 J21 (Pin 8)
Green cable is (TX) → Jetson TX2 J21 (Pin 10)
Black Cable is (GND) → Jetson TX2 J21 (Pin 9)
Red Cable is (Power) → Do not connect

Can you point me to the proper documentation for ttyTHS3’s pin assignment?

Many Thanks.

Yes, that “ls -l” says you have three UARTs which you can use. I couldn’t tell you though about the pin assignments. Probably @DaneLLL can add that to the thread for the ttyTHS3 pins. It is possible though that this is routed to an m.2 connector or other location which isn’t the expected pins.

This is a new idea to resolve my issue. Not sure if this works.

Is there a procedure to adjust the UART on J21 to be like the UART on J17? I have used the UART on J21 for general debug purposes. I used a USB to TTL serial connection along with PuTTY to log in to the TX2 and used it like a terminal. It worked well. I then connected a Nucleo-64 board and sent periodic data, and it did not work.

I wired my Nucleo-64 board to the UART on J17 and was able to send periodic data back and forth. I need two UARTs on the TX2 for general data communication. J17 is already there; J21 fails at this task.

Hi,
J17 is for internal usage and you may not use it. Please check the user guide of the developer kit:

https://developer.nvidia.com/embedded/dlc/jetson_tx2_developer_kit_user_guide

The documentation indicates UART3 is on the M.2 Key E Connector. This connector is used for the following:
▪ PCIe x1 Lane, SDIO (Jetson TX1 only), USB 2.0
▪ I2S, UART, I2C, Control

Reading through the JetsonTX1_TX2 Developer Kit Carrier Board Specification documentation, it presents the UART as Pin 22 of [J18], which is an M.2 Key E connectivity socket (75-pin). I do not have M.2 Key E connector experience. What is the proper adapter to acquire to plug into the M.2 Key E connector? The picture looks a bit different.

Best Wishes,

This is probably categorized as the proverbial "bad idea"™. There may be something else already using this device. The lack of the device being group tty means it is not a serial console, and the group dialout typically means the end user can use it. However, in the case of something used with hot plug devices (PCI and other protocols going to the m.2 are typically plug-n-play/hot-pluggable), you might not see that until the actual plug-in event. I’m not really sure what would happen if you use this like that; perhaps you could, but then you’d likely need a device tree modification to disable everything else that might use that at the moment of a device plug-in. You could try it, but there might be unexpected consequences.

Note: A USB serial UART tends to be reliable, and often you can get them to work at a higher speed (reliably) versus the integrated UARTs (which need CTS/RTS flow control and shorter bursts due to a clock speed not being quite correct for those higher data rates).