No ARM Cortex-A architecture is capable of hard real time. You have to get ARM Cortex-M or Cortex-R. For anything beyond simple situations you’d need Cortex-R.
Consider why a simple microcontroller (Cortex-M) handles only a few processes. The time it takes to schedule and context switch goes up rapidly with the number of processes. The Cortex-R still has this issue, but it also has hardware to assist with that scheduling, so it scales to more complex loads.
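If you want a rough feel for that scheduling/context switch overhead on any Linux box (a Jetson included), perf can show it; a sketch, assuming perf is installed (e.g., via the linux-tools package matching your kernel):

    # Two tasks ping-pong a token through a pipe; each round trip forces
    # two context switches, so the ops/sec figure approximates switch cost.
    perf bench sched pipe

    # Count actual context switches and core migrations during a window:
    perf stat -e context-switches,cpu-migrations sleep 5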
Cache is an enemy. You won’t have this in Cortex-M or Cortex-R in any location where timing must meet some spec. Sometimes people will run supposedly real time on a desktop PC, but this is usually wrong; in the cases where this might succeed, the cache is disabled. Jetsons have cache, and would mostly suffer greatly in average speed if you disabled it (and I think not all of the Cortex-A cache of a Jetson can be disabled). Cache does help you reach the 90 µs when it is enabled…at least when there is a cache hit. A cache miss greatly hurts, because now the cache has to be refilled before it can be used.
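You can get some idea of how much your particular workload depends on cache; a sketch, assuming perf is available and the ARM PMU exposes these events ("./myapp" is a placeholder for your own program):

    # Compare cache references vs. misses for one run. A high miss ratio
    # means worst-case latency will sit far above average latency.
    perf stat -e cache-references,cache-misses ./myapp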
You won’t normally be able to get to it, but Jetsons do have an Audio Processing Engine (APE) and an Image Signal Processor (ISP), each of which is one of the ARM Cortex-R series (I think an ARM Cortex-R5 or a minor variant). These do operate in hard real time. There are ways to get to the APE if you don’t mind losing audio, but you pretty much cannot access the ISP for your own use.
The above is just about the architecture of the CPU itself. Something else you must consider is how interrupt load reaches the Jetson’s cores. A piece of hardware generates a hardware interrupt (hardware IRQ) to tell the scheduler it wants time. For hardware this involves an actual wire able to reach a CPU core. For a software interrupt (soft IRQ) there is no wire, and the scheduler can mostly do what it wants so far as choosing a CPU core. This also indirectly influences whether you get a cache hit or miss.
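You can see the two kinds of load separately on any Linux system; a quick look:

    # Hardware IRQ counts per core (wired interrupts):
    cat /proc/interrupts
    # Software IRQ counts per core (no wire; the scheduler's choice):
    cat /proc/softirqs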
On an Intel CPU desktop PC there is an IO-APIC (a programmable I/O interrupt controller) which can map any of the hardware’s IRQ wires to any CPU core. The scheduler might not spread things out just because it knows about cache hits and misses, but the option is there (and when cache is turned off it gets easier). AMD has something similar so far as being able to route hardware IRQs to different cores. Jetsons do not have this. If the hard wire does not exist to a core, then the IRQ cannot go to that core. You can tell it to do so, but when the scheduler sees this cannot be accomplished it will migrate the IRQ back to a core which can handle the interrupt. Sadly, this is only the first core (CPU0).
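You can try this yourself through the standard Linux affinity interface; a sketch, where "123" is a placeholder for an IRQ number you pick out of /proc/interrupts:

    # Show which cores IRQ 123 is currently allowed on (bitmask; 1 = CPU0):
    cat /proc/irq/123/smp_affinity
    # Ask for CPU2 only (mask 4); needs root:
    echo 4 | sudo tee /proc/irq/123/smp_affinity
    # On a Jetson, for most devices, the counts in /proc/interrupts will
    # keep growing on CPU0 anyway, because the wire only goes there.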
Take a look at “less /proc/interrupts”. This is the hardware IRQ count per core. Some stats towards the bottom might even mention an IRQ getting rescheduled (such as changing it from some non-CPU0 core to CPU0). Take particular note that every core has its own timer, but other than this, you won’t see much going to the non-CPU0 cores. When you do see something there, and it is not a timer, there is a significant chance that in the statistics at the bottom you’ll also see reschedules. A desktop PC architecture probably won’t spread IRQs out by default, but if you did so, then your process would normally run on the core you specify with the RT kernel.
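To see which cores are actually servicing interrupts over time, it helps to watch the counts change; a sketch ("eth0" is just an example device name):

    # Refresh every second; compare the per-core columns. On a Jetson you
    # will typically see almost everything incrementing only under CPU0.
    watch -n 1 'head -n 25 /proc/interrupts'
    # Or narrow it to one device, e.g., the network controller:
    watch -n 1 'grep -i eth0 /proc/interrupts'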
Some hardware can in fact go to different cores. Much of it cannot. Let’s say that your network controller requires CPU0. Also, the disk controller and a few other things require CPU0. For a Jetson these can never migrate to another core. If your specific process depends on CPU0, then you are competing with all of those other hardware IRQs. If your process is software only and does not involve a hardware IRQ, then you could conceivably dedicate your process to one of the other cores and improve the situation. Then again, even a software driver depends on data, and often the data depends on hardware, so it isn’t a guarantee.
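Dedicating a software-only process to a non-CPU0 core is straightforward; a sketch, where "./myapp" stands in for your program:

    # Pin the process to CPU3 so it never competes with CPU0's IRQ load:
    taskset -c 3 ./myapp
    # Optionally give it real-time FIFO priority as well (needs root, and
    # is only meaningful if the kernel has RT/PREEMPT support):
    sudo chrt -f 50 taskset -c 3 ./myapp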
It gets rather complicated, but if you really want to know where the timing issues are, then you probably need to start by profiling the individual application with gprof or similar. If you don’t know where the time is being spent, or delayed, then it is hard to improve. Maybe you’ll find out that most of the delays are from disk or network access, which gives a clue. It is a place to start. I don’t know of an easy way to profile the scheduler, but once you’ve removed things like unnecessary delays in the program itself you’ll at least have an idea. It is hard to overstate the value of profiling to know exactly what is taking the most time (you could then limit the profiling to what takes the most time due to hardware access and cache hit/miss to indirectly infer where latency is coming from in the hardware itself).
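The basic gprof workflow, as a sketch ("myapp.c" and "./myapp" are placeholders for your own program):

    # Build with profiling instrumentation:
    gcc -pg -O2 -o myapp myapp.c
    # Run normally; this writes gmon.out into the working directory:
    ./myapp
    # Produce the flat profile and call graph:
    gprof ./myapp gmon.out > profile.txt
    less profile.txt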
Someone from NVIDIA (e.g., @DaneLLL) might have more information on optimizing, but I don’t know if NVIDIA has an interest in the RT kernel since the reference OS does not use it. If you are serious about real time you might check one of the NVIDIA partners:
https://concurrent-rt.com/
Some people might put an inexpensive hard real time controller on each individual piece of hardware, and then use the Jetson to run the controllers as “fly by wire”. This means the controllers can have simple functions which are very deterministic and predictable, while the Jetson merely tells them which details to change.
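As a sketch of that pattern, assuming the microcontroller shows up as "/dev/ttyUSB0" and accepts a made-up line protocol (both are hypothetical; yours will differ):

    # Configure the serial link to the microcontroller (115200 8N1, raw):
    stty -F /dev/ttyUSB0 115200 raw -echo
    # The MCU runs its deterministic control loop entirely on its own; the
    # Jetson only sends occasional high-level setpoints (made-up commands):
    echo "SET_SPEED 1200" > /dev/ttyUSB0
    echo "SET_ANGLE 15" > /dev/ttyUSB0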