Maximum USB performance (interrupts, isolated CPUs ...)

I am using my TK1 board to read out a USB3 device as fast as possible. I therefore moved my readout process to an isolated CPU (#3) so that it has a core of its own. Does this setup conflict with the fact that only CPU0 can handle interrupts? Would it be better to bind this process to CPU0 only? Can CPU0 be isolated at all?

Best regards

Keep in mind that some parts of what you are doing are probably from the USB driver, while other parts do not use the driver. The driver itself for USB, when it receives a hardware interrupt, must run on CPU0…this can’t be changed. Other parts of the problem may run in a different thread and be purely a software issue without a hardware IRQ being involved. For that latter processing, which is triggered only by software IRQs, no particular core is required…you are free to let it float around cores at each software IRQ, or to tell it to stick to one core (CPU affinity).
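If you want to check on your own board where the USB hardware IRQs are actually being counted, a rough sketch like the following parses /proc/interrupts and prints the per-CPU counts for matching lines. The function name and the "xhci" keyword are assumptions for illustration only; look at your own /proc/interrupts to see what the USB3 controller line is really called.

```python
from pathlib import Path


def irq_counts_per_cpu(text, keyword):
    """Return {irq_label: [count_cpu0, count_cpu1, ...]} for lines containing keyword."""
    lines = text.splitlines()
    if not lines:
        return {}
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        if keyword.lower() not in line.lower():
            continue
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split()[:n_cpus]:
            if not field.isdigit():
                break  # reached the controller/device name columns
            counts.append(int(field))
        result[label.strip()] = counts
    return result


if __name__ == "__main__":
    path = Path("/proc/interrupts")
    if path.exists():
        # "xhci" is only a guess at the name of the USB3 controller line.
        for irq, counts in irq_counts_per_cpu(path.read_text(), "xhci").items():
            print(irq, counts)
```

If the counts only ever grow in the first column while your SDR is streaming, that matches the statement above that the hardware IRQ side stays on CPU0.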

For those parts which are software-only (e.g., user space only, or a kernel function which is not a hardware driver) the scheduler tends to decide which core to run on. There is some awareness in the scheduler of cache, and so it may not randomly switch to a different core each time something processes even with no affinity setting, which is a good thing since it takes advantage of caching (but competition with other threads means it might still migrate across cores if there is no affinity setting). You can indeed use CPU affinity and set your user space application (or non-hardware-driver kernel function) to run on just that core, and if you do, perhaps setting a higher process priority will improve things even more for that process. Provided CPU0 is not interfered with (affinity set to a core other than CPU0), and provided there are still multiple other cores for user space apps beyond your particular process, you could set a much higher priority on your user space app without making the system unstable.
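As a minimal sketch of the affinity part (Python used here just for brevity, and the helper name is made up), this pins the calling thread to one core through the Linux-only os.sched_setaffinity call. In a single-threaded program that is the whole process, and threads started afterwards inherit the setting. From a shell, taskset does the same job.

```python
import os


def pin_process_to_cpu(cpu):
    """Restrict the caller to a single CPU and return the affinity now in effect."""
    # pid 0 means "the caller". The kernel rejects CPUs that do not exist
    # or are offline, which surfaces here as an OSError.
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # On the board discussed here one would pass 3 (the isolated core).
    # For a generic demo, pick the highest CPU currently allowed instead.
    target = max(os.sched_getaffinity(0))
    print(pin_process_to_cpu(target))
```

Note that an isolated core normally does not show up in the default affinity set of a process, so the helper deliberately does not check the requested CPU against os.sched_getaffinity(0) first.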

You will still end up with cache hits and cache misses, but with affinity and higher priority the cache will tend to get its maximum benefit (other processes will not get to run as often, but they have a couple of other cores they can go to when pressure is not high, so their loss in performance may not even be noticeable). A rather important detail is that CPU0 would never be interfered with by the non-hardware-driver part of what you are doing.

The dividing line in optimizing is the minimization of how much the driver has to do upon a hardware IRQ (meaning time it uses CPU0 and excludes other hardware drivers from running), versus what can be completed elsewhere. The driver could in fact do all kinds of things for you, but this would cause the driver to require CPU0 for longer times…this would be bad and wasteful of CPU0. Your “readout process”, beyond what data the USB hardware IRQ reads to a buffer which is purely part of USB, is unlikely to need to be present in the USB driver servicing the physical USB controller.

So all of this leads to the question of the nature of what has to be done with your process, and how much of the work load is from servicing a piece of hardware, versus how much is doing other non-hardware computation. What hardware is involved in your reading? Do you have kernel features used beyond the USB driver? Does all else run in a common and ordinary software program running in user space?

My USB3 device is an SDR (software defined radio) which is used to receive HF signals. The communication is done via a library (UHD) that uses libusb-1.0. I am not entirely sure that I take advantage of all kernel features, because right now I am running a precompiled Linux kernel based on L4T 21.4 (maybe with extensions from Toradex).
The main issue is to get the data as fast as possible to avoid overflows. So my application runs its own thread (which I called a process before) on a dedicated core just to receive the data. The data is fetched by a library call. The other parts of my user space software are not really time consuming (and not time critical).

For your situation binding the application to a different core is probably a good approach if it uses a single thread (you could also experiment with running the program at higher priorities, e.g., a nice level of -1 or -2…if it has its own core, maybe even nice of -5…see “man nice” or “man renice”).
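If you would rather set the nice level from inside the program than with nice/renice, something along these lines should work. It is only a sketch and the helper name is hypothetical. Lowering the nice value normally requires root or CAP_SYS_NICE, so the helper just reports the value that ended up in effect.

```python
import os


def try_set_nice(value):
    """Try to set the caller's nice value; return the value in effect afterwards."""
    try:
        os.setpriority(os.PRIO_PROCESS, 0, value)
    except PermissionError:
        # Unprivileged users may not lower the nice value (raise priority).
        pass
    return os.getpriority(os.PRIO_PROCESS, 0)


if __name__ == "__main__":
    # -5 is the example level mentioned above; without privileges this
    # prints the unchanged nice value.
    print(try_set_nice(-5))
```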

USB drivers do their own thing anyway on CPU0. The question then is whether the non-USB code doing data acquisition/consumption can be given priority on the one other core. Perhaps there are other things going on in other threads, where those threads could be going to other CPU cores yet don’t involve data acquisition directly (e.g., GUI). Should all those threads (assuming the program has many threads) be required to use a single core, then they are essentially serialized again and not really taking advantage of the other cores available to them. So your basic approach is probably good if the design of the program does not have other weaknesses introduced by putting everything on one core.

See: “man pthread_attr_setaffinity_np”. You may want to get the source code and see if it runs the data acquisition in its own thread; if so, then you can set affinity of just that thread and still have the rest of the process distributed on other cores by the scheduler so those other non-critical threads take advantage of other cores.
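To illustrate the per-thread idea without touching C, here is a rough Python analogue (not the pthread call itself): on Linux, passing a native thread ID to os.sched_setaffinity affects only that thread, so the rest of the process stays free to be scheduled elsewhere. The run_pinned helper and the work callback are hypothetical; in your program the callback would be whatever loop calls into the receive library.

```python
import os
import threading


def run_pinned(cpu, work):
    """Run work() in a new thread pinned to `cpu`; return (thread_affinity, result)."""
    out = {}

    def body():
        # The native thread ID (Python 3.8+) selects just this thread.
        tid = threading.get_native_id()
        os.sched_setaffinity(tid, {cpu})
        out["affinity"] = os.sched_getaffinity(tid)
        out["result"] = work()

    worker = threading.Thread(target=body)
    worker.start()
    worker.join()
    return out["affinity"], out["result"]


if __name__ == "__main__":
    # Generic demo: pin to the highest CPU currently allowed. On the board
    # discussed here that would be the isolated core, i.e. 3.
    cpu = max(os.sched_getaffinity(0))
    print(run_pinned(cpu, lambda: "receive loop would run here"))
    print("main thread affinity:", os.sched_getaffinity(0))
```

The main thread's affinity is left untouched, which is the point: only the acquisition thread is tied to the dedicated core.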