Sorry in advance for the long answer…what you’re asking about is actually a complicated topic and very specific to the particular situation.
With a usleep in the code (even a usleep of 0) you can’t be certain how much of the elapsed time was actually spent processing on the CUDA/GPU side versus time lost to context switching (even without a usleep there will be context switching at other times). Changing scheduling-related settings can help once you get rid of the usleep, but fundamentally you would still need to adjust your code to work without the usleep call. Which particular kernel polling rate parameter did you change?
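If the goal is to know how long the GPU work itself takes (separate from usleep and scheduler effects), one option is to time it with CUDA events rather than CPU-side clocks. This is only a minimal sketch…the kernel, sizes, and launch configuration below are placeholders, not your actual code:

```cpp
// Minimal sketch: time the GPU work with CUDA events so that usleep()/scheduler
// effects on the CPU side do not pollute the measurement.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)        // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                         // timestamp on the GPU's own timeline
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                     // wait only for the GPU work itself

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);         // GPU time, independent of CPU sleeps
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```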
About context switching and scheduling…
Every time a process is swapped out or back in, a lot has to be saved and later restored: the process owner, security context, current registers, and so on, all of which takes time. Threads are a bit more efficient because they don’t need to save and restore state that is shared among threads…for example, the thread contexts of one process all share the same process ownership and security information…so the amount of information saved and restored for a thread context switch is less than for a context switch between independent processes. If you look at something like a particle engine, microthreads or coroutines are often used instead…these are more efficient still and can “context switch” (it’s just a branch, not a real context switch) within a single thread with negligible overhead, because the author has specifically found places in the particle processing where the code can jump to a different particle without bothering to save most of the data a thread context switch would have saved/restored (this is typically done by a human in assembler, not by a compiler optimizer).
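To illustrate the idea (this is only a toy sketch in C++ rather than hand-written assembler): moving from one particle to the next is just a branch to the next array element inside one thread, so nothing has to be saved or restored the way a thread or process switch would require:

```cpp
// Toy sketch of the "switch is just a branch" idea: many particles advanced
// cooperatively inside one OS thread. Each particle carries a tiny "program
// counter" (phase); resuming a particle is just indexing into the array and
// branching on its phase -- no registers, stack, or kernel state are touched.
#include <cstdio>
#include <vector>

struct Particle {
    float x = 0.0f, v = 1.0f;
    int   phase = 0;               // per-particle resume point
};

int main()
{
    std::vector<Particle> particles(1000);

    for (int step = 0; step < 100; ++step) {
        for (Particle &p : particles) {            // "context switch" = next element
            switch (p.phase) {
            case 0: p.v *= 0.99f; p.phase = 1; break;   // apply drag
            case 1: p.x += p.v;   p.phase = 0; break;   // integrate position
            }
        }
    }
    printf("particle[0].x = %f\n", particles[0].x);
    return 0;
}
```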
CPU0, the first CPU core, is wired to deal with hardware interrupts, so drivers servicing hardware must run on CPU0. Software interrupts and user space system calls can be serviced on any core. On a desktop Intel x86 machine there is an I/O APIC which can distribute hardware interrupt servicing across any CPU core, and AMD desktop CPU architectures also allow hardware IRQ processing on any core…many embedded systems cannot do this and are stuck with CPU0 being required for hardware IRQ servicing. Starve the hardware of access to CPU0, and system hardware begins to mysteriously fail.
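On Linux you can see where hardware IRQs are actually being serviced by looking at the per-CPU counts in /proc/interrupts (reading the file from a shell works just as well…this is only a minimal sketch, and on many embedded boards you’ll see nearly all of the hardware IRQ counts piling up in the CPU0 column):

```cpp
// Minimal sketch: dump /proc/interrupts, which lists a per-CPU service count
// for each IRQ line on a Linux system.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream irq("/proc/interrupts");
    if (!irq) {
        std::cerr << "could not open /proc/interrupts\n";
        return 1;
    }
    std::string line;
    while (std::getline(irq, line))
        std::cout << line << '\n';
    return 0;
}
```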
Regarding interrupts and interrupt polling rates, the effect on user space and software-only drivers is different from the effect on a hardware driver. Each context switch (of any kind…and there will probably be more of them with an increased IRQ polling rate) has overhead. More context switches mean more time lost to context switching…but if the system still has enough time to do what is needed in spite of the overhead, then you’ve succeeded at making the software smoother and closer to real time. If you are crossing the boundary between “enough available CPU time slice regardless of context switch overhead” and “context switch overhead is eating into code which should have run but now can’t”, then the faster IRQ polling rate is harmful. A big part of whether resources are sufficient comes down to the fact that only one CPU core can handle hardware interrupts, while anything else can be handled by any core…it takes more to starve out non-hardware-IRQ servicing than to starve out hardware-IRQ servicing…and the consequences of starving non-hardware drivers are far less severe than starving hardware IRQ handlers and losing access to your hard drive or other critical parts of the system. Try the faster polling rate, but watch carefully to see whether the hardware itself continues to respond normally (e.g., something hangs on hard drive access, or networking starts lagging and losing data).
Suppose a non-hardware IRQ hits…there is a good chance at least one of the cores is at a convenient place to context switch to another thread or process…the scheduler is free to make the system feel responsive. Something may slow down, but the system will still respond “normally”. Even if the user space load goes up enough that user space programs do not respond well to interrupt inefficiencies, the system itself can continue to operate normally so long as hardware drivers get their required time slice on CPU0…disk drive controllers, ethernet drivers, and so on continue to feed or gather data for the software which uses them.
If a hardware driver is using CPU0, and servicing takes less time than the polling interval, then it’s probably a good idea to increase the IRQ polling rate…the system will become more responsive (but less power efficient; the extra polling increases electrical power requirements and heat output). Once the polling interval drops below the time slice the hardware driver needs, either the driver has to be ok with being swapped out mid-service, or the driver has to be in an atomic code section which cannot be preempted. You might be ok with the camera getting a partial frame capture (though probably not), but you definitely won’t be ok with the hard disk waiting on a driver which is in turn waiting on the hard disk. If your hardware drivers are coded correctly, and only the minimum work required is being performed in the driver (e.g., retrieve data in the driver, but process it in user space instead of kernel space), then it is probably ok if hardware drivers are sometimes in an atomic code section and refuse to give up CPU0 for a short time.
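A related precaution you can take from user space (assuming a Linux system with the GNU pthread extensions) is to pin your heavy worker threads away from CPU0 so they never compete with hardware IRQ servicing there…again, only a minimal sketch:

```cpp
// Minimal sketch (assumes Linux + glibc, compile with -pthread): pin a
// CPU-heavy worker thread to every core except CPU0 so it cannot compete
// with hardware IRQ servicing on that core.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                 // for pthread_setaffinity_np() on glibc
#endif
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <cstdio>

// Placeholder for the heavy user space processing (e.g., image consumption).
static void *worker(void *)
{
    // ... CPU-heavy processing would go here ...
    return nullptr;
}

int main()
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    pthread_t tid;
    pthread_create(&tid, nullptr, worker, nullptr);

    if (ncpus > 1) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        for (long c = 1; c < ncpus; ++c)        // every core except CPU0
            CPU_SET(static_cast<int>(c), &mask);
        if (pthread_setaffinity_np(tid, sizeof(mask), &mask) != 0)
            std::fprintf(stderr, "pthread_setaffinity_np failed\n");
    }

    pthread_join(tid, nullptr);
    return 0;
}
```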
All of this long explanation basically says that having one thread to produce data from the hardware, and a separate thread to consume and process that data, is likely the best place to start before you measure whether the GPU access itself is taking too long or whether the latency has other causes. The usleep method pretty much guarantees context switching, while two or more threads working together (producer/consumer threads) have a good chance of doing what you want, e.g., one thread doing nothing more than buffering and piping image data as it is produced, and a separate thread consuming the image data as fast as it can…no usleep would be required.
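As a starting point, here is a minimal producer/consumer sketch using a condition variable…grab_frame() and process_frame() are placeholders for whatever actually produces your image data and does the CUDA work; the consumer simply blocks until data exists, so no usleep is needed:

```cpp
// Minimal producer/consumer sketch (C++11): one thread buffers data as it is
// produced, the other consumes it as fast as it can. The consumer blocks on a
// condition variable instead of spinning or sleeping.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Frame = std::vector<unsigned char>;

static Frame grab_frame() { return Frame(640 * 480); }   // placeholder producer
static void  process_frame(const Frame &) {}             // placeholder consumer/GPU work

int main()
{
    std::queue<Frame> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread producer([&] {
        for (int i = 0; i < 100; ++i) {                  // pretend to capture 100 frames
            Frame f = grab_frame();
            {
                std::lock_guard<std::mutex> lock(m);
                q.push(std::move(f));
            }
            cv.notify_one();                             // wake the consumer
        }
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_one();
    });

    std::thread consumer([&] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !q.empty() || done; });  // block, no usleep
            if (q.empty() && done)
                break;
            Frame f = std::move(q.front());
            q.pop();
            lock.unlock();
            process_frame(f);                            // consume as fast as possible
        }
    });

    producer.join();
    consumer.join();
    return 0;
}
```

Whether a bounded queue, double buffering, or dropping stale frames makes more sense depends on your latency requirements, but the blocking wait is the key difference from the usleep approach.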