Jetson TK1: long latency for image transfer to device through OpenCV GPU


I’m working on a project with a camera on the Jetson TK1.

I’m trying to demosaic the raw image on the GPU, since it is faster than the CPU.

With the OpenCV GPU module, the debayer result is great ( approximately 0.5ms on the GPU compared to 1ms on the CPU, for a 640x512 Bayer image ).

But there is random latency while uploading to and downloading from the device.
The latency leaves the GPU method with no benefit compared to the CPU.

I have already maximized the performance of the TK1, and tried the CudaMem class of OpenCV ( with both ALLOC_ZEROCOPY and ALLOC_PAGE_LOCKED ).

The overall process ( gpuImage.upload → debayer → download ) has long latency.
It’s about 1.8ms on average, and sometimes it spikes to 10ms!!!

With the CUDA bandwidth test, I get 6000MB/s host to device and 6000MB/s device to host ( pinned memory ), so it should not be a problem with the hardware?

Where does the latency come from? Is there a better configuration?
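For reference, the path I am timing looks roughly like this ( a minimal sketch against the OpenCV 2.4 gpu module shipped with OpenCV4Tegra; the Bayer pattern and conversion code here are placeholders, adjust them for your sensor ):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    // Page-locked (pinned) host buffer so upload/download can use DMA.
    cv::gpu::CudaMem pinned( 512, 640, CV_8UC1, cv::gpu::CudaMem::ALLOC_PAGE_LOCKED );
    cv::Mat host = pinned.createMatHeader();   // CPU-side view of the pinned buffer

    // ... fill `host` with the raw Bayer frame from the camera ...

    cv::gpu::GpuMat d_raw, d_bgr;
    d_raw.upload( host );                      // host -> device
    cv::gpu::demosaicing( d_raw, d_bgr, CV_BayerBG2BGR );  // debayer on the GPU
    cv::Mat result;
    d_bgr.download( result );                  // device -> host
    return 0;
}
```

This requires a CUDA-capable device and OpenCV built with the gpu module, so it only runs on the TK1 itself.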

Hi monkeykevin,

The variation has been a known problem with GPU work submission.
Could you provide your system info, such as BSP, OpenCV4Tegra, and CUDA versions, for reference?


Hi kayccc,

Thanks for the advice. My OpenCV4Tegra release is 21.4 and my CUDA version is 6.5.

And now I’m realizing the latency problem may not only be caused by CUDA…

I substituted a usleep(2000) call ( a 2ms delay ) for my original image processing function,

and found that sometimes it shifted to a 3ms delay… ( measured with gettimeofday() )

Even when I used sched_setscheduler() with priority 99 to make it real-time,

I still experience latency ( usleep(2000) but measured 3ms ) in the cycle while capturing…

Any suggestions for this??? I need to finish the image processing between image captures ( about 3ms ).

Thank you.

I don’t know specifics to help you, but one thing to consider is that any call of “sleep” or its variants allows the operating system context switching to deal with other threads or processes (not necessarily those related to your program). A sleep or usleep of “0” seems to do nothing, but it does in fact offer the kernel a chance to context switch to something else when it would not normally do so, and then eventually get back to the point of the sleep or usleep of 0. What you might be seeing is usleep time plus the operating system allowing other things to process (this is only a bad thing if the other processes are unrelated user space). As an experiment, try setting usleep to “0” and see what kind of latency you get.

Hi linuxdev,

Thank you.

I have tried usleep(0); I measured latency of about 1.6ms.

This is bad news for me.

Is this caused by other processes preempting my user-space process…??

Any suggestion for solving this?

Even with the chrt -f 99 command added, I still get a maximum 0.6ms delay… still bad news for me.

I have to finish my processing within a specific time ( about 3ms ). Although the processing will usually be done in 1ms, the “random latency” will sometimes push it over 3ms.

This is the code I tried…

#define LINUX

#ifdef LINUX
#include <unistd.h>
#include <stdio.h>
#include <iostream>

#include <sys/time.h>

using namespace std;
struct timeval tv1,tv2;

int main()
{
    double t = 0;
    double maxT = 0;

    while( true )
    {
        gettimeofday( &tv1, NULL );
        usleep( 2000 );    // stand-in for the image processing function ( 2ms delay )
        gettimeofday( &tv2, NULL );

        t = (double)(tv2.tv_usec - tv1.tv_usec)/1000000 + (double)(tv2.tv_sec - tv1.tv_sec);
        cout << "Time measured: " << t << endl;
        if( t > maxT )
            maxT = t;
        cout << "Maximum time measured: " << maxT << endl;
    }

    return 0;
}
#endif

I do not know specifics for your case, but if usleep of 0 takes more time than 0, then it is because any form of sleep tells the scheduler in the kernel that your thread or process is at a convenient point to context switch and service other parts of the system. Even without the usleep, the process would eventually have to service something else and do normal things like run the video card or ethernet…and in user space, things like audio, and many other processes want to run.

The scheduler decides the rules of what runs when. Different parts of the system and user space have different priorities; the “nice” number can be used to apply pressure from a particular process to gain more access to scheduling (normal is “0”, “-1” is higher priority…you have to be careful with more priority to not have unintentional side-effects). You mentioned sched_setscheduler(), which I haven’t used before, but it probably does the same thing…only no process can be set to a higher priority than a “0” nice number without root authority (it’d have to be run with sudo).

Any kind of sleep will always context switch momentarily if there is any pressure from anything else…this is functioning as designed. If you do not want context switching, you could wait on another thread using some alternate form of synchronization, and this thread could wait without sleep until your other thread does something to indicate it has done all it can in the time you are allotting it. Unless you’re using some sort of realtime system, the scheduler is going to give time to things unrelated to your own program…and if you are using realtime and don’t do it right, it could be a disaster.

You may want to describe in more detail what it is you wanted to happen when you used sleep, and why you picked this location to sleep.


Thank you for your quick reply.

I tried nice values before using sched_setscheduler(); it made the latency worse. Although the typical “delay” decreased, it “bursts” occasionally. ( For example, it can sleep for about 0s usually, but it bursts to 5ms after several loops. )

It seems like there is no solution without using a realtime operating system?

I am doing image processing to find coordinates, which are then transferred to a robot for control.

So the overall process loop is like

Camera triggered(3ms) -------> image processing -------> data transfer to dsp --------> robot control

It is best to process each image within the interval between camera triggers…
So I need the image processing plus data transfer time not to exceed 3ms…

It’s really a disaster for me if I cannot tune the scheduler…

This is usually something to optimize over time on a specific case-by-case basis, and it is very hard to just answer. A very important question…why use usleep instead of a thread which blocks until another thread completes? How many threads are you using now?

Hi linuxdev,

Actually I am using one thread for now, so there is no synchronization problem.

I tried usleep() in place of my image processing function as an experiment, because I found that there was “delay” after several loops with my image processing function. ( And now I realize that the delay may come from another process. )

Given my goal of finishing the image processing before the next image comes in, I need to finish my image processing within 3ms.

But now the “random delay” is about 1~2ms, which makes my image processing time exceed 3ms ( it usually takes only 1ms ).

If the processing time is longer than the camera trigger interval, the system will not be real time.

Any suggestion to make the scheduler concentrate only on my process? Or to distribute the “delay” across each loop of
image processing so it never exceeds 3ms? Right now I cannot “control” the delay…

Thank you.

Hi, after doing some research on Google, I have decided to tune the scheduling interval to a higher rate, which is 1000Hz. Will this work for my application?

I will try and report.

Sorry ahead of time for the long answer…what you’re asking is actually a complicated topic and very specific to a particular situation.

With a usleep in the code (even a usleep of 0) you can’t be certain about actual processing time your code spent using CUDA/GPU versus something related to context switching (even without a usleep there will be context switching at other times). Changing something related to scheduling can help if you get rid of the usleep, though fundamentally you would still need to adjust your code to work without the usleep function. Which particular kernel polling rate parameter did you change?

About context switching and scheduling…

Every time a process swaps out active/inactive, a lot has to be saved and then later restored. For example, anything related to owner of the process, security, current registers, and so on, is saved, and this takes time. Threads are a bit more efficient: they don’t need to swap out and restore what is shared among threads…for example, the thread contexts of one process will all share some process ownership and security information…because of this the amount of information to be saved and restored for a thread context switch is less than that of independent processes context switching. If you look at something like a particle engine, microthreads or coroutines are used instead…these are more efficient and can context switch (it’s just a branch, not a real context switch) within a single thread with negligible overhead, because the author has specifically found places in the particle processing where the code can jump to a different particle without bothering to save much of the data a thread context change would have saved/restored (this is typically done in assembler by a human, not by a compiler optimizer).

CPU0, the first CPU core, is wired to deal with hardware interrupts. Hardware servicing drivers must run on CPU0. Software interrupts and user space system calls can be serviced on any core. On a desktop Intel x86 machine, there is an I/O APIC which can distribute hardware interrupt servicing over any CPU core, and the AMD desktop CPU architecture also allows hardware IRQ processing on any core…embedded systems do not do this, and are stuck with CPU0 being required for hardware IRQ. Starve access to CPU0 for hardware, and the system hardware begins to mysteriously fail.

Regarding interrupts and interrupt polling rates, effects on user space and software-only-drivers are different versus a hardware driver. Each context switch (of any kind…probably increased with increased IRQ polling rate) has overhead. More context switches mean more lost time context switching…but if the system has enough time to do what is needed in spite of the overhead, then you’ve succeeded at making the software smoother and closer to real time. If you are crossing the boundary between “enough available CPU time slice regardless of context switch overhead” and “context switching overhead is eating into code which should have run but now can’t”, then the effect of the faster IRQ polling rate is harmful. A big part of whether resources are sufficient for the overhead or not is because only one CPU core can handle hardware interrupts, while anything else can be handled by any CPU core…it takes more to starve out non-hardware-IRQ servicing than to starve out hardware-IRQ servicing…the consequences of starving non-hardware drivers is also far less than if you starve hardware-IRQ handlers and lose access to your hard drive or other critical parts of the system. Try the faster polling rate, but watch carefully to see if hardware itself continues to respond normally (e.g., something just hangs on hard drive access, or networking starts lagging and losing data).

Suppose a non-hardware IRQ hits…there is a good chance at least one of the cores is at a convenient place to context switch to another thread or process…the scheduler is free to make the system feel responsive. Something may slow down, but the system will still respond “normally”. If the user space load goes up enough that the user space programs do not respond well to interrupt inefficiencies, the system itself can continue to operate normally so long as hardware drivers have access to their required time slice on CPU0…disk drive controllers, ethernet drivers, so on, continue to feed or gather data for the software which uses it.

If a hardware driver is using CPU0, and servicing requires less time slice than the IRQ polling rate, then it’s probably a good idea to increase the IRQ polling rate…the system will become more responsive (but less power efficient, the polling increases electrical power requirements and heat output). Once polling rate exceeds required hardware driver time slice, either the driver has to be ok with being swapped out, or the driver has to be in a code section which is atomic and cannot be preempted. You might be ok with the camera getting a partial frame capture (though probably not), you definitely won’t be ok with the hard disk waiting on a driver which is in turn waiting on the hard disk. If you have hardware drivers coded correctly, and only the minimum work required in the hardware driver is being performed (e.g., retrieve data, but process it in user space instead of kernel space), then it is probably ok if hardware drivers are sometimes in an atomic code section and refuse to give up CPU0 for a short time.

All of this big long explanation basically says that having one thread to produce data from hardware, and a separate thread to consume and process data, is likely the best place to start before you measure whether or not the GPU access itself is taking too long, or if the latency issues are for other reasons. The usleep method pretty much guarantees context switching, while two or more threads working together (producer/consumer threads) have a good chance of doing what you want, e.g., one thread to do nothing more than buffer and pipe image data as it is produced, and a separate thread to consume the image data as fast as it can…no usleep would be required.

Hi, after a long time I still can’t solve this problem :(

Tracing the process with the Tegra System Profiler, I found that the latency occurs periodically in my CUDA function
and is related to these symbols


link for larger image

The latency is always about 5ms, and my application is expected to finish below 3ms.

Any suggestion for solving this?

Thank you.

Hi, I think I have found the problem :)

cudaMallocManaged() causes the spikes…but I still don’t know the reason.

After I removed all the cudaMallocManaged() calls and replaced them with the zero-copy method ( cudaHostAlloc ), the spikes are gone. ( No more irritating 5ms latency! )
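For anyone hitting the same spikes, the replacement looks roughly like this ( a sketch only; on the TK1 the host and device share physical memory, so zero-copy mapped memory avoids the managed-memory maintenance that appeared to cause the stalls ):

```cpp
// Before: managed memory, which showed the periodic ~5ms stalls here.
// unsigned char *buf;
// cudaMallocManaged(&buf, size);

// After: zero-copy (mapped, pinned) host memory.
size_t size = 640 * 512;          // one 8-bit Bayer frame
unsigned char *h_buf = NULL;      // host pointer
unsigned char *d_buf = NULL;      // device alias of the same memory
cudaSetDeviceFlags(cudaDeviceMapHost);                  // must precede context creation
cudaHostAlloc((void **)&h_buf, size, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);    // pass d_buf to kernels
// ... launch kernels with d_buf; read results directly from h_buf ...
cudaFreeHost(h_buf);
```

This only runs on a CUDA device; check each call’s return code in real code.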

Many thanks to linuxdev, and to everyone, for helping :)