jetson.utils.videoSource.Capture() latency issue

I’m having quite serious latency issues in the Capture() method of the videoSource object. I’ve opened an issue about it on GitHub but there has been no response (yet). I’m sorry if I come across as impatient, but I’d like to draw some attention to it.

Is there anyone who might have an answer to this? Am I the only one experiencing this issue?

In short: whether sourcing from RTSP or a local file (on SSD), the Capture() method takes between 80 and 130 ms to complete, which makes it impossible to get anywhere near 25 fps for video. GStreamer (with the same pipeline) on the command line works fine.
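For reference, this is roughly how we measure it (a simplified sketch; the stream URI is just a placeholder for our actual source):

    import time
    import jetson.utils

    # placeholder URI; in our case this is an RTSP stream or a local .mp4 on SSD
    source = jetson.utils.videoSource("rtsp://192.168.1.10:8554/stream")

    for _ in range(100):
        start = time.perf_counter()
        img = source.Capture()   # this single call takes ~80-130 ms for us
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        print("Capture() took %.1f ms" % elapsed_ms)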

Any help would be greatly appreciated.

Hi @willemvdkletersteeg, please see my reply here on GitHub:

Thank you, Dusty. I also tried to discuss this latency/performance issue a few months ago (in February, I think) and the response back then was also to look into DeepStream. It was too complicated to rebuild our entire video capture and inferencing loop, including the processing of the metadata; DeepStream just works very differently in that regard.

Now that the application we’ve worked on for months is practically useless because of this one latency issue, I think we have no choice but to go the rebuilding route… But before we start from scratch, I’d like to at least try to keep as much of what we already have intact.

I have now built a drop-in replacement for the jetson.utils.videoSource object that builds a DeepStream/GStreamer pipeline in the constructor and offers a .Capture() method, just like jetson.utils, to “capture” the next frame. This works and seems to offer great performance. However, I use an appsink for this, which returns a GstSample/GstBuffer object in Python. I would like to somehow feed this into the jetson.inference.detectNet detector that we have (and need, because our entire application feeds off the output of this detector). A rough sketch of the replacement is below.
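Roughly, the replacement looks like this (a simplified sketch; the exact pipeline string and appsink setup are just what we happen to use and may need adjusting):

    import gi
    gi.require_version('Gst', '1.0')
    from gi.repository import Gst

    Gst.init(None)

    class GstVideoSource:
        """Drop-in stand-in for jetson.utils.videoSource, backed by an appsink."""

        def __init__(self, input_uri):
            # hardware-decode the file and hand RGBA frames to an appsink
            # (input_uri[5:] strips our 'file:' prefix)
            pipeline_str = (
                "filesrc location={} ! qtdemux ! h264parse ! nvv4l2decoder ! "
                "nvvidconv ! video/x-raw,format=RGBA ! "
                "appsink name=sink".format(input_uri[5:])
            )
            self.pipeline = Gst.parse_launch(pipeline_str)
            self.sink = self.pipeline.get_by_name("sink")
            self.pipeline.set_state(Gst.State.PLAYING)

        def Capture(self):
            # blocks until the next decoded frame is available
            sample = self.sink.emit("pull-sample")   # GstSample, or None at EOS
            return sample

The part I’m stuck on is what to do with the GstSample that Capture() returns: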

Is there a somewhat easy way to encapsulate GstBuffer data in a cudaImage object in Python that the detector accepts? Is there an existing class/constructor I can use for this somewhere? Otherwise I’ll have to build my own, I suppose.

No, there isn’t; the memory essentially needs to be copied from the GstBuffer into the cudaImage object (which is currently done in gstDecoder.cpp / gstCamera.cpp). It seems likely this is where the additional latency is coming from, because it’s a CPU memcpy().

What I’ve been working on is using NVMM memory, so that the CPU doesn’t need to perform any memcpy. So far it’s showing promising results in terms of utilization and latency. I should have a code update checked in within a day or two, and will let you know so you can try it out.

That’s a bummer. I have the GStreamer pipeline working with NVMM memory and it’s really fast (130 fps with an HD H.264 file). It’s just that I need to “cast” the GstBuffer data into a cudaImage capsule to do anything useful with it. Still trying to figure this out.

Good to hear that you’re also working on a solution; that would be even better (for other users as well). Python lacks a way of handling memory directly, so I suppose a solution in the shared library itself will potentially be more performant.

I had also been digging a little further, and the culprit is this line in gstDecoder.cpp:

	// wait until a new frame is received
	if( !mWaitEvent.Wait(timeout) )
		return false;

Most lines in the Capture() function take a few nanoseconds to complete, but this mWaitEvent.Wait() call takes anywhere from 80,000 to 120,000 ns to return, which is ages. But you’ve probably figured this out as well by now ;) haha

I have hit a dead end with my solution.

As far as I can tell, there is no way to use the appsink with NVMM memory: the reported size of the GstBuffer (in the GstSample) that the appsink produces is way too small. It’s probably garbage anyway.

If I remove the “(memory:NVMM)” caps from the pipeline:

    filesrc location={input_uri[5:]} ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! video/x-raw,format=RGBA ! appsink

I can grab samples perfectly well. I can transform them into a numpy array and then pass that array to jetson.utils.cudaFromNumpy():

    # caps and buf come from the GstSample pulled from the appsink
    np_array = numpy.ndarray(
        (caps.get_structure(0).get_value('height'),
         caps.get_structure(0).get_value('width'),
         4),
        buffer=buf.extract_dup(0, buf.get_size()),
        dtype=numpy.uint8)
    return jetson.utils.cudaFromNumpy(np_array)

It works, but it totally defeats the purpose: my entire pipeline runs in NVMM memory, then the buffer has to be copied to CPU memory for the appsink and the numpy conversion, and then copied back to GPU memory again for the inferencing. Totally inefficient, not a viable solution.
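For completeness, this is roughly what the loop looks like once the cudaImage comes out of the conversion above (a simplified sketch; the network name and threshold are placeholders for our actual model):

    import jetson.inference

    # placeholder model/threshold; ours is a custom detectNet model
    net = jetson.inference.detectNet("ssd-mobilenet-v2", threshold=0.5)

    while True:
        cuda_img = source.Capture()        # our replacement Capture(), after conversion to cudaImage
        if cuda_img is None:
            break
        detections = net.Detect(cuda_img)  # the rest of the application consumes these detections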

I hope your solution works! Can’t wait to find out.

That’s because that GstBuffer is simply a descriptor of the NVMM memory handle, and the nvbuf_utils / EGL APIs need to be used to map it into CUDA. These are C/C++ APIs, so they aren’t accessible from Python - which isn’t a problem for me, because my underlying implementation is in C++ (and gets exposed via my Python extension modules).

Anyhow, I’ve committed the initial changes to gstDecoder here:

I recommend that you re-clone/re-build from source. These changes currently only cover gstDecoder (so receiving RTP/RTSP and reading video files); I’m going to refactor this so it works with gstCamera as well. It would be good to know if it works/improves things for you. I don’t think you should need to make any modification to the command line. After re-building, you should see that the printed pipeline now specifies NVMM memory.