Design and architecture guidance

Hi, I am looking to do the following. I want to have the Jetson Nano accept a high resolution camera video input from the CSI input, crop a specific region of it (conforming to a specific aspect ratio), scale it to a constant resolution (conforming to the same aspect ratio), and output it, say, to a display sink.

I know how to accomplish this using a gstreamer pipeline statically.

The part that I want to change is this: I want to sample and inspect some of the frames at a set rate (for example, every 30th frame) using jetson-inference, and adjust the settings of the crop element based on the inference results.

What would be the best way to architect this? Would it be best to use a gstreamer pipeline to stream the source to a display sink, but also tee the video before the crop element and use an appsink to feed it to the analysis portion? Or is there another structure that might work better?

In jetson-inference, there is glDisplay. It is probably better to use that implementation directly.

I am not clear on what you mean. How can I use glDisplay in this context to display a fixed-size window into which a specific subportion of the input video is scaled and shown, while the coordinates of the subportion and the associated scaling parameters are being dynamically modified?

It looks like I should use glDisplay's Render() instead of RenderOnce() for video, and SetViewport() to set the subarea of the video to show? I noticed that the jetson-inference tools modify the video in place to insert the detections. What if I want to use the detections to crop the video, but I don’t want any bounding boxes etc. added to the video stream?

Where can I find documentation on gldisplay?

Hi @DaneLLL or @kayccc, could you please provide some guidance on the best way to implement this in terms of architecture? I also want to add something I forgot to mention before. Here is what I am looking to do:

Camera image -> undistort -> image sampling point for inference -> crop -> scale -> output (to display or file)

In parallel, I want to sample the undistorted image prior to the crop every N frames, analyze the image (run a detector on it), and use the results from the analysis of this frame (and potentially the N previously analyzed frames) to determine parameters for the crop element. I am hoping to keep my video at 30 fps.
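To make the sampling logic concrete, here is a minimal plain-Python sketch of the "analyze every Nth frame, crop every frame" idea. The `analyze()` function is a placeholder standing in for the real detector, and `crop_params` is what would be pushed to the crop element:

```python
# Sketch: sample one out of every n frames for analysis, while every
# frame is cropped using the most recent analysis result.

def analyze(frame):
    # placeholder: a real detector would return a region of interest here
    return {"x": 0, "y": 0, "w": 1280, "h": 720}

def process_stream(frames, n=30):
    crop_params = None   # last known crop region (None until first analysis)
    analyzed = 0
    out = []
    for i, frame in enumerate(frames):
        if i % n == 0:            # sampling point: every nth frame
            crop_params = analyze(frame)
            analyzed += 1
        # every frame (analyzed or not) uses the latest crop parameters
        out.append((frame, crop_params))
    return analyzed, out
```

The same counter pattern works whether the frames arrive from an appsink callback or from a capture loop.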

My questions are:

  1. What is the best way to undistort the image most efficiently? If I use OpenCV procedures to determine a camera undistortion matrix, what is the best way to utilize that on the Jetson? It has to be efficient because I have to undistort every frame. Is the OpenCV undistort element in gstreamer the way to go, or is there a better option?

  2. What is the best way to sample images after undistort but before crop with a sampling rate? i.e. perhaps sample one out of every 30 frames for analysis?

  3. How can I set the crop parameters based on the analysis?

Thanks for any help you are able to provide. If you can give me a high-level idea and point me to any documentation, I can dive deep, learn more about this, and figure it out.

Also, @kayccc I don’t think the tags are right. I am not asking about board design. I am asking about software design to accomplish a specific task using the Jetson Nano dev kit.

I did some research today - let me know if I am on the right track. My two options seem to be:

  1. Write it as a gstreamer C++ application with two pipelines, an appsink, and an appsrc. The code that handles data delivery to the appsink will use a counter to run inference on every Nth frame. It would then send a cropped version of the image to the appsrc, and that pipeline would scale it as needed. For this, I am missing any kind of guidance on writing a gstreamer application on the Nano. Should I just follow the regular gstreamer application development guide?

  2. Write it using videoSource and glDisplay. But I am hitting a wall due to lack of documentation. For instance, I can’t even figure out how to specify the input width, height, or rotation for videoSource, or the output window width or height for glDisplay. Or how to use detectNet without it overlaying a bounding box for the detections on the image. From looking at the files, it looks like it may be a true/false parameter for the last argument? I have looked through the getting started guide, but it doesn’t get me far enough along. Is there another place I should be looking?

Another potential method I thought of is to write my application as a GST plugin. Would it suit my needs better?

Hi @cloud9ine, DeepStream SDK has custom GStreamer plugins that you could leverage in writing your own application. If you were to write your own plugin, for performance it should use CUDA or libraries that use CUDA for image processing.

Regarding your questions about jetson-inference, you can find the documentation here:

The bottom link has examples of cropping via C++/Python that use CUDA underneath. The C++ implementation of glDisplay has more functions than the Python bindings support. In general, the image processing is done with CUDA functions, then the final image is passed to glDisplay for rendering. It also supports other outputs, like saving to compressed video or streaming via RTP.

To have detectNet not overlay bounding boxes on the video, start the detectnet app with --overlay=none. This passes the overlay flag through to detectNet.Detect()

You can use detectNet without glDisplay - in C++ the image data is available via the data pointer, or in Python you can use these image accessors or numpy arrays:
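A minimal sketch combining both points, assuming the Python bindings from a recent jetson-inference build (the `overlay` keyword on `Detect()` and `cudaToNumpy()` exist upstream, but exact signatures vary between versions, so treat this as illustrative):

```python
import jetson.inference
import jetson.utils

net = jetson.inference.detectNet("ssd-mobilenet-v2")
img = jetson.utils.loadImage("input.jpg")

# run detection without drawing boxes into the frame
detections = net.Detect(img, overlay="none")

# zero-copy numpy view of the CUDA image (shares the same memory)
array = jetson.utils.cudaToNumpy(img)
```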


Hi @dusty_nv

Thank you so much for responding. I reviewed the api docs you posted, the deepstream SDK documentation, as well as the info on how to access image data in python.

The detectNet API info is clear. I can call detectNet.Detect() with the image, width, height, and "none" for the overlay parameter to avoid drawing on the image. I can then use the detections to crop the image and use glDisplay to render it. So far so good.
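A hedged sketch of the loop I have in mind, assuming a recent jetson-inference build (binding signatures vary between releases; `cudaAllocMapped` and `cudaCrop` are from jetson.utils, and the model name is just an example):

```python
import jetson.inference
import jetson.utils

camera = jetson.utils.videoSource("csi://0")
display = jetson.utils.videoOutput("display://0")
net = jetson.inference.detectNet("ssd-mobilenet-v2")

while display.IsStreaming():
    img = camera.Capture()
    detections = net.Detect(img, overlay="none")  # no boxes drawn into the frame
    if detections:
        d = detections[0]
        roi = (int(d.Left), int(d.Top), int(d.Right), int(d.Bottom))
        crop = jetson.utils.cudaAllocMapped(
            width=roi[2] - roi[0], height=roi[3] - roi[1], format=img.format)
        jetson.utils.cudaCrop(img, crop, roi)
        display.Render(crop)
    else:
        display.Render(img)
```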

The jetson-utils API documentation is missing a lot of the info I am looking for (or maybe I am not reading it correctly). For instance, how do I initialize gstCamera or videoSource to use a CSI MIPI camera with a resolution of 4032x3040, a frame rate of 30 fps, and a flip-method of 1? Similarly, what if I crop the image using the image data manipulations you linked, scale it to 720x1280, and then show it in a 1280x720 glDisplay window by flipping the image clockwise again?

I tried passing these arguments while initializing gstCamera and videoSource as a second argument string, as part of the input URI string, and in other ways, and I simply cannot get it to work. For example:

camera = jetson.utils.videoSource("csi://0 --input-width=4032 --input-height=3040 --flip-method=1")


camera = jetson.utils.videoSource("csi://0", "--input-width=4032 --input-height=3040 --flip-method=1")

Similarly for glDisplay as well.

None of it seems to affect the gstreamer pipeline that gets used in the background for the input stream, or the properties of the output window. If I am just being completely obtuse, please let me know. I feel that jetson-inference and jetson-utils have almost everything I need, if I could only figure out how to control these parameters. DeepStream might be a bit too deep for me, and since I don’t need to run inference on every frame, I am hoping to use a counter and do it selectively. So jetson-utils and jetson-inference look like the way to go, if I can figure out these details.

Sorry to bump this again.

I have made progress on turning off overlay and processing the detections.

It looks like the way to control videoSource is to use a videoOptions struct. I’m hitting a dead end, though, on how to set up and pass this structure from Python when calling videoSource. Could you please point me to an example?

Unfortunately there aren’t Python bindings for this struct - instead, you can pass a list of options (in command-line format) to the argv keyword argument of videoSource. For example:

input = jetson.utils.videoSource("/dev/video0", argv=["--input-width=640", "--input-height=480"])

Here are the arguments available:
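As an applied example for the CSI configuration asked about earlier (hedged: I believe the relevant videoOptions flags are --input-width, --input-height, --input-rate, and --input-flip, but confirm the exact names and flip values against the --help output of your jetson-inference version):

```python
# hypothetical example; flag names follow jetson-utils videoOptions
camera = jetson.utils.videoSource("csi://0", argv=[
    "--input-width=4032",
    "--input-height=3040",
    "--input-rate=30",
    "--input-flip=counterclockwise",  # assumed equivalent of nvargus flip-method=1
])
```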


I’ve made significant progress on this. I’m able to do all the required operations in CUDA, except for two operations, which are causing my fps to drop to 12-13.

  1. A 90-degree flip. Is there any way to do it in CUDA? I see the image manipulation functions include crop and resize, but not any kind of flip. I don’t see a flip argument for videoOutput either.

  2. Padding. Is there a way to pad an image with a single-colour border?

Hi @cloud9ine, I don’t have these CUDA functions in my library currently, but they should be fairly simple should you wish to implement them. You can find some examples under jetson-inference/utils/cuda. In particular, these files have simpler functions which you could base your own kernels off of:
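To illustrate the per-pixel index mapping such kernels would implement, here is a plain-numpy sketch (each loop iteration corresponds to what one CUDA thread would compute; this is reference-only and far too slow for real-time use):

```python
import numpy as np

def rotate90_clockwise(img):
    # mapping a CUDA kernel would apply per thread: out[x, H-1-y] = in[y, x]
    h, w = img.shape[:2]
    out = np.empty((w, h) + img.shape[2:], dtype=img.dtype)
    for y in range(h):
        for x in range(w):
            out[x, h - 1 - y] = img[y, x]
    return out

def pad_border(img, pad, color):
    # fill the whole output with the border color, then copy the
    # source image into the center region
    h, w = img.shape[:2]
    out = np.full((h + 2 * pad, w + 2 * pad) + img.shape[2:],
                  color, dtype=img.dtype)
    out[pad:pad + h, pad:pad + w] = img
    return out
```

A CUDA version would launch one thread per output pixel and compute the same source index (or write the border color) in each thread.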


Thanks, @dusty_nv. I was able to implement my own cudaFlip function, add a Python binding, and get it all to work. That part of the code is all good and fast, but now glDisplay is taking 0.05 seconds to render each 1280x540 image. That puts my maximum achievable frame rate at 20 fps, and because of the other code I’m running, I’m ending up at about 15 fps.

Is there any way to speed up the rendering on glDisplay?

I should have said videoOutput in the comment above. I believe it uses glDisplay in the background.

I used time() to do a rough profile of parts of my code. I tried two variations:

  1. Complete image processing using CUDA, then transfer to OpenCV for display. This is what a typical loop looks like (I am only doing inference on one out of every 20 frames, so the first portion would be a bit longer on that one frame):

    It took 0.000151872634888 seconds for everything until image manipulation using cuda.
    It took 5.72204589844e-05 seconds for image manipulation using cuda.
    It took 0.0481369495392 seconds to transfer image to opencv.
    It took 0.0021071434021 seconds to render the image.

If I render using videoOutput, it becomes:

It took 0.000125885009766 seconds for everything until image manipulation using cuda.
It took 5.91278076172e-05 seconds for image manipulation using cuda.
It took 0.0488979816437 seconds to render the image.

So, OpenCV seems to render faster, but transferring from CUDA to OpenCV (CUDA to numpy array plus colorspace conversion) takes about as long as glDisplay takes to render the image. Is there any way I can bring down the rendering time by half?

@cloud9ine without doing further profiling inside glDisplay/glTexture, it would be hard to determine if it could be made faster. For display://* outputs, videoOutput uses glDisplay (glDisplay is an instance of videoOutput). When rendering textures, it uses CUDA<->OpenGL interoperability.

Can you tell if the rendering time is constant or does it vary with the resolution of the image?

Yes, I just checked. It’s taking twice as long to render a 1920x1080 image as it does to render a 1280x720 image.

Is there any alternative? For instance, if I build OpenCV with CUDA support, would that eliminate or speed up the data transfer from CUDA to OpenCV?