Design and architecture guidance

dusty_nv · January 25, 2021, 5:43pm

I don’t believe the OpenCV GPU module has Python bindings, so the transfer would be the same. After some profiling of glDisplay, it seems that this slowdown is only seen on the Nano and not on the other Jetsons, so I’m not sure if it’s a deficiency in my CUDA<->OpenGL code or if the smaller Nano GPU is simply already fully utilized.

cloud9ine · January 25, 2021, 6:12pm

It looks like at least some functions have Python bindings now:

However, I can’t seem to locate any info on which CUDA functions have Python bindings and which don’t. I guess I will try calling them and find out.

cloud9ine · January 25, 2021, 6:20pm

Could this be another way to go? I know it is C++, but that is okay.

It looks like it is able to use OpenGL support within OpenCV for rendering.

dusty_nv · January 25, 2021, 6:39pm

That zero-copy shared memory is what jetson-inference is already using. It also needs to get it in OpenGL for rendering to the display, and that is when it uses CUDA<->OpenGL interoperability.

So if you use jetson-inference in C++, then you have the direct memory pointer to the CUDA data (which is also mapped into CPU memory) and can probably get it into OpenCV more easily. It could be either a CPU cv::Mat or a GpuMat, because the jetson-inference pointers are mapped to both CUDA and CPU (since they are allocated as shared zero-copy memory).

cloud9ine · January 29, 2021, 8:32am

detect-test.cpp (6.1 KB)

Hi @dusty_nv, as a start to porting my code to C++, I started with detectnet.cpp in examples and bastardized it a bit. However, it is really unstable and unable to even relay a video stream straight. A lot of the time it gets stuck saying “failed to capture video frame” repeatedly. Even when it somewhat works, it hangs and stops updating the output stream and pretty much bogs down the Nano (I have the 2GB version). On the other hand, my Python code is able to take in the 1920x1080 stream, crop it, resize it, flip it, and pad it at 15 fps. I feel like something is very wrong in my flow here - could you please tell me if you see something out of place?

dusty_nv · January 29, 2021, 4:24pm

Hi @cloud9ine, looking over your C++ code, you are allocating new data each frame (with cudaAllocMapped) and never freeing it, which causes a memory leak.

Move your calls to cudaAllocMapped to above the main while() loop, so that the memory is only allocated once at initialization time, and that should help.
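The fix described above can be sketched without any CUDA calls. This is a CPU-only illustration, not the poster's code: std::malloc and std::free stand in for the CUDA allocation and its matching free, and processFrame and runLoop are hypothetical names introduced only for this example.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical per-frame step: it only touches memory the caller already owns.
inline void processFrame(unsigned char* buffer, std::size_t bytes, int frameIndex)
{
    for (std::size_t i = 0; i < bytes; ++i)
        buffer[i] = static_cast<unsigned char>(frameIndex);
}

// Allocate once before the loop, reuse the buffer for every frame, free once.
// Returns the number of allocations performed (1 on success, 0 on failure).
inline int runLoop(int numFrames, std::size_t bytesPerFrame)
{
    int allocations = 0;

    // Stand-in for the one-time CUDA allocation done at initialization.
    unsigned char* buffer = static_cast<unsigned char*>(std::malloc(bytesPerFrame));
    if (buffer == nullptr)
        return allocations;
    ++allocations;

    for (int frame = 0; frame < numFrames; ++frame)
        processFrame(buffer, bytesPerFrame, frame);   // no allocation in here

    std::free(buffer);   // released once, after the loop
    return allocations;
}
```

Calling runLoop(10, 640 * 480 * 3) performs a single allocation no matter how many frames are processed, which is the property the leaking version lacked.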

cloud9ine · January 29, 2021, 4:31pm

Got it. One question: if we have an image that we don’t know the size of but we know the max size possible, is there a way to allocate the max size but for the image to be smaller than the space allocated?

For example, if I’m cropping my 1920x1080 frame to a detected area every frame, is there a way to preallocate the full 1920x1080 size to that pointer and use variables to keep track of the actual image size within that pointer?

Or should I allocate that one on every loop and use cudaFree to release it after every use?

cloud9ine · January 29, 2021, 5:17pm

Related question in Python as well. If I allocate the space outside the loop and use, say, cudaResize, or img.width and img.height, will it use the actual size of the image data rather than the size of the allocated space?

dusty_nv · January 29, 2021, 6:42pm

In C++ you can do that: just allocate the image at the max size, then keep track of the image resolution you are currently using with variables. I would not allocate/free every loop, because allocating CUDA memory takes time. If the image size is relatively static, you could check the image size each frame and re-allocate only if needed. But just allocating the max size once is easy.

In Python it will use the size of the allocated space, because these dimensions are stored inside the cudaImage object. What you could do is check the current size against the desired size each frame and re-allocate as needed (which hopefully occurs infrequently).
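A minimal sketch of the C++ suggestion above, assuming tightly packed rows. FrameBuffer is a hypothetical helper, not part of jetson-utils, and a std::vector stands in for the CUDA allocation so the example runs on the CPU.

```cpp
#include <cstddef>
#include <vector>

// A buffer allocated once for the largest expected frame, plus variables
// that track the size of the image currently stored in it.
struct FrameBuffer
{
    std::vector<unsigned char> data;   // stand-in for a CUDA allocation
    int channels;
    int width;                         // current image width
    int height;                        // current image height

    FrameBuffer(int maxW, int maxH, int ch)
        : data(static_cast<std::size_t>(maxW) * maxH * ch),
          channels(ch), width(maxW), height(maxH) {}

    // Record a new image size. Returns false (and changes nothing) if the
    // requested size does not fit in the preallocated space.
    bool setSize(int w, int h)
    {
        if (w <= 0 || h <= 0)
            return false;
        if (static_cast<std::size_t>(w) * h * channels > data.size())
            return false;
        width = w;
        height = h;
        return true;
    }

    // Bytes occupied by the current image, assuming tightly packed rows.
    std::size_t usedBytes() const
    {
        return static_cast<std::size_t>(width) * height * channels;
    }
};
```

With FrameBuffer fb(1920, 1080, 3), a later fb.setSize(640, 360) only updates the tracked dimensions. A false return is the signal that a larger buffer is needed, which matches the re-allocate-only-if-needed variant.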

cloud9ine · February 1, 2021, 7:08am

Thanks. I switched from the Nano to a Xavier NX this weekend, ported my Python code, and have it working. Since I’m allocating memory outside the loop now, it works great. However, when the aspect ratio of the target image differs from my display, I pad the image. To do the padding, my custom CUDA kernel simply returns when it is in the padding area of the output image. The problem is that image data left behind in this region by the most recently displayed full image continues to show in my padding margin. What’s the best way to set these pixels to a static color in my custom CUDA kernel that would be compatible with all image formats?

dusty_nv · February 1, 2021, 4:57pm

You could do a cudaMemset() call before you launch your custom CUDA kernel. You could make this call in your C function that launches your custom CUDA kernel. For example, if you called cudaMemset(ptr, 0, size), it would set all the pixels in the image to black.
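The clear-then-draw order can be sketched on the CPU as follows. This is only an illustration under stated assumptions: std::memset stands in for cudaMemset, Pixel3 stands in for CUDA's uchar3, and drawPadded is a hypothetical function that plays the role of the custom kernel. Because the clear works byte by byte, a value of 0 gives black for 8-bit and float pixel formats alike, while any alpha channel would also become 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for CUDA's uchar3 pixel type (three 8-bit channels).
struct Pixel3 { unsigned char x, y, z; };

// Clear the whole output image to black, then write content only inside a
// horizontally and vertically centered region, leaving the margin black.
inline void drawPadded(std::vector<Pixel3>& out, int outW, int outH,
                       int contentW, int contentH, Pixel3 color)
{
    assert(out.size() >= static_cast<std::size_t>(outW) * outH);
    assert(contentW <= outW && contentH <= outH);

    // Stand-in for cudaMemset(ptr, 0, size) issued before the kernel launch.
    const std::size_t size = sizeof(Pixel3) * outW * outH;
    std::memset(out.data(), 0, size);

    const int x0 = (outW - contentW) / 2;   // left margin width
    const int y0 = (outH - contentH) / 2;   // top margin height

    // Stand-in for the kernel: pixels in the padding area are never written.
    for (int y = 0; y < contentH; ++y)
        for (int x = 0; x < contentW; ++x)
            out[static_cast<std::size_t>(y0 + y) * outW + (x0 + x)] = color;
}
```

Stale data from an earlier, wider image cannot survive in the margin, because every pixel is reset before the content region is written.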

cloud9ine · February 1, 2021, 7:17pm

Thanks! That worked like a charm. For anyone else reading this, I used sizeof(uchar3) * image_width * image_height for the size argument.

By the way, is there a guide to figuring out camera calibration coefficients and unwarping using CUDA? I see jetson-utils/cudaWarp.h at master · dusty-nv/jetson-utils · GitHub, but I am trying to figure out how to get the necessary coefficients.

dusty_nv · February 1, 2021, 8:50pm

Those use the typical intrinsic camera calibration coefficients that you would get from the OpenCV or MATLAB tools:

cloud9ine · April 6, 2021, 7:05pm

Hi dusty_nv

I performed camera calibration using OpenCV (following OpenCV: Camera Calibration) and obtained the following info. I also got rotation and translation vectors (not posted below).

Camera matrix:
[[1.36518031e+03 0.00000000e+00 9.67644312e+02]
[0.00000000e+00 1.36397781e+03 5.37337754e+02]
[0.00000000e+00 0.00000000e+00 1.00000000e+00]]

These are fx, fy, cx, and cy.

Distortion coefficients:
[[-0.54901265 0.49591819 0.00108101 -0.00166793 -0.35424473]]
These are (k1, k2, p1, p2, k3).

I believe the focal lengths and optical centers are intrinsic and the distortion coefficients are extrinsic.

I see that cudaWarpIntrinsic has the following signature:

cudaError_t cudaWarpIntrinsic( uchar4* input, uchar4* output, uint32_t width, uint32_t height,
const float2& focalLength, const float2& principalPoint, const float4& distortion );

Are the principal points the same as the optical centers?

Also, the distortion coefficients are a float4, but I have five coefficients - is there documentation on which ones to use?

Is this the right approach?

dusty_nv · April 6, 2021, 7:24pm

Hi @cloud9ine,

(fx, fy) is the focal length vector
(cx, cy) is the principal point vector
(k1, k2, p1, p2) is the distortion float4 vector
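The mapping above can be written out as a small conversion helper. This is a sketch under stated assumptions: Float2 and Float4 are stand-ins for CUDA's float2 and float4, WarpIntrinsics and fromOpenCV are hypothetical names, the camera matrix is taken in row-major order, and k3 is simply dropped because the float4 only holds four coefficients.

```cpp
#include <array>

// Stand-ins for CUDA's float2/float4 vector types.
struct Float2 { float x, y; };
struct Float4 { float x, y, z, w; };

// The three arguments that describe the camera in this mapping.
struct WarpIntrinsics
{
    Float2 focalLength;     // (fx, fy)
    Float2 principalPoint;  // (cx, cy)
    Float4 distortion;      // (k1, k2, p1, p2); k3 is not used
};

// K is the 3x3 OpenCV camera matrix in row-major order:
//   [fx 0 cx; 0 fy cy; 0 0 1]
// dist holds OpenCV's five coefficients (k1, k2, p1, p2, k3).
inline WarpIntrinsics fromOpenCV(const std::array<double, 9>& K,
                                 const std::array<double, 5>& dist)
{
    WarpIntrinsics out;
    out.focalLength    = Float2{ static_cast<float>(K[0]), static_cast<float>(K[4]) };
    out.principalPoint = Float2{ static_cast<float>(K[2]), static_cast<float>(K[5]) };
    out.distortion     = Float4{ static_cast<float>(dist[0]), static_cast<float>(dist[1]),
                                 static_cast<float>(dist[2]), static_cast<float>(dist[3]) };
    return out;
}
```

With the calibration posted above, focalLength becomes roughly (1365.18, 1363.98), principalPoint roughly (967.64, 537.34), and distortion holds the first four coefficients.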

cloud9ine · April 11, 2021, 8:06pm

Thanks!

I tried this, but it looks like there isn’t a prototype for this function that takes uchar3* for the images. It’s looking for uchar4* or float4*. The image I get from videoSource seems to be uchar3*. Is there a way to somehow use this function with input and output being uchar3*? Or is there a way to convert the uchar3* image to uchar4* or float4*?

dusty_nv · April 12, 2021, 1:24am

You can use the cudaConvertColor() function, or just change the type of pointer you pass to videoSource::Capture(). For example, change this line from uchar3* to uchar4* (or float4*):

https://github.com/dusty-nv/jetson-utils/blob/c373f49cf21ad2cae7e4d7da7c41f4fd6473958f/video/video-viewer/video-viewer.cpp#L105
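For readers unsure what a three- to four-channel conversion does per pixel, here is a CPU-only illustration. It is not the jetson-utils implementation: Pixel3 and Pixel4 stand in for uchar3 and uchar4, expandToFourChannels is a hypothetical name, and the fully opaque alpha of 255 is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Stand-ins for CUDA's uchar3 and uchar4 pixel types.
struct Pixel3 { unsigned char x, y, z; };
struct Pixel4 { unsigned char x, y, z, w; };

// Copy each three-channel pixel into a four-channel pixel, filling the extra
// channel with a constant alpha value (255 = opaque, assumed here).
inline std::vector<Pixel4> expandToFourChannels(const std::vector<Pixel3>& in,
                                                unsigned char alpha = 255)
{
    std::vector<Pixel4> out;
    out.reserve(in.size());
    for (const Pixel3& p : in)
        out.push_back(Pixel4{ p.x, p.y, p.z, alpha });
    return out;
}
```

Requesting the four-channel type directly at capture time avoids this extra pass, which is why changing the pointer type is the simpler of the two options.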

cloud9ine · April 12, 2021, 3:20am

Switching all my images to uchar4* worked like a charm - thanks!

Unrelated question: If I want to take keyboard or other HID input into my C++ program that uses these jetson utils and inference, how would I go about it? I am comparing it to the waitKey function in OpenCV. Also, if I am using videoOutput/glDisplay, will the program still get the input key/button press even if the video window is in focus and not the terminal window?

dusty_nv · April 12, 2021, 1:47pm

The simplest way would probably be to use the glDisplay::GetKey() function. You can use it like this:

#include <X11/keysymdef.h>
#include "glDisplay.h"

// create the output stream as normal
videoOutput* outputStream = videoOutput::Create(cmdLine, ARG_POSITION(1));

// cast to glDisplay
glDisplay* display = NULL;

if( outputStream->IsType<glDisplay>() )
    display = (glDisplay*)outputStream;

// query key status (display stays NULL if the output is not a glDisplay)
if( display != NULL && display->GetKey(XK_a) )
    printf("A key is down\n");

If you want events instead, you can add an event handler callback to your glDisplay instance. See here for the events that are defined: https://github.com/dusty-nv/jetson-utils/blob/c373f49cf21ad2cae7e4d7da7c41f4fd6473958f/display/glEvents.h

cloud9ine · April 12, 2021, 4:44pm

You meant

display = (glDisplay*)outputStream;

right?

I tried with this, verified that I am getting true for

outputStream->IsType<glDisplay>()

Then, in the loop, I am trying to use

if( display->GetKey(XK_m) )

but it never returns true even if I hold down the “m” key continuously with the video output window in focus. Am I missing something? Does the key cache get cleared every time I render a new image to the output window?