Design and architecture guidance

I don’t believe the OpenCV GPU module has Python bindings, so the transfer would be the same. After some profiling of glDisplay, this slow-down seems to appear only on the Nano and not on the other Jetsons, so I’m not sure whether it’s a deficiency in my CUDA<->OpenGL code or whether the smaller Nano GPU is simply already fully utilized.

It looks like at least some functions have python bindings now:

However, I can’t find any info on which CUDA functions have Python bindings and which don’t. I guess I will try calling them and find out.

Could this be another way to go? I know it is C++. That is okay,

looks like it is able to use OpenGL support within OpenCV for rendering.

That zero-copy shared memory is what jetson-inference is already using. It also needs to get it in OpenGL for rendering to the display, and that is when it uses CUDA<->OpenGL interoperability.

So if you use jetson-inference in C++, then you have the direct memory pointer to the CUDA data (which is also mapped into CPU memory) and can probably get it into OpenCV more easily. It could be either a CPU cv::Mat or a GpuMat, because the jetson-inference pointers are mapped to both CUDA & CPU (since they are allocated as shared zero-copy memory).

detect-test.cpp (6.1 KB)

Hi @dusty_nv, as a start to porting my code to C++, I started with detectnet.cpp in examples and bastardized it a bit. However, it is really unstable and can’t even relay a video stream straight through. A lot of the time it gets stuck saying “failed to capture video frame” repeatedly. Even when it somewhat works, it hangs, stops updating the output stream, and pretty much bogs down the Nano (I have the 2GB version). On the other hand, my Python code is able to take in the 1920x1080 stream, crop it, resize it, flip it, and pad it at 15 fps. I feel like something is very wrong in my flow here. Could you please tell me if you see something out of place?

Hi @cloud9ine , looking over your C++ code, you are allocating new data each frame (with cudaAllocMapped) and never freeing it, so it is causing a memory leak.

Move your calls to cudaAllocMapped to above the main while() loop, so that the memory is only allocated once at initialization time, and that should help.
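
To make the suggestion above concrete, here is a minimal host-only sketch of the allocate-once pattern. It is not the code from detect-test.cpp: CountingAllocator and processStream are hypothetical names, and std::malloc stands in for cudaAllocMapped so the sketch compiles without CUDA. The point is only that the allocation sits above the loop and the loop body reuses the same pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the CUDA allocator: counts how often memory is requested.
struct CountingAllocator {
    int allocations = 0;
    void* alloc(std::size_t bytes) { ++allocations; return std::malloc(bytes); }
    void release(void* ptr) { std::free(ptr); }
};

// Processes `frames` frames while reusing one buffer allocated before the loop.
// Returns the number of frames processed, or -1 if the allocation failed.
int processStream(CountingAllocator& allocator, int frames,
                  std::size_t width, std::size_t height) {
    const std::size_t bytes = width * height * 3;  // 3 bytes per pixel, like uchar3
    assert(bytes > 0);

    void* frame = allocator.alloc(bytes);          // allocated once, at initialization
    if (frame == nullptr) return -1;

    int processed = 0;
    for (int i = 0; i < frames; ++i) {
        // Placeholder for capture + processing: overwrite the same buffer each frame.
        std::memset(frame, i & 0xFF, bytes);
        ++processed;
    }

    allocator.release(frame);                      // freed once, at shutdown
    return processed;
}
```

With this shape, the number of allocations stays at one no matter how many frames are processed, which is what removes the leak.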

Got it. One question: if we have an image that we don’t know the size of but we know the max size possible, is there a way to allocate the max size but for the image to be smaller than the space allocated?

For example, if I’m cropping my 1920x1080 frame to a detected area every frame, is there a way to preallocate the full 1920x1080 size to that pointer and use variables to keep track of the actual image size within that pointer?

Or should I allocate that one on every loop and use cudaFree to release it after every use?

Related question in Python as well. If I allocate the space outside the loop and use, say, cudaResize, or img.width and img.height, will it use the actual size of the image data rather than the size of the allocated space?

In C++ you can do that: just allocate the image at the max size, then keep track of the image resolution you are currently using with variables. I would not allocate/free every loop, because allocating CUDA memory takes time. If the image size is relatively static, you could check the image size each frame and re-allocate only if needed. But just allocating the max size once is easy.
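
As a rough illustration of the max-size approach, the sketch below keeps one buffer sized for the largest frame and tracks the active resolution separately. MaxSizeImage and Pixel3 are hypothetical names (Pixel3 stands in for uchar3), and a std::vector stands in for the CUDA allocation so the example builds without CUDA. It assumes the active image is stored packed at the start of the buffer, so the row pitch equals the current width.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Pixel3 { std::uint8_t x, y, z; };  // stand-in for CUDA's uchar3

// One allocation sized for the largest frame; width/height track the part in use.
class MaxSizeImage {
public:
    MaxSizeImage(std::uint32_t maxWidth, std::uint32_t maxHeight)
        : maxWidth_(maxWidth), maxHeight_(maxHeight),
          width_(maxWidth), height_(maxHeight),
          data_(static_cast<std::size_t>(maxWidth) * maxHeight) {
        assert(maxWidth > 0 && maxHeight > 0);
    }

    // Changes the active resolution without touching the allocation.
    // Returns false (and changes nothing) if the request does not fit.
    bool setSize(std::uint32_t width, std::uint32_t height) {
        if (width == 0 || height == 0 || width > maxWidth_ || height > maxHeight_)
            return false;
        width_ = width;
        height_ = height;
        return true;
    }

    std::uint32_t width() const { return width_; }
    std::uint32_t height() const { return height_; }
    std::size_t usedBytes() const { return sizeof(Pixel3) * width_ * height_; }
    std::size_t capacityBytes() const { return sizeof(Pixel3) * data_.size(); }
    Pixel3* data() { return data_.data(); }

private:
    std::uint32_t maxWidth_, maxHeight_;
    std::uint32_t width_, height_;
    std::vector<Pixel3> data_;
};
```

Any function that receives the pointer would then be passed width() and height() rather than the maximum dimensions.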

In Python it will use the size of the allocated space, because these dimensions are stored inside the cudaImage object. What you could do is check the current size against the desired size each frame, and re-allocate as needed (which hopefully occurs infrequently).
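
The check-then-re-allocate idea is the same in any language; here it is sketched in C++ to match the rest of the thread rather than in Python. ResizableImage and ensureSize are hypothetical names, the buffer is an ordinary std::vector instead of CUDA memory, and 3 bytes per pixel is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Image whose stored dimensions always match its allocation.
struct ResizableImage {
    std::uint32_t width = 0;
    std::uint32_t height = 0;
    std::vector<std::uint8_t> pixels;  // 3 bytes per pixel assumed
    int reallocations = 0;

    // Re-allocates only when the requested size differs from the current one.
    // Returns true if a re-allocation happened.
    bool ensureSize(std::uint32_t newWidth, std::uint32_t newHeight) {
        assert(newWidth > 0 && newHeight > 0);
        if (newWidth == width && newHeight == height)
            return false;  // common case: nothing to do this frame
        pixels.assign(static_cast<std::size_t>(newWidth) * newHeight * 3, 0);
        width = newWidth;
        height = newHeight;
        ++reallocations;
        return true;
    }
};
```

Calling ensureSize at the top of each frame costs only a comparison when the crop size has not changed.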

Thanks. I switched from the Nano to a Xavier NX this weekend, ported my Python code, and have it working. Since I’m allocating memory outside the loop now, it works great. However, when the aspect ratio of the target image differs from my display, I pad the image. To pad, my custom CUDA kernel simply returns early when it is in the padding area of the output image. The problem is that image data left behind from the most recently displayed full image that had content in the padding region continues to show in my padding margin. What’s the best way to set these pixels to a static color in my custom CUDA kernel that would be compatible with all image formats?

You could do a cudaMemset() call before you launch your custom CUDA kernel. You could make this call in your C function that launches your custom CUDA kernel. For example, if you called cudaMemset(ptr, 0, size), it would set all the pixels in the image to black.

Thanks! That worked like a charm. For anyone else reading this, I used sizeof(uchar3) * image_width * image_height for the size argument.
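
For readers who want the size calculation spelled out, here is a small host-only sketch. imageBytes and clearToBlack are hypothetical helpers, Rgb8 and Rgba32F stand in for uchar3 and float4, and std::memset stands in for cudaMemset so it builds without CUDA. Both calls fill byte by byte, so a value of 0 gives black in 8-bit and float formats alike (in 4-channel formats the alpha channel also becomes 0). A padding color whose bytes differ per channel would likely have to be written per pixel by the kernel instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Rgb8    { std::uint8_t x, y, z; };  // stand-in for uchar3
struct Rgba32F { float x, y, z, w; };      // stand-in for float4

// Size in bytes of a packed width x height image of PixelT.
template <typename PixelT>
std::size_t imageBytes(std::uint32_t width, std::uint32_t height) {
    return sizeof(PixelT) * width * height;
}

// Zeroes every byte of the image; on the device this would be
// cudaMemset(ptr, 0, size) issued before the kernel launch.
template <typename PixelT>
void clearToBlack(PixelT* image, std::uint32_t width, std::uint32_t height) {
    assert(image != nullptr);
    std::memset(image, 0, imageBytes<PixelT>(width, height));
}
```

Using the pixel type in the size calculation keeps the call correct when the image format changes, for example from uchar3 to float4.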

By the way, is there a guide to figuring out camera calibration coefficients and unwarping using CUDA? I see cudaWarp.h in the dusty-nv/jetson-utils repo on GitHub (master branch), but I’m trying to figure out how to get the necessary coefficients.

Those use the typical intrinsic camera calibration coefficients that you would get from the OpenCV or MATLAB tools:

Hi dusty_nv

I performed camera calibration using opencv (following OpenCV: Camera Calibration) and obtained the following info. I also got rotation and translation vectors (not posted below)

Camera matrix:
[[1.36518031e+03 0.00000000e+00 9.67644312e+02]
[0.00000000e+00 1.36397781e+03 5.37337754e+02]
[0.00000000e+00 0.00000000e+00 1.00000000e+00]]

These are fx, fy, cx, and cy.

Distortion coefficients:
[[-0.54901265 0.49591819 0.00108101 -0.00166793 -0.35424473]]
These are (k1 k2 p1 p2 k3)

I believe the focal lengths and optical centers are intrinsic and the distortion coefficients are extrinsic.

I see that cudaWarpIntrinsic is in the format:

cudaError_t cudaWarpIntrinsic( uchar4* input, uchar4* output, uint32_t width, uint32_t height,
const float2& focalLength, const float2& principalPoint, const float4& distortion );

Are the principal points the same as the optical centers?

Also, the distortion coefficients are float4, but I have five coefficients. Is there documentation on which ones to use?

Is this the right approach?

Hi @cloud9ine,

(fx, fy) is the focal length vector
(cx, cy) is the principal point vector
(k1, k2, p1, p2) is the distortion float4 vector
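
Putting that mapping into code, the sketch below packs OpenCV calibration output into the three arguments that cudaWarpIntrinsic expects. Float2, Float4, WarpIntrinsics and fromOpenCV are hypothetical names (the first two stand in for CUDA's float2 and float4 so it builds without CUDA). It assumes the usual OpenCV layout: a 3x3 camera matrix with fx, fy on the diagonal and cx, cy in the last column, and distortion coefficients ordered (k1, k2, p1, p2, k3). The fifth coefficient, k3 (the sixth-order radial term in OpenCV's model), has no slot in the float4 and is dropped here. As a side note, in the usual terminology the distortion coefficients count as intrinsic parameters together with the focal length and principal point, while the rotation and translation vectors are the extrinsics.

```cpp
#include <cassert>

struct Float2 { float x, y; };        // stand-in for CUDA's float2
struct Float4 { float x, y, z, w; };  // stand-in for CUDA's float4

// The three calibration arguments taken by cudaWarpIntrinsic.
struct WarpIntrinsics {
    Float2 focalLength;     // (fx, fy)
    Float2 principalPoint;  // (cx, cy)
    Float4 distortion;      // (k1, k2, p1, p2)
};

// Converts an OpenCV camera matrix and 5-element distortion vector.
WarpIntrinsics fromOpenCV(const double (&cameraMatrix)[3][3],
                          const double (&distCoeffs)[5]) {
    assert(cameraMatrix[0][0] > 0.0 && cameraMatrix[1][1] > 0.0);

    WarpIntrinsics out;
    out.focalLength    = { static_cast<float>(cameraMatrix[0][0]),
                           static_cast<float>(cameraMatrix[1][1]) };
    out.principalPoint = { static_cast<float>(cameraMatrix[0][2]),
                           static_cast<float>(cameraMatrix[1][2]) };
    // distCoeffs[4] (k3) is intentionally not copied.
    out.distortion     = { static_cast<float>(distCoeffs[0]),
                           static_cast<float>(distCoeffs[1]),
                           static_cast<float>(distCoeffs[2]),
                           static_cast<float>(distCoeffs[3]) };
    return out;
}
```

With the values posted above, this would give a focal length of roughly (1365.18, 1363.98), a principal point of roughly (967.64, 537.34), and a distortion vector of (-0.549, 0.496, 0.00108, -0.00167).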