Design and architecture guidance

I used time() to do a rough profile of parts of my code. I tried two variations:

  1. Complete image processing using CUDA, then transfer to OpenCV for display. This is what a typical loop looks like (I am only doing inference on one out of every 20 frames, so the first portion would be a bit longer on that one frame):

    It took 0.000151872634888 seconds for everything until image manipulation using cuda.
    It took 5.72204589844e-05 seconds for image manipulation using cuda.
    It took 0.0481369495392 seconds to transfer image to opencv.
    It took 0.0021071434021 seconds to render the image.

  2. Complete image processing using CUDA, then render directly using videoOutput. A typical loop then looks like:

    It took 0.000125885009766 seconds for everything until image manipulation using cuda.
    It took 5.91278076172e-05 seconds for image manipulation using cuda.
    It took 0.0488979816437 seconds to render the image.

So, OpenCV seems to be able to render faster, but transferring from CUDA to OpenCV (CUDA to NumPy array plus colorspace conversion) takes about as long as glDisplay takes to render the image. Is there any way I can bring the rendering time down by half?

@cloud9ine, without doing further profiling inside glDisplay/glTexture, it would be hard to determine if it could be made faster. For display://* outputs, videoOutput uses glDisplay (glDisplay is a subclass of videoOutput). When rendering textures, it uses CUDA<->OpenGL interoperability.

Can you tell if the rendering time is constant or does it vary with the resolution of the image?

Yes, I just checked. It’s taking twice as long to render a 1920x1080 image as it does to render a 1280x720 image.

Is there any alternative? For instance, if I build OpenCV with CUDA support, would we be able to eliminate or speed up the data transfer from CUDA to OpenCV?

I don't believe the OpenCV GPU module has Python bindings, so the transfer would be the same. After some profiling of glDisplay, it seems that this slow-down is only seen on the Nano and not the other Jetsons, so I'm not sure if it's a deficiency in my CUDA<->OpenGL code or if the smaller Nano GPU is simply already fully utilized.

It looks like at least some functions have Python bindings now:

Although I can't seem to locate any info on which CUDA functions have Python bindings and which don't. I guess I will try calling them and find out.

Could this be another way to go? I know it is C++; that is okay.

It looks like it is able to use OpenGL support within OpenCV for rendering.

That zero-copy shared memory is what jetson-inference is already using. It also needs to get the image into OpenGL for rendering to the display, and that is when it uses CUDA<->OpenGL interoperability.

So if you use jetson-inference in C++, then you have the direct memory pointer to the CUDA data (which is also mapped into CPU memory) and can probably get it into OpenCV more easily. It could be either a CPU cv::Mat or a GpuMat, because the jetson-inference pointers are mapped to both CUDA and CPU memory (since they are allocated as shared zero-copy memory).
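
As a rough sketch of that in C++ (assuming an RGB uchar3 stream from videoSource and OpenCV built with the CUDA module for the GpuMat part; the source URI is just an example), the mapped pointer can be wrapped without any copy:

    // Minimal sketch: wrap a zero-copy frame from videoSource in OpenCV containers
    // without copying. Assumes an RGB uchar3 stream and OpenCV built with CUDA.
    #include "videoSource.h"

    #include <opencv2/core.hpp>
    #include <opencv2/core/cuda.hpp>

    int main()
    {
        videoSource* input = videoSource::Create("csi://0");   // example source URI

        if( !input || !input->Open() )
            return 1;

        uchar3* frame = NULL;

        if( input->Capture(&frame, 1000) )   // 1000 ms timeout
        {
            const int width  = input->GetWidth();
            const int height = input->GetHeight();

            // CPU view - works because cudaAllocMapped memory is also mapped into CPU space
            cv::Mat cpuView(height, width, CV_8UC3, (void*)frame);

            // GPU view - the very same pointer is valid device memory on Jetson
            cv::cuda::GpuMat gpuView(height, width, CV_8UC3, (void*)frame);

            // ... hand cpuView / gpuView to OpenCV here; no cudaMemcpy was needed ...
        }

        delete input;
        return 0;
    }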

detect-test.cpp (6.1 KB)

Hi @dusty_nv, as a start to porting my code to C++, I started with detectnet.cpp in the examples and bastardized it a bit. However, it is really unstable and unable to even relay a video stream straight. A lot of the time it gets stuck printing "failed to capture video frame" repeatedly. Even when it somewhat works, it hangs, stops updating the output stream, and pretty much bogs down the Nano (I have the 2GB version). On the other hand, my Python code is able to take in the 1920x1080 stream, crop it, resize it, flip it, and pad it at 15 fps. I feel like something is very wrong in my flow here - could you please tell me if you see something out of place?

Hi @cloud9ine, looking over your C++ code, you are allocating new data each frame (with cudaAllocMapped) and never freeing it, so it is causing a memory leak.

Move your calls to cudaAllocMapped to above the main while() loop, so that the memory is only allocated once at initialization time, and that should help.
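
As an illustrative sketch of that change (buffer names and the 1920x1080 maximum are examples, not taken from the attached file):

    // Sketch of the fix: allocate working buffers once before the capture loop,
    // instead of calling cudaAllocMapped() inside it on every frame.
    #include "cudaMappedMemory.h"

    #include <cstdio>

    int main()
    {
        uchar3* croppedImg = NULL;
        uchar3* resizedImg = NULL;

        const size_t maxWidth  = 1920;
        const size_t maxHeight = 1080;

        // one-time allocation of zero-copy (CPU/GPU shared) buffers
        if( !cudaAllocMapped(&croppedImg, maxWidth * maxHeight * sizeof(uchar3)) ||
            !cudaAllocMapped(&resizedImg, maxWidth * maxHeight * sizeof(uchar3)) )
        {
            printf("failed to allocate CUDA memory\n");
            return 1;
        }

        bool streaming = true;

        while( streaming )
        {
            // capture a frame, crop into croppedImg, resize into resizedImg, render...
            // no cudaAllocMapped() calls happen in here anymore
            streaming = false;   // placeholder so this sketch terminates
        }

        // release once, on shutdown (mapped memory is freed with cudaFreeHost)
        cudaFreeHost(croppedImg);
        cudaFreeHost(resizedImg);
        return 0;
    }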

Got it. One question: if we have an image that we don't know the size of, but we know the maximum size possible, is there a way to allocate the maximum size but have the image be smaller than the space allocated?

For example, if I'm cropping my 1920x1080 frame to a detected area every frame, is there a way to preallocate the full 1920x1080 size to that pointer and use variables to keep track of the actual image size within that pointer?

Or should I allocate that one on every loop and use cudaFree to release it after every use?

Related question in Python as well. If I allocate the space outside the loop and use, say, cudaResize, or img.width and img.height, will it use the actual size of the image data rather than the size of the allocated space?

In C++ you can do that: just allocate the image at the max size, then keep track of the image resolution you are currently using with variables. I would not allocate/free every loop, because allocating CUDA memory takes time. If the image size is relatively static, each frame you could check the image size and re-allocate only if needed. But just allocating the max size once is easy.
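
For the C++ case, a short sketch of that pattern (names are illustrative; the 1920x1080 maximum matches the stream discussed above):

    // Sketch of "allocate the maximum once, track the size actually in use".
    #include "cudaMappedMemory.h"

    struct WorkBuffer
    {
        uchar3* data;        // allocated once, at the maximum size
        size_t  maxWidth;
        size_t  maxHeight;
        size_t  width;       // resolution actually in use this frame
        size_t  height;
    };

    // one-time allocation at initialization
    bool initBuffer( WorkBuffer& buf, size_t maxWidth, size_t maxHeight )
    {
        buf.data      = NULL;
        buf.maxWidth  = maxWidth;
        buf.maxHeight = maxHeight;
        buf.width     = maxWidth;
        buf.height    = maxHeight;

        return cudaAllocMapped(&buf.data, maxWidth * maxHeight * sizeof(uchar3));
    }

    // each frame: record the crop size being used, clamped to the allocation
    void setActiveSize( WorkBuffer& buf, size_t width, size_t height )
    {
        buf.width  = (width  <= buf.maxWidth)  ? width  : buf.maxWidth;
        buf.height = (height <= buf.maxHeight) ? height : buf.maxHeight;

        // kernels are then launched with buf.width x buf.height, while buf.data
        // stays the same 1920x1080-sized allocation for the whole run
    }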

In Python it will use the size of the allocated space, because these dimensions are stored inside the cudaImage object. What you could do is check the current size against the desired size each frame, and re-allocate as needed (hopefully that occurs infrequently).

Thanks. I switched from the Nano to a Xavier NX this weekend, ported my Python code, and have it working. Since I'm allocating memory outside the loop now, it works great. However, when the aspect ratio of the target image differs from my display, I'm padding the image. To pad, my custom CUDA kernel simply returns early when it's in the padding area of the output image. The problem is that the image data left behind in that space by the most recent full-frame image that had content in the padding region continues to display in my padding margin. What's the best way to set those pixels to a static color in my custom CUDA kernel in a way that is compatible with all image formats?

You could do a cudaMemset() call in the C/C++ function that launches your custom CUDA kernel, before the kernel launch. For example, cudaMemset(ptr, 0, size) would set all the pixels in the image to black.

Thanks! That worked like a charm. For anyone else reading this, I used sizeof(uchar3) * image_width * image_height for the size argument.
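
Putting the two posts together, my launcher function ended up looking roughly like this (a sketch; launchPadKernel() is just a stand-in name for the custom kernel):

    // Sketch: clear the output image to black before launching the padding kernel,
    // so stale pixels from earlier frames never show up in the padded margins.
    #include <cuda_runtime.h>

    // launchPadKernel() stands in for the custom crop/resize/pad CUDA kernel,
    // which simply returns early for pixels inside the padding region.
    cudaError_t launchPadKernel( uchar3* output, size_t width, size_t height );

    cudaError_t padAndRender( uchar3* output, size_t width, size_t height )
    {
        // set every byte of the uchar3 image to 0 => solid black padding
        const cudaError_t err = cudaMemset(output, 0, sizeof(uchar3) * width * height);

        if( err != cudaSuccess )
            return err;

        return launchPadKernel(output, width, height);
    }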

By the way, is there a guide to figuring out camera calibration coefficients and unwarping using CUDA? I see jetson-utils/cudaWarp.h at master · dusty-nv/jetson-utils · GitHub, but I'm trying to figure out how to get the necessary coefficients.

Those use the typical intrinsic camera calibration coefficients that you would get from the OpenCV or MATLAB tools:

Hi dusty_nv

I performed camera calibration using OpenCV (following the OpenCV: Camera Calibration tutorial) and obtained the following info. I also got rotation and translation vectors (not posted below).

Camera matrix:
[[1.36518031e+03 0.00000000e+00 9.67644312e+02]
[0.00000000e+00 1.36397781e+03 5.37337754e+02]
[0.00000000e+00 0.00000000e+00 1.00000000e+00]]

These are fx, fy, cx, and cy.

Distortion coefficients:
[[-0.54901265 0.49591819 0.00108101 -0.00166793 -0.35424473]]
These are (k1, k2, p1, p2, k3).

I believe the focal lengths, optical centers, and distortion coefficients are the intrinsic parameters, while the rotation and translation vectors are extrinsic.

I see that cudaWarpIntrinsic has the following signature:

    cudaError_t cudaWarpIntrinsic( uchar4* input, uchar4* output, uint32_t width, uint32_t height,
                                   const float2& focalLength, const float2& principalPoint, const float4& distortion );

Are the principal points the same as the optical centers?

Also, the distortion parameter is a float4, but I have five coefficients. Is there documentation on which ones to use?

Is this the right approach?

Hi @cloud9ine,

(fx, fy) is the focal length vector
(cx, cy) is the principal point vector
(k1, k2, p1, p2) is the distortion float4 vector
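
Plugged into the prototype above with the calibration values you posted, the call would look roughly like this (a sketch assuming uchar4 images; the fifth OpenCV coefficient, k3, is simply dropped since the float4 only holds four):

    // Sketch: undistort a uchar4 frame using the OpenCV calibration posted above.
    // The constants are copied from the camera matrix / distortion coefficients
    // earlier in this thread.
    #include "cudaWarp.h"

    cudaError_t undistortFrame( uchar4* input, uchar4* output, uint32_t width, uint32_t height )
    {
        const float2 focalLength    = make_float2(1365.18031f, 1363.97781f);   // fx, fy
        const float2 principalPoint = make_float2(967.644312f, 537.337754f);   // cx, cy

        // (k1, k2, p1, p2) - the fifth OpenCV coefficient (k3) is not used here
        const float4 distortion = make_float4(-0.54901265f, 0.49591819f, 0.00108101f, -0.00166793f);

        return cudaWarpIntrinsic(input, output, width, height,
                                 focalLength, principalPoint, distortion);
    }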

Thanks!

I tried this, but it looks like there isn't a prototype for this function that takes uchar3* for the images; it's looking for uchar4* or float4*. The image I get from videoSource seems to be uchar3*. Is there a way to use this function with the input and output being uchar3*? Or is there a way to convert the uchar3* image to uchar4* or float4*?

You can use the cudaConvertColor() function, or just change the type of the pointer you pass to videoSource::Capture(). For example, change this line from uchar3* to uchar4* (or float4*):

https://github.com/dusty-nv/jetson-utils/blob/c373f49cf21ad2cae7e4d7da7c41f4fd6473958f/video/video-viewer/video-viewer.cpp#L105
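
As a sketch of both options (the converted-image buffer is assumed to already be allocated, e.g. with cudaAllocMapped()):

    // Sketch: two ways to end up with a uchar4 image for cudaWarpIntrinsic().
    #include "videoSource.h"
    #include "cudaColorspace.h"

    // Option 1: capture as uchar4 directly - videoSource::Capture() is templated
    // on the pixel type, so changing the pointer type changes the capture format.
    bool captureRGBA( videoSource* input, uchar4** image )
    {
        return input->Capture(image, 1000);   // 1000 ms timeout
    }

    // Option 2: keep the existing uchar3 capture and convert it explicitly.
    // rgbaBuffer is assumed to be pre-allocated at width * height * sizeof(uchar4).
    bool convertToRGBA( uchar3* rgbImage, uchar4* rgbaBuffer, size_t width, size_t height )
    {
        return cudaConvertColor(rgbImage, IMAGE_RGB8,
                                rgbaBuffer, IMAGE_RGBA8,
                                width, height) == cudaSuccess;
    }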