Correct camera distortion for detectnet-camera


I have put together a pipeline using one of the great examples available at

I changed the gstreamer pipeline to use an ethernet camera, and I have observed that the images from my camera show some distortion.

I generated the calibration parameters using the OpenCV calibration function ‘calibrateCamera()’, but I need some hints on how to correct, in real time, the feed that goes into the detectnet-camera application. The pipeline reads images into a buffer, whereas the ‘undistort()’ function in OpenCV operates on cv::Mat.

Has anyone faced this scenario before and found an efficient way to resolve it without much of a drop in FPS? Any suggestions are really appreciated.


Does the FPS drop without detectnet?

Hi Shane,

Thanks for the reply. I don’t understand your question. Could you please elaborate? Thanks again!

Hi bhargavK, I think Shane was suggesting that you isolate the new undistort() function from detectNet so that you can measure the performance impact of the camera pre-processing on its own. While you are working on adding this, you may want to get it working in the gst-camera test program first. It is similar in structure to detectnet-camera, but without the detectNet, so you can confirm the de-warping works as expected and gauge any performance impact.
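One way to gauge the cost of the pre-processing stage on its own is to time it over many frames. The following is only a minimal sketch in plain C++; the function names are hypothetical and not part of gst-camera or detectnet-camera, and the stage being timed is passed in as a callable.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock time per call of `stage`, in milliseconds.
// `stage` is a placeholder for whatever per-frame work is being measured.
double averageMilliseconds(const std::function<void()>& stage, int iterations)
{
    assert(iterations > 0);

    stage();  // one warm-up call, excluded from the measurement

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        stage();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}

// Converts a per-frame time into frames per second (0 if the time is not positive).
double framesPerSecond(double msPerFrame)
{
    return msPerFrame > 0.0 ? 1000.0 / msPerFrame : 0.0;
}
```

If the stage launches CUDA kernels, the callable would need to wait for the GPU (for example with cudaDeviceSynchronize()) before returning, since kernel launches are asynchronous and the timer would otherwise only measure the launch overhead.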

If you find that the cv undistort() function does not perform adequately, VisionWorks and the NVIDIA NPP library have optimized warping functions.

If you can express distortion as the inverse of an affine matrix, these libraries may be useful.

Here’s also a CUDA kernel that does intrinsic warping (in the style of the MATLAB camera calibration toolbox):

__global__ void gpuIntrinsicWarp( uchar4* input, uchar4* output, int alignedWidth, int width, int height,
                                  float2 focalLength, float2 principalPoint, float k1, float k2, float p1, float p2)
{
	// output (undistorted) pixel handled by this thread
	const int2 uv_out = make_int2(blockDim.x * blockIdx.x + threadIdx.x,
				               blockDim.y * blockIdx.y + threadIdx.y);

	if( uv_out.x >= width || uv_out.y >= height )
		return;

	const float u = uv_out.x;
	const float v = uv_out.y;

	// normalized image coordinates
	const float _fx = 1.0f / focalLength.x;
	const float _fy = 1.0f / focalLength.y;

	const float y      = (v - principalPoint.y)*_fy;
	const float y2     = y*y;
	const float _2p1y  = 2.0f*p1*y;
	const float _3p1y2 = 3.0f*p1*y2;
	const float p2y2   = p2*y2;

	const float x  = (u - principalPoint.x)*_fx;
	const float x2 = x*x;
	const float r2 = x2 + y2;

	// radial (k1, k2) and tangential (p1, p2) distortion
	const float d  = 1.0f + (k1 + k2*r2)*r2;
	const float _u = focalLength.x*(x*(d + _2p1y) + p2y2 + (3.0f*p2)*x2) + principalPoint.x;
	const float _v = focalLength.y*(y*(d + (2.0f*p2)*x) + _3p1y2 + p1*x2) + principalPoint.y;

	// source pixel in the distorted input image
	const int2 uv_in = make_int2( _u, _v );

	if( uv_in.x >= width || uv_in.y >= height || uv_in.x < 0 || uv_in.y < 0 )
		return;

//	printf("%i %i => %i %i\n", uv_out.x, uv_out.y, uv_in.x, uv_in.y);

	output[uv_out.y * alignedWidth + uv_out.x] = input[uv_in.y * alignedWidth + uv_in.x];
}

cudaError_t cudaWarpIntrinsic( const float2& focalLength, const float2& principalPoint, const float4& distortion,
							   uchar4* input, uchar4* output, const dim3& size )
{
	if( !input || !output )
		return cudaErrorInvalidDevicePointer;

	if( size.x == 0 || size.y == 0 || size.z == 0 )
		return cudaErrorInvalidValue;

	// size.z is the row size in bytes; convert it to a width in pixels
	const int alignedWidth = size.z / sizeof(uchar4);

	// launch kernel
	const dim3 blockDim(16, 16);
	const dim3 gridDim(iDivUp(size.x,blockDim.x), iDivUp(size.y,blockDim.y));

	gpuIntrinsicWarp<<<gridDim, blockDim>>>(input, output, alignedWidth, size.x, size.y,
									focalLength, principalPoint,
									distortion.x, distortion.y, distortion.z, distortion.w);

	return CUDA(cudaGetLastError());
}

There seems to be overlap with the cv calibration parameters, so perhaps you can still use that for calibration.
In this kernel, the first four parameters from the MATLAB distortion coefficients are used (k1, k2, p1, p2).
See here about the parameters
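To check the kernel output against calibration values, it can help to have a CPU reference of the same per-pixel mapping. The sketch below reproduces the arithmetic of gpuIntrinsicWarp for a single point in plain C++. The struct and function names are made up for illustration, and it assumes the coefficients are supplied in the order k1, k2, p1, p2.

```cpp
#include <cassert>

// Hypothetical container for the calibration values (names are illustrative).
struct Intrinsics {
    float fx, fy;   // focal length in pixels
    float cx, cy;   // principal point in pixels
    float k1, k2;   // radial distortion coefficients
    float p1, p2;   // tangential distortion coefficients
};

struct Pixel { float u, v; };

// For an output (undistorted) pixel (u, v), returns the location in the
// distorted input image that should be sampled, following the same formula
// as the gpuIntrinsicWarp kernel.
Pixel distortPoint(const Intrinsics& c, float u, float v)
{
    assert(c.fx != 0.0f && c.fy != 0.0f);

    // normalized image coordinates
    const float x  = (u - c.cx) / c.fx;
    const float y  = (v - c.cy) / c.fy;
    const float x2 = x * x;
    const float y2 = y * y;
    const float r2 = x2 + y2;

    // radial factor: 1 + k1*r^2 + k2*r^4
    const float d = 1.0f + (c.k1 + c.k2 * r2) * r2;

    // radial plus tangential terms
    const float xd = x * (d + 2.0f * c.p1 * y) + c.p2 * y2 + 3.0f * c.p2 * x2;
    const float yd = y * (d + 2.0f * c.p2 * x) + 3.0f * c.p1 * y2 + c.p1 * x2;

    return Pixel{ c.fx * xd + c.cx, c.fy * yd + c.cy };
}
```

With all four coefficients set to zero the function returns the input point (up to rounding), and the principal point always maps to itself, which gives two quick sanity checks before comparing against the GPU result.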

Thanks, Dustin, for the clarification on Shane’s comment and suggestions. I will test it on gst-camera.cpp as advised.

I have also found that we can use the OpenCV ‘remap()’ function, which has a CUDA version. So I will test both approaches and use whichever performs better.
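The idea behind the remap approach is that the calibration is fixed, so the source location for every output pixel can be computed once and reused for each frame. The sketch below illustrates that idea in plain C++ with nearest-pixel sampling. It is not the OpenCV API; the function names and the packed 32-bit pixel type are assumptions made for the example, and the per-pixel mapping is passed in as a callable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Maps an output pixel (u, v) to a source location (su, sv).
using PointMap = std::function<void(float u, float v, float& su, float& sv)>;

// Built once per calibration: for every output pixel, the linear index of the
// source pixel to copy, or -1 when the source lies outside the image.
std::vector<int> buildRemapTable(int width, int height, const PointMap& mapPoint)
{
    assert(width > 0 && height > 0);
    std::vector<int> table(static_cast<std::size_t>(width) * height, -1);

    for (int v = 0; v < height; ++v) {
        for (int u = 0; u < width; ++u) {
            float su = 0.0f, sv = 0.0f;
            mapPoint(static_cast<float>(u), static_cast<float>(v), su, sv);

            // bounds check on the float values, then truncate to a pixel
            if (su >= 0.0f && sv >= 0.0f && su < width && sv < height)
                table[v * width + u] = static_cast<int>(sv) * width + static_cast<int>(su);
        }
    }
    return table;
}

// Run per frame: a plain table lookup, with unmapped pixels set to 0.
void applyRemap(const std::vector<int>& table,
                const std::vector<std::uint32_t>& input,
                std::vector<std::uint32_t>& output)
{
    assert(table.size() == input.size());
    output.assign(input.size(), 0);

    for (std::size_t i = 0; i < table.size(); ++i)
        if (table[i] >= 0)
            output[i] = input[static_cast<std::size_t>(table[i])];
}
```

In OpenCV itself, the equivalent split is usually initUndistortRectifyMap() once, followed by remap() (or cv::cuda::remap()) on each frame.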


I got back to this today and simply copied these functions into a .cu file in ‘/util/cuda’ alongside the other functions. Then I called the function from gst-camera.

However, while compiling, I get the following linking error.

[100%] Linking CXX executable ../../../aarch64/bin/gst-camera
CMakeFiles/gst-camera.dir/gst-camera.cpp.o: In function `main':
gst-camera.cpp:(.text+0x80c): undefined reference to `cudaWarpIntrinsic(float2 const&, float2 const&, float3 const&, float2 const&, float4*, float4*, dim3)'
collect2: error: ld returned 1 exit status
util/camera/gst-camera/CMakeFiles/gst-camera.dir/build.make:152: recipe for target 'aarch64/bin/gst-camera' failed
make[2]: *** [aarch64/bin/gst-camera] Error 1
CMakeFiles/Makefile2:287: recipe for target 'util/camera/gst-camera/CMakeFiles/gst-camera.dir/all' failed
make[1]: *** [util/video/gst-camera/CMakeFiles/gst-camera.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

I wonder why it would not link. I tried a few suggestions from stackoverflow in the CMakeLists.txt, like using ‘cuda_add_executable’ and adding ‘${CUDA_LIBRARIES}’ to the ‘target_link_libraries’ call, but I end up with the same error.

I am probably missing something in the CMake file, but I can’t pinpoint what! Suggestions appreciated.

Never mind, solved it. I can compile it perfectly now.



I have a few follow up questions regarding the image format in the example kernel.

  1. Would a similar function work on float4* images instead of uchar4*?
  2. What exactly does the ‘dim3 size’ represent here? The first and second dimensions are the width and height of the image, but what is the third one? The number of channels? I have set the third dimension to 3, but it doesn’t seem to work.


1. It is supposed to.

2. It’s the aligned width.
Width means the real horizontal data.
Aligned width means the horizontal buffer size.

Usually, aligned width = width + padding
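Note that the host function above computes alignedWidth as size.z / sizeof(uchar4), so size.z appears to be the row size in bytes rather than a channel count. Passing 3 gives 3 / 4 = 0 in integer division, which would explain why that setting does not work. The sketch below shows the relationship in plain C++. The Rgba8 struct is only a stand-in for uchar4 so the example compiles without the CUDA headers, the function names are hypothetical, and the alignment value is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for CUDA's uchar4 (four 8-bit channels per pixel).
struct Rgba8 { std::uint8_t x, y, z, w; };

// Row size in bytes, rounded up to a multiple of `alignment` bytes.
std::size_t pitchBytes(std::size_t width, std::size_t alignment)
{
    assert(alignment > 0);
    const std::size_t rowBytes = width * sizeof(Rgba8);
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Row size in pixels, the same conversion as size.z / sizeof(uchar4).
std::size_t alignedWidthPixels(std::size_t pitch)
{
    return pitch / sizeof(Rgba8);
}

// Linear index of pixel (x, y) in a buffer whose rows are alignedWidth pixels long.
std::size_t pixelIndex(std::size_t x, std::size_t y, std::size_t alignedWidth)
{
    return y * alignedWidth + x;
}
```

For a buffer without padding, this means the third dimension would simply be width * sizeof(uchar4). If the kernel is adapted to float4 images, the divisor would presumably need to become sizeof(float4) as well.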


Did you ever end up comparing the performance of these? Curious to know what you found…

Also, how feasible would it be to use cudaWarpIntrinsic from within the gstreamer pipeline (I guess as a custom element)?


Hi BhargavK. If you don’t mind, could you please share how you incorporated


into the gstreamer pipeline?

I am building a C++ application, in which the pipeline looks like this:

gst-launch-1.0 nvarguscamerasrc maxperf=1 ! 'video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080,
format=(string)NV12, framerate=(fraction)30/1' ! distortioncorrection ! nvv4l2h265enc control-rate=1
bitrate=8000000 ! h265parse ! qtmux ! filesink location=test.mp4 -e

Here I need a distortioncorrection element that can take the coefficients Fx, Fy, Cx, Cy, K1, K2, P1, P2 and K3.
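The kernel posted earlier in this thread only uses k1, k2, p1 and p2, so a K3 term would have to be added to its radial factor. As a rough sketch, assuming the values come from OpenCV’s calibrateCamera() (a row-major 3x3 camera matrix and distortion coefficients in the order k1, k2, p1, p2, k3), they could be unpacked as follows. The struct and function names are hypothetical, and this is not a gstreamer element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter set for a distortion-correction stage.
struct LensModel {
    float fx, fy, cx, cy;   // from the camera matrix
    float k1, k2, p1, p2;   // first four distortion coefficients
    float k3;               // optional third radial coefficient
};

// cameraMatrix: 9 values, row-major [fx 0 cx; 0 fy cy; 0 0 1].
// distCoeffs:   k1, k2, p1, p2 and optionally k3 (assumed OpenCV order).
LensModel makeLensModel(const std::vector<float>& cameraMatrix,
                        const std::vector<float>& distCoeffs)
{
    assert(cameraMatrix.size() == 9);
    assert(distCoeffs.size() >= 4);

    LensModel m{};
    m.fx = cameraMatrix[0];
    m.cx = cameraMatrix[2];
    m.fy = cameraMatrix[4];
    m.cy = cameraMatrix[5];
    m.k1 = distCoeffs[0];
    m.k2 = distCoeffs[1];
    m.p1 = distCoeffs[2];
    m.p2 = distCoeffs[3];
    m.k3 = distCoeffs.size() > 4 ? distCoeffs[4] : 0.0f;  // treated as 0 when absent
    return m;
}

// Radial factor 1 + k1*r^2 + k2*r^4 + k3*r^6, where r2 is the squared radius
// in normalized image coordinates.
float radialFactor(const LensModel& m, float r2)
{
    return 1.0f + ((m.k3 * r2 + m.k2) * r2 + m.k1) * r2;
}
```

With k3 set to zero, radialFactor() reduces to the expression for d used in the kernel.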