parsing Stereo Disparity output (nvmedia_imageofst) as a disparity map

Hi Nvidia,

Follow up on the post asking for help in computing stereo disparity:

https://devtalk.nvidia.com/default/topic/1070229/general/how-to-correctly-compute-stereo-disparity-nvmedia_imageofst-/

How to correctly parse the result as a disparity map?

The result from the nvmedia_imageofst sample application for computing the disparity is of int16 type, and it looks like a vector of all non-positive numbers.

What is the correct way to convert the result into a disparity map array?

Should I negate the negative numbers? Re-normalize them in some way?

Thanks.

Hi shengliang.xu,

Doesn’t the result in https://devtalk.nvidia.com/default/topic/1070229/general/how-to-correctly-compute-stereo-disparity-nvmedia_imageofst-/post/5423217/#5423217 already look good?

Hi Vick,

I need the disparity map in units of pixels. The result in the other post cannot be in the correct unit, given that all the numbers are non-positive.

The ground truth of the tsukuba sample stereo pair is:

The result from nvmedia_imageofst has the correct segmentation, but the disparity values (relative grayscale) are not correct:

Is this just a matter of color mapping of imshow? Could you check https://matplotlib.org/api/_as_gen/matplotlib.pyplot.imshow.html to reverse it? Thanks!

Hi Vick,

It’s not a coloring issue.

Each value d of a disparity map with respect to the left image at (x, y) is the offset, in pixels, of the same point of interest from the left image (at (x, y)) to the right image (at (x, y+d)). So I need the numbers in units of pixels under this semantics.

I’ve searched a bit more on Google and find this piece of code/comment

/**
* \struct NV_OF_STEREO_DISPARITY
* Struct needed for stereo /disparity. ::NV_OF_OUTPUT_EXECUTE_PARAMS::outputBuffer will be populated
* with stereo disparity in ::NV_OF_STEREO_DISPARITY format for each ::NV_OF_INIT_PARAMS::outGridSize.
* Stereo disparity is a 16-bit value with the lowest 5 bits holding fractional value,
* followed by a 11-bit unsigned integer value.
*/
typedef struct _NV_OF_STEREO_DISPARITY
{
    uint16_t                        disparity;    /**< Horizontal displacement[in pixels] in 11.5 format. */
} NV_OF_STEREO_DISPARITY;

in this NVIDIAOpticalFlowSDK github repo:

https://github.com/NVIDIA/NVIDIAOpticalFlowSDK/blob/master/nvOpticalFlowCommon.h

It looks like a reasonable way of encoding the disparity in 16 bits, but after a simple experiment, I find that parsing the result from the DRIVE OFST API using this encoding does not give correct values.
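For reference, the 11.5 layout described in that header comment can be unpacked as below. This is only a minimal sketch: `decode_11_5` is a hypothetical helper name, and it assumes an unsigned 16-bit input as in the SDK struct, which may not match the DRIVE OFST output directly.

```python
def decode_11_5(value):
    """Split an unsigned 16-bit 11.5 fixed-point value into its parts.

    Hypothetical helper: the layout follows the NV_OF_STEREO_DISPARITY
    comment (low 5 bits fractional, high 11 bits integer).
    """
    if not 0 <= value <= 0xFFFF:
        raise ValueError("expected an unsigned 16-bit value")
    integer_part = value >> 5      # top 11 bits
    fraction_bits = value & 0x1F   # low 5 bits, in units of 1/32 pixel
    pixels = value / 32.0          # the same value as a float
    return integer_part, fraction_bits, pixels


# 0x0150 == 336 == 10 * 32 + 16, i.e. 10.5 pixels
print(decode_11_5(0x0150))
```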

I’m quite confused.

I guess the values are all negative because our implementation always uses the right image as the reference frame (see NvMediaIOFSTProcessFrame). I’ll check further and get back to you here.

Hi Vick,

Any update on this issue?

Thanks.

Hi shengliang.xu,

Yes, in the current version the stereo disparity values are all negative. Each value is basically a motion vector in the left direction, so you can take the absolute value of the output to get the disparity.

We will try to fix this issue. Thanks!

Thank you Vick.

So then the two steps to get the disparity are

  1. negate the values
  2. shift each value by 5 bits to get the top 11 bits of the value

right?

But I find the values are much smaller than what I get using opencv.

So my guess is that the resulting disparity values are scaled down in some way.

The result disparity array is scaled down from the input images by 4 (original image WxH -> result array W/4 x H/4). Are the disparity values in the result array computed at the W/4 x H/4 scale (i.e. block offsets instead of pixel offsets), so that we should scale the result values up by some factor, 4 or 16, to get pixel offsets?

Yes, getting absolute value and shifting right by 5 should give disparity in pixels.
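A minimal NumPy sketch of those two steps, using made-up values rather than real IOFST output. Widening to int32 before taking the absolute value is my own precaution, since abs(-32768) does not fit in int16.

```python
import numpy as np

# Made-up raw values standing in for the int16 IOFST output (all non-positive).
raw = np.array([0, -32, -48, -336, -32768], dtype=np.int16)

# Widen first: abs(-32768) overflows int16.
magnitude = np.abs(raw.astype(np.int32))

whole_pixels = magnitude >> 5   # integer part only, fractional bits dropped
sub_pixels = magnitude / 32.0   # keeps the 5 fractional bits

print(whole_pixels.tolist())  # [0, 1, 1, 10, 1024]
print(sub_pixels.tolist())    # [0.0, 1.0, 1.5, 10.5, 1024.0]
```

Note that the shift discards sub-pixel precision, while dividing by 32.0 keeps it.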

How did you compare to opencv? StereoBM does subpixel refinement. Could you take a look and check if opencv output is scaled? Thanks!

Previously I was using pyplot to draw the grayscale disparity images. That was not correct, because pyplot seems to normalize the values by itself. Now I have saved the raw disparity numbers as images. Here are the results:

The ground truth:

The opencv result:

The xavier drive result by

  1. negate the values
  2. shift each value by 5 bits to get the top 11 bits of the value
  3. resize the result to the size of the original image :

As you can see the values are apparently not on the correct scale.
tsukuba_groundtruth.png
tsukuba.disparity.png
tsukuba.disparity.opencv.png

Thank you for the information!

Please provide all the details about how to get the ground truth, how to generate the disparity map from opencv (with arguments), and how to generate it from nvmedia_imageofst (I presume using the same command as at https://devtalk.nvidia.com/default/topic/1070229/general/how-to-correctly-compute-stereo-disparity-nvmedia_imageofst-/post/5423217/#5423217, right?).

ground truth, it’s available on the middlebury stereo dataset website:

http://vision.middlebury.edu/stereo/eval/newEval/tsukuba/groundtruth.html

left input image:
http://vision.middlebury.edu/stereo/eval/newEval/tsukuba/im3.png
right input image:
http://vision.middlebury.edu/stereo/eval/newEval/tsukuba/im4.png

The opencv code:

import numpy as np
import cv2

imgL = cv2.imread('tsukuba_left.png', cv2.IMREAD_GRAYSCALE)
imgR = cv2.imread('tsukuba_right.png', cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=16, blockSize=15)
disparity = stereo.compute(imgL, imgR)

cv2.imwrite('tsukuba.disparity.opencv.png', disparity)

nvidia xavier disparity generation is using the command at the other post you’ve linked. The post processing python code is:

import cv2
import numpy as np

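rows = 72
cols = 96

# Read the raw IOFST output as signed 16-bit values.
with open('tsukuba.disparity', 'rb') as fd:
    f = np.fromfile(fd, dtype=np.int16, count=rows*cols)

# Negate and drop the 5 fractional bits (integer part only).
# Widen to int32 first so that negating -32768 cannot overflow.
f = (-f.astype(np.int32) // 32).astype(np.int16)

disparity = f.reshape((rows, cols))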

resized = cv2.resize(disparity, (cols * 4, rows * 4), interpolation = cv2.INTER_AREA)

cv2.imwrite('tsukuba.disparity.nvidia.png', resized)
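As an alternative to the resize step, each value can simply be repeated over its 4x4 block, which avoids blending neighbouring disparities. This is a sketch with NumPy only; `upsample_blocks` is a hypothetical helper, and it assumes each output value covers one 4x4 block of the input image.

```python
import numpy as np


def upsample_blocks(disparity, factor=4):
    """Repeat each value over a factor x factor block (nearest neighbour)."""
    return np.repeat(np.repeat(disparity, factor, axis=0), factor, axis=1)


# Tiny made-up 2x3 map, upsampled to 8x12.
small = np.arange(6, dtype=np.float32).reshape(2, 3)
full = upsample_blocks(small)
print(full.shape)  # (8, 12)
```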

Thanks! We will check internally and get back to you here.

Hi Vick,

I found the problem. Sorry, my bad.

OpenCV StereoBM also outputs a fixed-point format, in which the lower 4 bits are fractional. So the integral part of the result from the Nvidia DRIVE Xavier indeed needs to be scaled up by 16 to be comparable with the OpenCV result (or better, the original result scaled down by 2). The ground truth seems to have the same binary format as OpenCV StereoBM. This confused me.
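To make the two rescaling options concrete, here is a small sketch with made-up values (not taken from the actual output). The variable names are only for illustration.

```python
import numpy as np

# Made-up raw IOFST values (int16, non-positive, 5 fractional bits).
nv_raw = np.array([-64, -160, -336], dtype=np.int16)
nv_mag = -nv_raw.astype(np.int32)

# Option 1: take the integer pixel part and scale it up by 16.
from_integer_part = (nv_mag >> 5) * 16

# Option 2: halve the raw magnitude, which keeps 4 of the 5 fractional bits.
from_raw = nv_mag // 2

print(from_integer_part.tolist())  # [32, 80, 160]
print(from_raw.tolist())           # [32, 80, 168]
```

The last value (10.5 pixels) shows why halving the raw result is the better option: it keeps the half pixel that the integer part loses.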

I think everything is almost clear now. But I’ll try more samples before closing this issue

Thank you.

Thanks for sharing the information! I think you are talking about the disparity description in https://docs.opencv.org/master/d2/d6e/classcv_1_1StereoMatcher.html#a03f7087df1b2c618462eb98898841345.

disparity	Output disparity map. It has the same size as the input images. Some algorithms, like StereoBM or StereoSGBM compute 16-bit fixed-point disparity map (where each disparity value has 4 fractional bits), whereas other algorithms output 32-bit floating-point disparity map.

yes, thanks.

I’ve tried dozens more samples. The results in general look promising, but the disparity computation deterministically fails at some particular image sizes in Linux.

I’ll open a different post for this issue.

I’m summarizing all that I’ve found here to close this series:

  1. The IOFST API result is of size: width: ((input_img.width + 15)/16) * 4 height: ((input_img.height + 15)/16) * 4

  2. The IOFST API result is an array of int16 type, with all values non-positive. Each value is a fixed-point number, with the low 5 bits fractional and the high 11 bits integral. Each disparity value can be decoded as -v/32.0, where v is any value in the returned array.

Thanks for the summary! There are still some points to clarify with you.

The output surface contains 4×4 downsampled MV’s. The size of each MV is:
Stereo Disparity: 2 bytes (MVx)

Why do you divide by 16?

Yes, according to different format:
NV OFST output - divide by 32
OPENCV output - divide by 16
Middlebury ground truth - divide by 8
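Those three divisors could be collected in one small converter. This is a sketch only: `to_pixels` and the dictionary keys are hypothetical names, and the divisors are simply the ones listed above.

```python
import numpy as np

# Fixed-point divisors as listed in the reply above.
SCALE = {"nv_ofst": 32.0, "opencv": 16.0, "middlebury": 8.0}


def to_pixels(values, source):
    """Convert raw fixed-point disparities to float pixels (hypothetical helper)."""
    pixels = np.asarray(values, dtype=np.float32) / SCALE[source]
    # The NV OFST output is non-positive, so take the magnitude.
    return np.abs(pixels) if source == "nv_ofst" else pixels


# Made-up raw values that all decode to 10.5 pixels.
print(to_pixels([-336], "nv_ofst").tolist())   # [10.5]
print(to_pixels([168], "opencv").tolist())     # [10.5]
print(to_pixels([84], "middlebury").tolist())  # [10.5]
```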

Sorry, typo, fixed. The size according to the code should be:

width: ((input_img.width + 15)/16) * 4 height: ((input_img.height + 15)/16) * 4
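Putting the summary together, a minimal loader might look like this. Both function names are hypothetical. It assumes the divisions in the size formula are integer divisions and that the dump is tightly packed in row-major order with no pitch padding, which I have not verified for every image size.

```python
import numpy as np


def iofst_output_size(width, height):
    """Output (width, height) from the formula above, using integer division."""
    return ((width + 15) // 16) * 4, ((height + 15) // 16) * 4


def load_disparity(path, width, height):
    """Read a raw int16 dump and decode it as -v/32.0 (hypothetical helper)."""
    out_w, out_h = iofst_output_size(width, height)
    raw = np.fromfile(path, dtype=np.int16, count=out_w * out_h)
    return (-raw.astype(np.float32) / 32.0).reshape(out_h, out_w)


# 384x288 is the size of the Tsukuba pair; this gives the 96x72 used earlier.
print(iofst_output_size(384, 288))  # (96, 72)
```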