How to correctly parse the result as a disparity map?
The disparity result from the nvmedia_imageofst sample application is of int16 type, and it looks like a vector of all non-positive numbers.
What is the correct way to convert the result into a disparity map array?
Negate the negative numbers? Re-normalize in some way?
Each value d of a disparity map with respect to the left image at (x, y) is the pixel offset of the same point of interest from the left image (at (x, y)) to the right image (at (x, y + d)). So I need the numbers in units of pixels under this semantic.
I’ve searched a bit more on Google and found this piece of code/comment
/**
* \struct NV_OF_STEREO_DISPARITY
* Struct needed for stereo /disparity. ::NV_OF_OUTPUT_EXECUTE_PARAMS::outputBuffer will be populated
* with stereo disparity in ::NV_OF_STEREO_DISPARITY format for each ::NV_OF_INIT_PARAMS::outGridSize.
* Stereo disparity is a 16-bit value with the lowest 5 bits holding fractional value,
* followed by a 11-bit unsigned integer value.
*/
typedef struct _NV_OF_STEREO_DISPARITY
{
    uint16_t disparity;    /**< Horizontal displacement [in pixels] in 11.5 format. */
} NV_OF_STEREO_DISPARITY;
in the NVIDIAOpticalFlowSDK GitHub repo.
It feels like a reasonable way of encoding the disparity in 16 bits, but after a simple experiment, I find that parsing the result from the DRIVE OFST API with this encoding is not correct.
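For concreteness, here is a minimal Python/numpy sketch of what decoding that 11.5 format would look like (the file name is an assumption); this is the interpretation that did not match the DRIVE output for me:

import numpy as np

# Hypothetical dump of the output buffer, read as unsigned 16-bit values.
raw = np.fromfile("ofst_output.bin", dtype=np.uint16)

integer_px    = raw >> 5                    # top 11 bits: whole pixels
fractional_px = (raw & 0x1F) / 32.0         # low 5 bits: 1/32-pixel steps
disparity_px  = integer_px + fractional_px  # equivalent to raw / 32.0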
I guess the values are all negative because our implementation always uses the right image as the reference frame (see NvMediaIOFSTProcessFrame). I’ll check further and get back to you here.
Yes, in the current version the stereo disparity values are all negative. They are basically motion vectors in the left direction, so you can take the absolute value of the output to get the disparity.
And then shift each value right by 5 bits to get the top 11 bits of the value, right?
But I find the values much smaller than what I get using OpenCV, so my guess is that the resulting disparity values are scaled down in some way.
The result disparity array is scaled down from the input images by a factor of 4 (original image W x H → result array W/4 x H/4). Are the disparity values in the result array computed at the W/4 x H/4 scale (i.e. block offsets instead of pixel offsets), so that we should scale the result values up by some factor such as 4 or 16 to get pixel offsets?
Previously I was using pyplot to draw the grayscale disparity images, which was not correct because pyplot normalizes the values by itself. Now I save the raw disparity numbers as images. Here are the results:
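For reference, a rough sketch of how the raw values can be written out without any library rescaling them (the array size, file names, and value range below are assumptions for illustration):

import numpy as np
import cv2

# Hypothetical output size; the result array is roughly W/4 x H/4 of the input
# (e.g. 320 x 180 for a 1280 x 720 input).
out_w, out_h = 320, 180
disp = np.fromfile("disparity.bin", dtype=np.int16).reshape(out_h, out_w)

# Write the raw magnitudes as a 16-bit PNG so nothing rescales the values.
cv2.imwrite("disparity_raw.png", np.abs(disp).astype(np.uint16))

# If pyplot is still used for display, pin the value range explicitly, e.g.:
# import matplotlib.pyplot as plt
# plt.imshow(np.abs(disp), cmap="gray", vmin=0, vmax=128 * 32)  # 128 px max at 5 fractional bits
# plt.show()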
OpenCV StereoBM also outputs a fixed-point format, with the lower 4 bits fractional. So the integer part of the result from the NVIDIA DRIVE Xavier indeed needs to be scaled up by 16 to be comparable with OpenCV's raw values (or, better, the original result can be scaled down by 2, since it has 5 fractional bits versus OpenCV's 4). The ground truth seems to use the same binary format as OpenCV StereoBM, which is what confused me.
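To make the unit difference concrete, a small sketch with made-up example values (5 fractional bits on the DRIVE side vs. 4 on the OpenCV side):

import numpy as np

# Example values only: drive_disp mimics the DRIVE output (non-positive, 5 fractional
# bits), cv_disp mimics cv2.StereoBM output (4 fractional bits) for the same disparities.
drive_disp = np.array([-320, -640, -1024], dtype=np.int16)
cv_disp    = np.array([ 160,  320,   512], dtype=np.int16)

drive_px = -drive_disp / 32.0   # -> [10., 20., 32.] pixels
cv_px    =  cv_disp    / 16.0   # -> [10., 20., 32.] pixels

# Equivalently, dividing the DRIVE magnitude by 2 puts it in StereoBM's raw units:
drive_in_cv_units = np.abs(drive_disp) // 2   # directly comparable to cv_disp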
I think everything is almost clear now, but I’ll try more samples before closing this issue.
For reference, the OpenCV documentation for the disparity output parameter says: "Output disparity map. It has the same size as the input images. Some algorithms, like StereoBM or StereoSGBM compute 16-bit fixed-point disparity map (where each disparity value has 4 fractional bits), whereas other algorithms output 32-bit floating-point disparity map."
I’ve tried dozens more samples; the results in general look promising, but the disparity computation deterministically fails at certain image sizes on Linux.
I’m summarizing all that I’ve found here to close this series:
The IOFST API result is of size: width = ((input_img.width + 15) / 16) * 4, height = ((input_img.height + 15) / 16) * 4.
The IOFST API result is an array of int16 values, all non-positive. Each value is a fixed-point number, with the low 5 bits fractional and the high 11 bits integral. Each disparity value (in pixels) can be decoded as -v / 32.0, where v is a value in the returned array.
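Putting the summary together, a minimal end-to-end decoding sketch (the input size and file name are assumptions):

import numpy as np

# Hypothetical input size and file name, for illustration.
in_w, in_h = 1280, 720
out_w = ((in_w + 15) // 16) * 4   # -> 320
out_h = ((in_h + 15) // 16) * 4   # -> 180

raw = np.fromfile("ofst_disparity.bin", dtype=np.int16).reshape(out_h, out_w)

# Values are non-positive, 11.5 fixed point: negate and divide by 2^5 to get pixels.
disparity_px = -raw.astype(np.float32) / 32.0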