Extract preprocessed tensors from nvpreprocess

Hi,

I’ve been trying to debug pgie because it gave a different result from the original PyTorch model. I verified the results between the PyTorch model and the TensorRT engine (FP32 mode), and they are almost identical, as expected. However, when I checked the entire DeepStream pipeline, the result is different.
Here’s my pipeline:

nvstreammux -> nvdspreprocess -> nvinfer -> ...

I verified the output of nvstreammux. Both padding and size are correct. The only exception is the color: I don’t think it’s RGB, but I assume the downstream pgie module will take care of that.

For the output extraction from nvstreammux, I followed this example.

Because my pipeline uses nvdspreprocess for preprocessing, I rebuilt both libnvdsgst_preprocess.so and libcustom2d_preprocess.so with the DEBUG_LIB and DEBUG_TENSOR flags enabled. These flags make the plugin save an impl_out_batch_xxx.bin file. However, when I checked the size, it wasn’t right.

tensor = np.fromfile('xxx/tensorout_batch_0.bin')
tensor.shape
<<< (992256,)

My network takes width = 1088 and height = 608, so the tensor should have 1088 x 608 x 3 = 1984512 elements. How can we investigate the output from nvdspreprocess when these two flags are enabled?
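One possible explanation (an assumption on my part): np.fromfile defaults to float64, so FP32 data read without an explicit dtype appears with half the element count, and 992256 x 2 is exactly 1088 x 608 x 3 = 1984512. A minimal sketch to re-read the dump:

import numpy as np

# tensor-data-type=0 in the config means FP32, so read the dump explicitly as float32
tensor = np.fromfile('xxx/tensorout_batch_0.bin', dtype=np.float32)
print(tensor.shape)  # expected (1984512,) = 1088 * 608 * 3 if the dump holds one full frame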

After this failed attempt, I tried to verify the input of pgie instead, since it should be the same as the output of nvdspreprocess. I followed this suggestion. I applied the patch file, but for the image saving I applied the change in queueInputBatchPreprocessed instead of queueInputBatch, since preprocessing in pgie is disabled in favor of nvdspreprocess. The snippet in this function looks as follows:

#ifdef DUMP_INPUT_TO_FILE
#define DUMP_FRAME_CNT_START (0)
#define DUMP_FRAME_CNT_STOP (10)
    if ((m_FrameCnt++ >= DUMP_FRAME_CNT_START) &&
        (m_FrameCnt <= DUMP_FRAME_CNT_STOP))
    {
        void *hBuffer;

        printf("batchDims.batchSize = %d\n", batchSize);
        assert(m_AllLayerInfo.size());
        for (size_t i = 0; i < m_AllLayerInfo.size(); i++)
        {
            NvDsInferLayerInfo &info = m_AllLayerInfo[i];
            assert(info.inferDims.numElements > 0);

            if (info.isInput)
            {
                int sizePerBatch =
                    getElementSize(info.dataType) * info.inferDims.numElements;

                cudaDeviceSynchronize();

                for (int b = 0; b < batchSize; b++)
                {
                    float *indBuf =
                        (float *)bindings[i] + b * info.inferDims.numElements;
                    int w = info.inferDims.d[2];
                    int h = info.inferDims.d[1];

                    printf("width = %d, height = %d\n", w, h);
                    printf("sizePerBatch: %d, inferDims.numElements = %u\n", sizePerBatch, info.inferDims.numElements);


                    // if (scale < 1.0)
                    // {
                    //     // R or B
                    //     NvDsInferConvert_FtFTensor(
                    //         (float *)m_inputDumpDeviceBuf,
                    //         indBuf, w, h, w, 1 / scale, NULL, NULL);
                    //     // G
                    //     NvDsInferConvert_FtFTensor(
                    //         ((float *)m_inputDumpDeviceBuf + w * h),
                    //         ((float *)indBuf + w * h),
                    //         w, h, w, 1 / scale, NULL, NULL);
                    //     // B or R
                    //     NvDsInferConvert_FtFTensor(
                    //         ((float *)m_inputDumpDeviceBuf + 2 * w * h),
                    //         ((float *)indBuf + 2 * w * h),
                    //         w, h, w, 1 / scale, NULL, NULL);
                    //     indBuf = (float *)m_inputDumpDeviceBuf;
                    // }

                    RETURN_CUDA_ERR(
                        cudaMemcpy(m_inputDumpHostBuf, (void *)indBuf, sizePerBatch,
                                   cudaMemcpyDeviceToHost),
                        "cudaMemcpy of input dump buffer failed");

                    bool dumpToRaw = false;

                    std::string filename =
                        "gie-" + std::to_string(m_UniqueID) +
                        "_input-" + std::to_string(i) +
                        "_batch-" + std::to_string(b) +
                        "_frame-" + std::to_string(m_FrameCnt);

                    if (dumpToRaw)
                        filename += ".raw";
                    else
                        filename += ".png";

                    NvDsInferFormat format = NvDsInferFormat_RGB;
                    dump_to_file(filename.c_str(),
                                 (unsigned char *)m_inputDumpHostBuf, sizePerBatch,
                                 w, h, dumpToRaw, format);
                }
            }
        }
    }
#endif

The files are saved and the shape is correct, but everything is black. That doesn’t seem right to me.

So I investigated the shape:

img = cv2.imread('/home/coco/workspace/sertis/object-tracking-app/gie-1_input-0_batch-0_frame-1.png')
img.shape
<<< (608, 1088, 3)
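As an additional check (just a sketch, reusing the img loaded above), the pixel range can be inspected to see whether the image is genuinely near-zero or merely looks dark:

# inspect the actual pixel values of the dumped PNG
print(img.min(), img.max(), img.mean())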

In this case, what could go wrong? Is this the right way to debug the preprocessed tensors?

Please find my config below.

[property]
enable=1
target-unique-ids=1

# network-input-shape: batch, channel, height, width
network-input-shape=1;3;608;1088

# 0=RGB, 1=BGR, 2=GRAY
network-color-format=0
# 0=NCHW, 1=NHWC, 2=CUSTOM
network-input-order=0
# 0=FP32, 1=UINT8, 2=INT8, 3=UINT32, 4=INT32, 5=FP16
tensor-data-type=0
tensor-name=images

processing-width=1088
processing-height=608

# 0=NVBUF_MEM_DEFAULT 1=NVBUF_MEM_CUDA_PINNED 2=NVBUF_MEM_CUDA_DEVICE
# 3=NVBUF_MEM_CUDA_UNIFIED  4=NVBUF_MEM_SURFACE_ARRAY(Jetson)
scaling-pool-memory-type=0

# 0=NvBufSurfTransformCompute_Default 1=NvBufSurfTransformCompute_GPU
# 2=NvBufSurfTransformCompute_VIC(Jetson)
scaling-pool-compute-hw=0

# Scaling Interpolation method
# 0=NvBufSurfTransformInter_Nearest 1=NvBufSurfTransformInter_Bilinear 2=NvBufSurfTransformInter_Algo1
# 3=NvBufSurfTransformInter_Algo2 4=NvBufSurfTransformInter_Algo3 5=NvBufSurfTransformInter_Algo4
# 6=NvBufSurfTransformInter_Default
scaling-filter=0

# model input tensor pool size
tensor-buf-pool-size=8

# custom-lib-path=/opt/nvidia/deepstream/deepstream-6.0/lib/gst-plugins/libcustom2d_preprocess.so
custom-lib-path=/opt/nvidia/deepstream/deepstream/sources/gst-plugins/gst-nvdspreprocess/nvdspreprocess_lib/libcustom2d_preprocess.so
custom-tensor-preparation-function=CustomTensorPreparation

[user-configs]
pixel-normalization-factor=0.003921568

# Currently, nvdspreprocess is in alpha stage, thus std scaling is not yet supported.
# preprocessing logic is as follows
#
# out = pixel-normalization-factor * (x - mean[c])
#
# more detail, see https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvdspreprocess.html

# ByteTrack's mean: 0.485, 0.456, 0.406
# ByteTrack's std: 0.229, 0.224, 0.225
# rescale back to [0-255]:
# 0.485 * 255 = 123.675
# 0.456 * 255 = 116.28
# 0.406 * 255 = 103.53
offsets=123.675;116.28;103.53

[group-0]
# set src-ids=-1 to use batch_size in network-input-shape
src-ids=-1
custom-input-transformation-function=CustomAsyncTransformation
process-on-roi=0
# roi-params-src-0=0;0;1088;608

model: here

Note: I already disabled pgie’s default preprocessing by setting input-tensor-meta=1 and also set model-color-format=0 for RGB input.

Environment
Architecture: x86_64
GPU: NVIDIA GeForce GTX 1650 Ti with Max-Q Design
NVIDIA GPU Driver: Driver Version: 495.29.05
DeepStream Version: 6.0 (running on docker image nvcr.io/nvidia/deepstream:6.0-devel)
TensorRT Version: v8001
Issue Type: Question

Hi @peeranat85
Compared to the original patch, you removed at least the two lines below.
scale is important for getting a correct image. Since you set scale to 0.003921568 according to “pixel-normalization-factor=0.003921568”, if you don’t de-normalize the data with this scale, the pixel values in the output file are very small (roughly -1 to 1), which leads to the dark image you are seeing now.

+        float scale = m_Preprocessor->getScale();
+        NvDsInferFormat format = m_Preprocessor->getNetworkFormat();
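For reference, the de-normalization described above can also be expressed in numpy (a sketch; ‘normalized’ stands in for the dumped FP32 data and is assumed here to be in H x W x C layout):

import numpy as np

scale = 0.003921568                            # pixel-normalization-factor
offsets = np.array([123.675, 116.28, 103.53])  # per-channel mean (R, G, B)

# nvdspreprocess computes out = scale * (x - mean[c]); inverting that recovers 0-255 pixels
normalized = np.zeros((608, 1088, 3), dtype=np.float32)  # placeholder for the dumped data
pixels = np.clip(normalized / scale + offsets, 0, 255).astype(np.uint8)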

Hi @mchi
Thanks for your response. I removed those lines because they gave a segfault: m_Preprocessor is nullptr, since the preprocessing is executed in the nvdspreprocess module. By the way, I didn’t intend to de-normalize the image in the first place. I’m trying to extract the raw preprocessed tensors and compare them with the ones from Python to make sure they are the same.

So, things are explained, right? Any other questions?

No, not yet. After further investigation, I found that saving the preprocessed tensors to a file with cv2.imwrite results in precision loss, as seen in the dark image I sent earlier. To fix this, I saved the raw binary instead, like so:

std::ofstream dump_file(filename, std::ios::binary);
dump_file.write((const char *)buffer, size);

This resulted in a huge file
gie-1_input-0_batch-0_frame-1.raw (7.6 MB)
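For what it’s worth, 7.6 MB is consistent with a single FP32 tensor of shape 1 x 3 x 608 x 1088:

# expected size of one FP32 NCHW tensor with network-input-shape=1;3;608;1088
expected_bytes = 1 * 3 * 608 * 1088 * 4   # 4 bytes per float32
print(expected_bytes)                     # 7938048 bytes, i.e. about 7.6 MB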

I checked the file

with open("xxx/gie-1_input-0_batch-0_frame-1.raw", 'rb') as f:
    data = np.fromfile(f, dtype=np.float32)

output = data.reshape((608, 1088, 3))
plt.imshow(output)

This is not what I expected. Why are there multiple instances of a single frame, and why does the color appear gray? See the original frame below.

This is what I got from the Python code, and it is what I expected (or something similar).

import cv2
import numpy as np
import matplotlib.pyplot as plt

def letterbox(img, height=608, width=1088, color=(0, 0, 0)):  # resize a rectangular image to a padded rectangle
    shape = img.shape[:2]  # shape = [height, width]
    ratio = min(float(height)/shape[0], float(width)/shape[1])
    new_shape = (round(shape[1] * ratio), round(shape[0] * ratio)) # new_shape = [width, height]
    dw = (width - new_shape[0]) / 2  # width padding
    dh = (height - new_shape[1]) / 2  # height padding
    top, bottom = round(dh - 0.1), round(dh + 0.1)
    left, right = round(dw - 0.1), round(dw + 0.1)
    img = cv2.resize(img, new_shape, interpolation=cv2.INTER_AREA)  # resized, no border
    img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # padded rectangular
    return img, ratio, dw, dh

img = cv2.imread(img_path)
img, r, dw, dh = letterbox(img, color=(0, 0, 0))
img = img[:, :, ::-1].astype(np.float32)
img /= 255.0
img -= (0.485, 0.456, 0.406) # rgb mean
plt.imshow(img)


Note that the preprocessing logic here is the same as what is configured in DeepStream’s nvdspreprocess module.
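As a quick arithmetic check that the two formulations agree (the Python code subtracts the mean after dividing by 255, while nvdspreprocess computes pixel-normalization-factor * (x - mean) with the mean in the 0-255 range):

x = 200.0                            # an arbitrary pixel value (R channel)
py = x / 255.0 - 0.485               # Python-side formula
ds = 0.003921568 * (x - 123.675)     # nvdspreprocess formula: factor * (x - mean)
print(abs(py - ds))                  # ~1e-7, i.e. the two are equivalent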

To summarize the questions:

  1. With this preprocessing logic, why didn’t I get the same output (or something similar) as in Python?
  2. In gie-1_input-0_batch-0_frame-1.raw, why are there multiple instances of a single frame, and why does the color appear gray despite setting model-color-format=0?

What’s the batch size of pgie?

The batch size is 1.

FYI, just to give you an update: when I switched to using pgie’s preprocessing (batch-size=1) and removed the nvdspreprocess plugin, the output PNG file looks correct.

However, the raw binary still gives a pack of images as before.

nvinver-gie-1_input-0_batch-0_frame-1.raw (7.6 MB)

I’m not sure this is the right way to check the raw binary file?

Finally, I managed to find out why I got a pack of images when visualizing the raw binary file.

The data.reshape((608, 1088, 3)) call was the problem: since the tensor is dumped in NCHW order (network-input-order=0), the channel must come first in the reshape, and the result then transposed to HWC for display; see the sketch below. With that change, the preprocessed image looks similar to the one from the Python code.
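A minimal sketch of that fix, assuming the dump is the FP32 NCHW tensor described above:

import numpy as np
import matplotlib.pyplot as plt

data = np.fromfile('gie-1_input-0_batch-0_frame-1.raw', dtype=np.float32)

# the tensor is planar (NCHW), so reshape channel-first, then move channels last for display
output = data.reshape((3, 608, 1088)).transpose(1, 2, 0)
plt.imshow(output)
plt.show()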

