Reflecting PyTorch Normalize transform parameters in DeepStream configuration

Setup Information
• Hardware Platform - GPU
• DeepStream Version - 5.0 (using Docker image nvcr.io/nvidia/deepstream:5.0-20.07-triton)
• TensorRT Version - 7.0.0-1+cuda10.2
• NVIDIA GPU Driver Version (valid for GPU only) - Driver Version: 455.32.00 CUDA Version: 11.1
• Issue Type - Question

Problem/Use case:
I am working on using a PyTorch model with Triton Inference Server. However, the PyTorch code applies a particular transform to the input image before feeding it to the model for classification.

I have referred to the following two posts:

  1. PyTorch normalization in Deepstream config
  2. Image preprocess question

But I still couldn’t understand the explanation and math equations given in these two posts.

The image transformations in PyTorch are as follows:

self.mean = [0.485, 0.456, 0.406]
self.std = [0.229, 0.224, 0.225]

normalize = transforms.Normalize(mean=self.mean, std=self.std)
test_transform = transforms.Compose([
    transforms.Resize(cfg.resize),
    transforms.ToTensor(),
    normalize, ])

I would like to reproduce the above transformation (Resize and Normalize) in a DeepStream pipeline with Triton Inference Server (nvinferserver). To do so, I filled in the configuration values as below:

preprocess {
  network_format: IMAGE_FORMAT_RGB
  tensor_order: TENSOR_ORDER_LINEAR
  maintain_aspect_ratio: 1
  normalize {
    scale_factor: 0.017353
    channel_offsets: [123.69, 116.31, 103.52]
  }
}

I assume that the range of input pixel values to this network is [0, 255]. There are two transformations I need guidance on.

A. Normalization

For PyTorch, the pixel values before applying std and mean are in [0, 1] (reference). After applying the Normalize transform, I expect a range of [-2.146, 2.279] for channel 0, [-2.018, 2.407] for channel 1 and [-1.796, 2.628] for channel 2.
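
As a quick check of these ranges, here is a minimal pure-Python sketch that evaluates y = (x - mean) / std at the two extremes of the ToTensor output, x = 0 and x = 1. The helper name `normalized_range` is made up for illustration. With the per-channel std values listed above, the exact endpoints come out slightly different from the quoted figures (about [-2.118, 2.249] for channel 0).

```python
# Per-channel statistics copied from the transform above.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalized_range(mean, std):
    """Return (low, high) of (x - mean) / std for x in [0, 1]."""
    return (0.0 - mean) / std, (1.0 - mean) / std


for channel, (m, s) in enumerate(zip(MEAN, STD)):
    low, high = normalized_range(m, s)
    print(f"channel {channel}: [{low:.3f}, {high:.3f}]")
```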

Using the values obtained from the equation provided by @AastaLLL in this post, the normalized pixel range is totally different from the normalized pixel range in PyTorch. Substituting the said mean and factor gives [-0.393065, 0.39616] for channel 0.

So I tried a different approach: back-calculating the net-scale-factor and mean from the following equation, using the expected result and the range of the input values.

e.g: Channel 0

norm_pix = net-scale-factor * ( x - mean )
eqn 1 : -2.146 = net-scale-factor * (0 - mean)
eqn 2 : 2.279 = net-scale-factor * (255 - mean)

-2.146 = net-scale-factor * (-mean)
2.279 = 255(net-scale-factor) - net-scale-factor * mean
255(net-scale-factor) = 4.425

Solving these linear equations, I get net-scale-factor = 0.017353. Solving for the mean gives me 123.69.
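
The two-point solve above can be sketched in pure Python as follows. The function names `solve_scale_and_offset` and `closed_form` are hypothetical. The first takes the expected normalized values at x = 0 and x = 255 and returns the scale factor and offset for y = scale * (x - offset). The second uses the closed form that follows from substituting x / 255 into (x - mean) / std, namely scale = 1 / (255 * std) and offset = 255 * mean, assuming the input pixels really are in [0, 255].

```python
def solve_scale_and_offset(y_at_0, y_at_max, x_max=255.0):
    """Solve y = scale * (x - offset) from the outputs at x = 0 and x = x_max."""
    scale = (y_at_max - y_at_0) / x_max   # subtract the two equations
    offset = -y_at_0 / scale              # from y_at_0 = scale * (0 - offset)
    return scale, offset


def closed_form(mean, std, x_max=255.0):
    """Scale and offset equivalent to (x / x_max - mean) / std."""
    return 1.0 / (x_max * std), x_max * mean


# Channel 0, using the expected endpoints quoted in the post.
scale, offset = solve_scale_and_offset(-2.146, 2.279)
print(f"two-point solve: scale={scale:.6f}, offset={offset:.2f}")

# Channel 0, directly from mean=0.485 and std=0.229.
cf_scale, cf_offset = closed_form(0.485, 0.229)
print(f"closed form:     scale={cf_scale:.6f}, offset={cf_offset:.2f}")
```

Both routes describe the same linear map, so they agree whenever the expected endpoints were computed from the same mean and std.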

B. Image Resize
I am not particularly sure about this aspect. The closest properties in the pre-processing block that I could try are below:

frame_scaling_hw: FRAME_SCALING_HW_DEFAULT
frame_scaling_filter: 1
# frame scaling using Bilinear (check NvBufSurfTransform_Inter enum)

Questions
My questions would be :

  1. Is the assumption of input pixel range [0, 255] to the pre-processing block valid?
  2. In section A (Normalization), is solving the linear equations an appropriate way to obtain the net-scale-factor and mean?
  3. For transforms.Resize(cfg.resize), which preprocessing property should I modify to reflect the resize operation?

Any feedback and guidance is very much appreciated in reflecting this entire transform operation.

Thanks!

Hi,

A. Normalization

Actually, I was confused about the name of std.

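The real equation used in PyTorch is y = (x - mean) / std.
As a result, you can map it to DeepStream by setting the mean to self.mean and net_scale_factor to 1/self.std, both rescaled to the input pixel range: for [0, 255] input, that is 255 * self.mean and 1 / (255 * self.std).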

The problem is that we don’t support channel-wise normalization.
So you can either choose one value for the whole image or update the source code to use 3 separate values here:

/opt/nvidia/deepstream/deepstream-5.0/sources/libs/nvdsinfer/nvdsinfer_context_impl.cpp
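
For the "one value for the whole image" option, one possible choice is to average the three standard deviations. This is only an illustrative assumption, not something prescribed by DeepStream. The sketch below assumes [0, 255] input, keeps the per-channel offsets, and reports how far the single-scale output drifts from the exact per-channel result at the extreme pixel values. The helper name `worst_case_error` is made up for this example. With these statistics the averaged std is 0.226, which gives a scale factor close to the 0.017353 used in the configuration in the question.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]
X_MAX = 255.0  # assumed maximum input pixel value

# Per-channel offsets on the [0, 255] scale.
offsets = [X_MAX * m for m in MEAN]

# Single scale factor from the averaged std (an illustrative choice).
avg_std = sum(STD) / len(STD)
single_scale = 1.0 / (X_MAX * avg_std)


def worst_case_error(offsets, stds, scale):
    """Largest |single-scale output - per-channel output| at x = 0 and x = X_MAX."""
    worst = 0.0
    for off, s in zip(offsets, stds):
        exact_scale = 1.0 / (X_MAX * s)
        for x in (0.0, X_MAX):
            worst = max(worst, abs((scale - exact_scale) * (x - off)))
    return worst


print("channel_offsets:", [round(o, 3) for o in offsets])
print("scale_factor:", round(single_scale, 6))
print("worst-case deviation:", round(worst_case_error(offsets, STD, single_scale), 4))
```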

B. Image Resize

This step resizes the input image to the same size as the network input.
It is enabled by default in DeepStream.

You can also find the implementation details in the same file shared above.
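
To compare the two resize behaviours, here is a rough pure-Python sketch of the output sizes only. It assumes cfg.resize is a single integer, in which case torchvision's Resize scales the shorter edge to that size and keeps the aspect ratio. The second function is an assumed model of fitting a frame into fixed network dimensions while keeping the aspect ratio (as maintain_aspect_ratio: 1 suggests); its rounding is a guess and is not taken from the DeepStream source. Both function names are hypothetical.

```python
def shorter_edge_resize(width, height, size):
    """Output (w, h) when the shorter edge is scaled to `size`,
    as torchvision's Resize does for an int argument."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def fit_with_aspect_ratio(width, height, net_w, net_h):
    """Output (w, h) of the scaled image when it is fitted inside
    net_w x net_h with its aspect ratio kept (assumed behaviour;
    the rest of the network input would be padding)."""
    scale = min(net_w / width, net_h / height)
    return round(width * scale), round(height * scale)


# Example: a 1920x1080 frame and a 224x224 network input (illustrative numbers).
print(shorter_edge_resize(1920, 1080, 224))
print(fit_with_aspect_ratio(1920, 1080, 224, 224))
```

If the two functions give different sizes for your frames, the model sees a differently framed image than it did in PyTorch, so it is worth checking which form of Resize the training code actually used.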

Thanks.

Hey @AastaLLL, thanks for referring me to the file! It makes more sense to me now. Based on the file, it seems that the values set in channel_offsets perform the x - mean operation.

To help me understand things a bit more, may I ask what pitch refers to in the following code?

__global__ void
NvDsInferConvert_C1ToP1FloatKernelWithMeanSubtraction(
        float *outBuffer,
        unsigned char *inBuffer,
        unsigned int width,
        unsigned int height,
        unsigned int pitch,
        float scaleFactor,
        float *meanDataBuffer)
{
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (col < width && row < height)
    {
        outBuffer[row * width + col] =
            scaleFactor * ((float) inBuffer[row * pitch + col] -
            meanDataBuffer[(row * width) + col]);
    }
}

Hi,

Pitch specifies the surface’s line width in bytes.
Usually, it is equal to width * bytes_per_pixel, although it can be larger when rows are padded for alignment.
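
To illustrate why the kernel indexes the input with row * pitch + col but the output with row * width + col, here is a small pure-Python sketch of a single-channel, 8-bit image stored with padded rows. The helper names and the pitch of 8 bytes are arbitrary choices for the example and are not taken from DeepStream.

```python
def pack_rows(rows, pitch, pad_byte=0xFF):
    """Lay out 8-bit rows in a flat buffer where each row starts every `pitch` bytes."""
    buf = bytearray()
    for row in rows:
        if len(row) > pitch:
            raise ValueError("pitch must be at least the row width in bytes")
        buf += bytes(row) + bytes([pad_byte]) * (pitch - len(row))
    return buf


def read_pixel(buf, row, col, pitch):
    """Read the pixel at (row, col) from a pitched single-channel buffer."""
    return buf[row * pitch + col]


rows = [[10, 11, 12], [20, 21, 22]]  # a 3x2 image, 1 byte per pixel
width, pitch = 3, 8                  # 5 padding bytes at the end of each row
buf = pack_rows(rows, pitch)

print(read_pixel(buf, 1, 2, pitch))  # indexes with pitch: the real pixel
print(buf[1 * width + 2])            # indexes with width: lands in row 0's padding
```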

Thanks.