Issue running DriveWorks DNN on a custom model

Hi,

I am trying to run and validate our model using the DriveWorks DNN module. I have already validated the same model using TensorRT, and it works fine there.

With DriveWorks I am facing a few issues for which I need your support. Our model has 6 inputs and 11 outputs; for now I am trying to feed file-based input to validate the outputs.
My doubt is about passing the 6 inputs to the model: whenever I try, the channel descriptor does not match during dwDataConditioner_prepareData and I get either

Driveworks exception thrown: CudaTextureObjectCache::getOrCreateTextureObject. cudaErrorInvalidChannelDescriptor: invalid channel descriptor

or

Driveworks exception thrown: DW_INVALID_ARGUMENT: DataConditioner: Given tensor's properties do not match the expected properties.

Initialize DNN:

    {
        // If not specified, load the correct network based on platform
        std::string tensorRTModel = getArgument("tensorRT_model");
        if (tensorRTModel.empty())
        {
            tensorRTModel = dw_samples::DataPath::get() + "/samples/detector/";
            tensorRTModel += getPlatformPrefix();
            tensorRTModel += "/model32.bin";
        }

        // Initialize DNN from a TensorRT file
        CHECK_DW_ERROR(dwDNN_initializeTensorRTFromFile(&m_dnn, tensorRTModel.c_str(), nullptr,
                                                        DW_PROCESSOR_TYPE_GPU, m_sdk));

        CHECK_DW_ERROR(dwDNN_setCUDAStream(m_cudaStream, m_dnn));

        // Get input and output dimensions
        dwDNNTensorProperties inputProps[NUM_INPUT_TENSORS];
        dwDNNTensorProperties outputProps[NUM_OUTPUT_TENSORS];

        // Allocate input tensor
        for (uint32_t inputIdx = 0U; inputIdx < NUM_INPUT_TENSORS; ++inputIdx)
        {
            CHECK_DW_ERROR(dwDNN_getInputTensorProperties(&inputProps[inputIdx], inputIdx, m_dnn));
            printf("input tensor done\n");
            CHECK_DW_ERROR(dwDNNTensor_createNew(&m_dnnInput[inputIdx], &inputProps[inputIdx], m_sdk));
            printf("input tensor create done\n");
        }

        // Allocate outputs
        for (uint32_t outputIdx = 0U; outputIdx < NUM_OUTPUT_TENSORS; ++outputIdx)
        {
            CHECK_DW_ERROR(dwDNN_getOutputTensorProperties(&outputProps[outputIdx], outputIdx, m_dnn));
            // Allocate device tensors
            CHECK_DW_ERROR(dwDNNTensor_createNew(&m_dnnOutputsDevice[outputIdx], &outputProps[outputIdx], m_sdk));
            // Allocate host tensors
            dwDNNTensorProperties hostProps = outputProps[outputIdx];
            hostProps.tensorType            = DW_DNN_TENSOR_TYPE_CPU;

            // Allocate streamer
            CHECK_DW_ERROR(dwDNNTensorStreamer_initialize(&m_dnnOutputStreamers[outputIdx],
                                                          &outputProps[outputIdx],
                                                          hostProps.tensorType, m_sdk));
        }

        // Get coverage and bounding box blob indices
        //const char* coverageBlobName    = "coverage";
        //const char* boundingBoxBlobName = "bboxes";
        //CHECK_DW_ERROR(dwDNN_getOutputIndex(&m_cvgIdx, coverageBlobName, m_dnn));
        //CHECK_DW_ERROR(dwDNN_getOutputIndex(&m_bboxIdx, boundingBoxBlobName, m_dnn));

        // Get metadata from the DNN module.
        // DNN loads metadata automatically from a .json file stored next to the
        // DNN model, with the same name plus an additional .json extension, if
        // present. Otherwise, the metadata is filled with default values and the
        // data conditioner parameters should be filled manually (see the sketch
        // after this block).
        dwDNNMetaData metadata;
        CHECK_DW_ERROR(dwDNN_getMetaData(&metadata, m_dnn));

        for (uint32_t i = 0U; i < NUM_INPUT_TENSORS; ++i)
        {
            // Initialize data conditioner
            CHECK_DW_ERROR(dwDataConditioner_initializeFromTensorProperties(&m_dataConditioner, &inputProps[i], 1U,
                                                                            &metadata.dataConditionerParams, m_cudaStream,
                                                                            m_sdk));

            // Detection region
            m_detectionRegion[i].width = static_cast<uint32_t>(inputProps[i].dimensionSize[0]);
            m_detectionRegion[i].height = static_cast<uint32_t>(inputProps[i].dimensionSize[1]);
            m_detectionRegion[i].x = 0;//(m_imageWidth - m_detectionRegion.width) / 2;
            m_detectionRegion[i].y = 0;//(m_imageHeight - m_detectionRegion.height) / 2;
        }

        // Compute a pixel (cell) in output in relation to a pixel in input of the network
        uint32_t gridW = outputProps[0].dimensionSize[0];
        m_cellSize     = inputProps[0].dimensionSize[0] / gridW;
    }
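
For context, the manual path mentioned in the comment above would look roughly like this (a sketch only; the field names are my reading of dwDataConditionerParams, and the values are placeholders rather than our real settings):

    // Sketch: if no model32.bin.json is found next to the model, fill the
    // data conditioner parameters by hand (placeholder values below).
    dwDNNMetaData metadata{};
    CHECK_DW_ERROR(dwDNN_getMetaData(&metadata, m_dnn));
    metadata.dataConditionerParams.scaleCoefficient            = 1.0f;  // no pixel scaling
    metadata.dataConditionerParams.splitPlanes                 = true;  // planar network input
    metadata.dataConditionerParams.ignoreAspectRatio           = false;
    metadata.dataConditionerParams.doPerPlaneMeanNormalization = false;
    for (uint32_t p = 0U; p < 3U; ++p)
    {
        metadata.dataConditionerParams.meanValue[p] = 0.0f;             // zero mean per plane
    }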

Process:

    dwImageCUDA* currframecuda = nullptr;
    dwImageCUDA* encoder0cuda = nullptr;
    dwImageCUDA* encoder1cuda = nullptr;
    dwImageCUDA* encoder2cuda = nullptr;
    dwImageCUDA* encoder3cuda = nullptr;
    dwImageCUDA* encoder4cuda = nullptr;

    dwImageHandle_t currentFrame[6];
    dwImageProperties currentFrameProps;
    currentFrameProps.type = DW_IMAGE_CUDA;
    currentFrameProps.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    currentFrameProps.format = DW_IMAGE_FORMAT_RGB_FLOAT32;
    currentFrameProps.height = 224;
    currentFrameProps.width = 512;

    dwImageProperties encoder0Props;
    encoder0Props.type = DW_IMAGE_CUDA;
    encoder0Props.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    encoder0Props.format = DW_IMAGE_FORMAT_R_FLOAT32;
    encoder0Props.height = 112;
    encoder0Props.width = 256;

    dwImageProperties encoder1Props;
    encoder1Props.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    encoder1Props.type = DW_IMAGE_CUDA;
    encoder1Props.format = DW_IMAGE_FORMAT_R_FLOAT32;
    encoder1Props.height = 56;
    encoder1Props.width = 128;

    dwImageProperties encoder2Props;
    encoder2Props.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    encoder2Props.type = DW_IMAGE_CUDA;
    encoder2Props.format = DW_IMAGE_FORMAT_R_FLOAT32;
    encoder2Props.height = 28;
    encoder2Props.width = 64;

    dwImageProperties encoder3Props;
    encoder3Props.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    encoder3Props.type = DW_IMAGE_CUDA;
    encoder3Props.format = DW_IMAGE_FORMAT_R_FLOAT32;
    encoder3Props.height = 14;
    encoder3Props.width = 32;

    dwImageProperties encoder4Props;
    encoder4Props.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    encoder4Props.type = DW_IMAGE_CUDA;
    encoder4Props.format = DW_IMAGE_FORMAT_R_FLOAT32;
    encoder4Props.height = 7;
    encoder4Props.width = 16;

    float* hostDataBuffer = (float*)malloc(512 * 224 * 3 * sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/current_frame.raw", (uint8_t *)hostDataBuffer, 512*224*3*4);

    float* hostDataBuffer1 = (float*)malloc(256*112*64 * sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/i_encoder_0.raw", (uint8_t *)hostDataBuffer1, 256*112*64*4);

    float* hostDataBuffer2 = (float*)malloc(128*56*64 * sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/i_encoder_1.raw", (uint8_t *)hostDataBuffer2, 128*56*64*4);

    float* hostDataBuffer3 = (float*)malloc(64*28*128 * sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/i_encoder_2.raw", (uint8_t *)hostDataBuffer3, 64*28*128*4);

    float* hostDataBuffer4 = (float*)malloc(32*14*256* sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/i_encoder_3.raw", (uint8_t *)hostDataBuffer4, 32*14*256*4);

    float* hostDataBuffer5 = (float*)malloc(16*7*512 * sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/i_encoder_4.raw", (uint8_t *)hostDataBuffer5, 16*7*512*4);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[0],currentFrameProps,m_sdk));

    CHECK_DW_ERROR(dwImage_getCUDA(&currframecuda,currentFrame[0]));
    cudaMemcpy2D(currframecuda->dptr[0], currframecuda->pitch[0], hostDataBuffer,
                 sizeof(uint8_t) * 512 * 3,
                 sizeof(uint8_t) * 512 * 3,
                 224, cudaMemcpyHostToDevice);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[1],encoder0Props,m_sdk));
    CHECK_DW_ERROR(dwImage_getCUDA(&encoder0cuda,currentFrame[1]));

    cudaMemcpy2D(encoder0cuda->dptr[0], encoder0cuda->pitch[0], hostDataBuffer1,
                 sizeof(uint8_t) * 256 * 64,
                 sizeof(uint8_t) * 256 * 64,
                 112, cudaMemcpyHostToDevice);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[2],encoder1Props,m_sdk));
    CHECK_DW_ERROR(dwImage_getCUDA(&encoder1cuda,currentFrame[2]));

    cudaMemcpy2D(encoder1cuda->dptr[0], encoder1cuda->pitch[0], hostDataBuffer2,
                 sizeof(uint8_t) * 128 * 64,
                 sizeof(uint8_t) * 128 * 64,
                 56, cudaMemcpyHostToDevice);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[3],encoder2Props,m_sdk));

    CHECK_DW_ERROR(dwImage_getCUDA(&encoder2cuda,currentFrame[3]));

    cudaMemcpy2D(encoder2cuda->dptr[0], encoder2cuda->pitch[0], hostDataBuffer3,
                 sizeof(uint8_t) * 64 * 128,
                 sizeof(uint8_t) * 64 * 128,
                 28, cudaMemcpyHostToDevice);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[4],encoder3Props,m_sdk));
    CHECK_DW_ERROR(dwImage_getCUDA(&encoder3cuda,currentFrame[4]));
    cudaMemcpy2D(encoder3cuda->dptr[0], encoder3cuda->pitch[0], hostDataBuffer4,
                 sizeof(uint8_t) * 32 * 256,
                 sizeof(uint8_t) * 32 * 256,
                 14, cudaMemcpyHostToDevice);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[5],encoder4Props,m_sdk));
    CHECK_DW_ERROR(dwImage_getCUDA(&encoder4cuda,currentFrame[5]));
    cudaMemcpy2D(encoder4cuda->dptr[0], encoder4cuda->pitch[0], hostDataBuffer5,
                 sizeof(uint8_t) * 16 * 512,
                 sizeof(uint8_t) * 16 * 512,
                 7, cudaMemcpyHostToDevice);

    auto inputImgs = static_cast<dwImageHandle_t  const *>(currentFrame);
    CHECK_DW_ERROR(dwDataConditioner_prepareData(m_dnnInput[0], inputImgs, 6U, &m_detectionRegion[0],
                                                 cudaAddressModeClamp, m_dataConditioner));
    dwConstDNNTensorHandle_t inputs[1U] = {m_dnnInput[0]};
    CHECK_DW_ERROR(dwDNN_infer(m_dnnOutputsDevice, NUM_OUTPUT_TENSORS, inputs, 6U, m_dnn));

Please let me know what I am missing. Do we need 6 input tensors and 6 data conditioner handles, or is a single one enough?
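
For reference, the per-input arrangement I am asking about would look roughly like this (a sketch only; m_dataConditioners would be a hypothetical array of handles, one per input, each initialized from the matching inputProps):

    // Sketch: one data conditioner and one prepared tensor per network input,
    // then a single inference call with all six tensors.
    dwConstDNNTensorHandle_t inputs[NUM_INPUT_TENSORS];
    for (uint32_t i = 0U; i < NUM_INPUT_TENSORS; ++i)
    {
        CHECK_DW_ERROR(dwDataConditioner_prepareData(m_dnnInput[i], &currentFrame[i], 1U,
                                                     &m_detectionRegion[i], cudaAddressModeClamp,
                                                     m_dataConditioners[i]));
        inputs[i] = m_dnnInput[i];
    }
    CHECK_DW_ERROR(dwDNN_infer(m_dnnOutputsDevice, NUM_OUTPUT_TENSORS,
                               inputs, NUM_INPUT_TENSORS, m_dnn));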

Dear @mukilan.vijayakumar,
Firstly, could you confirm whether model32.bin was generated using the TensorRT_Optimization tool?

Dear Siva,

Yes, it was generated using the TensorRT_Optimization tool, and I think the model loads fine.

Please find the attached log image; I got one warning:


DNN: Missing or incompatible parameter in metadata (tonemapType). Parameter is set to default value. See dwDataConditioner for default values.

Dear @mukilan.vijayakumar,
Have you filled in the right entries for your model in the corresponding .json file? If so, could you share the .json file? If possible, please also share the complete code and model so I can reproduce this on my host for more clarity.

Dear Siva,

Please find attached the JSON file, the code, and the model's input and output information.
model32.bin.json (321 Bytes)
main.cpp (32.5 KB)

Dear Siva,

Can you please support here? I am kind of blocked.

Dear @mukilan.vijayakumar,
Could you share the model (ONNX or Caffe) and the command used to generate the optimized model?

Dear Siva,

You mean the command I used to generate model32.bin with the TensorRT_Optimization tool?

Yes.

./tensorRT_optimization --modelType=onnx --onnxFile /home/nvidia/model.onnx --out /home/nvidia/test.bin


WARNING: Using default Logger, most probably DriveWorks library was linked more than once.

DefaultLogger: [03-09-2021 08:46:26] DefaultLogger: WARNING: ExplicitBatch is enabled by default for ONNX models.
Initializing TensorRT generation on model /home/nvidia/model.onnx.

Input filename: /home/nvidia/model.onnx
ONNX IR version: 0.0.6
Opset version: 12
Producer name: pytorch
Producer version: 1.6
Domain:
Model version: 0
Doc string:

onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Tensor DataType is determined at build time for tensors not marked as input or output.
Tensor DataType is determined at build time for tensors not marked as input or output.
Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
Input "current_frame": 1x3x224x512
Input "i_encoder_0": 1x64x112x256
Input "i_encoder_1": 1x64x56x128
Input "i_encoder_2": 1x128x28x64
Input "i_encoder_3": 1x256x14x32
Input "i_encoder_4": 1x512x7x16
Output "o_encoder_0": 1x64x112x256
Output "o_encoder_1": 1x64x56x128
Output "o_encoder_2": 1x128x28x64
Output "o_encoder_3": 1x256x14x32
Output "o_encoder_4": 1x512x7x16
Output "depth": -1x-1x-1x-1
Output "sem": 1x224x512
Output "motion": 1x224x512
Output "det1": 1x30x7x16
Output "det2": 1x30x14x32
Output "det3": 1x30x28x64
Building Engine…
../builder/virtualMachineSyntaxDAG.cpp (906) - Cuda Error in operator(): 0 (Number of registers exceeding the maximum limit, total 5 are needed, but only 4 are available)
../builder/virtualMachineSyntaxDAG.cpp (906) - Cuda Error in operator(): 0 (Number of registers exceeding the maximum limit, total 5 are needed, but only 4 are available)
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 0: 44.9525 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 1: 45.2897 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 2: 45.1352 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 3: 45.1212 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 4: 45.1438 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 5: 45.2138 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 6: 45.0432 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 7: 45.0903 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 8: 45.1281 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 9: 45.1161 ms.
CUDA graph OFF, Average over 10 runs is 45.1234 ms.

The depth output is reported as -1x-1x-1x-1 (dynamic dimensions), so I think the depth part of the model is not loaded properly.

Dear @mukilan.vijayakumar,
Could you share your ONNX model?

Dear Siva,

Sorry, I am not supposed to share the model.
Is there any way we can debug this without sharing the model?
The same model runs successfully with the TensorRT sample; everything works fine there.
Only with the DNN module do we face this issue.

Dear Siva,

Please let me know: does DNN support input of type float32 RGB image?
I always get Driveworks exception thrown: CudaTextureObjectCache::getOrCreateTextureObject. cudaErrorInvalidChannelDescriptor: invalid channel descriptor

Dear @mukilan.vijayakumar,
As I don't have sufficient data to reproduce this, could you try preprocessing the other 'R' buffers and confirm there are no issues with dwDataConditioner_prepareData()?

"DNN support input of type float32 RGB image"

Do you mean dwDataConditioner_prepareData accepts RGB buffers? Note that if the input buffer contains RGBA pixels, channel interleaving is allowed, but if the input buffer is RGB only, interleaving is not allowed. So I think you need to use DW_IMAGE_FORMAT_RGB_UINT8_PLANAR, create the R, G and B buffers separately, and use cudaMemcpy2D to fill those buffers. Also, set doPerPlaneMeanNormalization to true.
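
A minimal sketch of that approach (assuming a pitched planar CUDA image exposes one dptr/pitch pair per plane, and hostPlaneR/hostPlaneG/hostPlaneB are hypothetical host buffers each holding one 512x224 channel, one byte per pixel):

    // Sketch: create a planar RGB uint8 image and fill each plane separately.
    dwImageProperties planarProps{};
    planarProps.type         = DW_IMAGE_CUDA;
    planarProps.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    planarProps.format       = DW_IMAGE_FORMAT_RGB_UINT8_PLANAR;
    planarProps.width        = 512;
    planarProps.height       = 224;

    dwImageHandle_t planarImage = DW_NULL_HANDLE;
    CHECK_DW_ERROR(dwImage_create(&planarImage, planarProps, m_sdk));

    dwImageCUDA* planarCuda = nullptr;
    CHECK_DW_ERROR(dwImage_getCUDA(&planarCuda, planarImage));

    const uint8_t* hostPlanes[3] = {hostPlaneR, hostPlaneG, hostPlaneB};
    for (uint32_t p = 0U; p < 3U; ++p)
    {
        cudaMemcpy2D(planarCuda->dptr[p], planarCuda->pitch[p],
                     hostPlanes[p], 512 * sizeof(uint8_t),  // source pitch in bytes
                     512 * sizeof(uint8_t), 224,            // copy width in bytes, height in rows
                     cudaMemcpyHostToDevice);
    }

    // Plus doPerPlaneMeanNormalization = true in the data conditioner parameters.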

Dear Siva,

I need to pass the input as fp32 RGB. Is DW_IMAGE_FORMAT_RGB_FLOAT32 supported by DNN?

Is that the reason I keep getting Driveworks exception thrown: CudaTextureObjectCache::getOrCreateTextureObject. cudaErrorInvalidChannelDescriptor: invalid channel descriptor?

Dear @mukilan.vijayakumar,
No. Please check using the DW_IMAGE_FORMAT_RGB_FLOAT32_PLANAR type.

Dear Siva,

I have tried DW_IMAGE_FORMAT_RGB_FLOAT32_PLANAR as well; still the same issue.

Since I am facing this issue with DNN, I am trying to link the DriveWorks SFM sample with our working TensorRT CNN code.
Will this approach work for the time being?

That is, the camera capture video pipeline (NVIDIA framework) plus the CNN from our TensorRT sample.
Also, in our TensorRT sample we handled pre- and post-processing using OpenCV; does DriveWorks support OpenCV?

Dear @mukilan.vijayakumar,
Did you check using the TensorRT APIs with DriveWorks? Also, I am wondering why you don't use the DNN raw buffers instead of tensor buffers. Any reason?
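
A sketch of that raw-buffer path, assuming the dwDNN_inferRaw() entry point from the DriveWorks headers (batch size 1); inputElemCount/outputElemCount and hostBuffers are hypothetical arrays describing your blobs:

    // Sketch: bypass dwDNNTensor and feed raw float32 device buffers directly.
    float32_t* devIn[NUM_INPUT_TENSORS];
    float32_t* devOut[NUM_OUTPUT_TENSORS];
    for (uint32_t i = 0U; i < NUM_INPUT_TENSORS; ++i)
    {
        cudaMalloc(&devIn[i], inputElemCount[i] * sizeof(float32_t));
        cudaMemcpy(devIn[i], hostBuffers[i],
                   inputElemCount[i] * sizeof(float32_t), cudaMemcpyHostToDevice);
    }
    for (uint32_t o = 0U; o < NUM_OUTPUT_TENSORS; ++o)
    {
        cudaMalloc(&devOut[o], outputElemCount[o] * sizeof(float32_t));
    }
    CHECK_DW_ERROR(dwDNN_inferRaw(devOut, devIn, 1U, m_dnn));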