Issue running DriveWorks DNN on a custom model

Hi,

I am trying to run and validate our model using the DriveWorks DNN module. I have already validated the same model using TensorRT, and it works fine there.

With DriveWorks I am facing a few issues for which I need your support. Our model has 6 inputs and 11 outputs; for now I am trying to feed file-based input to validate the outputs.
My doubt is about passing the 6 inputs to the model: whenever I try, the channel descriptor does not match during dwDataConditioner_prepareData and I get either

Driveworks exception thrown: CudaTextureObjectCache::getOrCreateTextureObject. cudaErrorInvalidChannelDescriptor: invalid channel descriptor

or

Driveworks exception thrown: DW_INVALID_ARGUMENT: DataConditioner: Given tensor's properties do not match the expected properties.

Initialize DNN:

    {
        // If not specified, load the correct network based on platform
        std::string tensorRTModel = getArgument("tensorRT_model");
        if (tensorRTModel.empty())
        {
            tensorRTModel = dw_samples::DataPath::get() + "/samples/detector/";
            tensorRTModel += getPlatformPrefix();
            tensorRTModel += "/model32.bin";
        }

        // Initialize DNN from a TensorRT file
        CHECK_DW_ERROR(dwDNN_initializeTensorRTFromFile(&m_dnn, tensorRTModel.c_str(), nullptr,
                                                        DW_PROCESSOR_TYPE_GPU, m_sdk));

        CHECK_DW_ERROR(dwDNN_setCUDAStream(m_cudaStream, m_dnn));

        // Get input and output dimensions
        dwDNNTensorProperties inputProps[NUM_INPUT_TENSORS];
        dwDNNTensorProperties outputProps[NUM_OUTPUT_TENSORS];

        // Allocate input tensor
        for (uint32_t inputIdx = 0U; inputIdx < NUM_INPUT_TENSORS; ++inputIdx)
        {
            CHECK_DW_ERROR(dwDNN_getInputTensorProperties(&inputProps[inputIdx], inputIdx, m_dnn));
            printf("input tensor done\n");
            CHECK_DW_ERROR(dwDNNTensor_createNew(&m_dnnInput[inputIdx], &inputProps[inputIdx], m_sdk));
            printf("input tensor create done\n");
        }

        // Allocate outputs
        for (uint32_t outputIdx = 0U; outputIdx < NUM_OUTPUT_TENSORS; ++outputIdx)
        {
            CHECK_DW_ERROR(dwDNN_getOutputTensorProperties(&outputProps[outputIdx], outputIdx, m_dnn));
            // Allocate device tensors
            CHECK_DW_ERROR(dwDNNTensor_createNew(&m_dnnOutputsDevice[outputIdx], &outputProps[outputIdx], m_sdk));
            // Allocate host tensors
            dwDNNTensorProperties hostProps = outputProps[outputIdx];
            hostProps.tensorType            = DW_DNN_TENSOR_TYPE_CPU;

            // Allocate streamer
            CHECK_DW_ERROR(dwDNNTensorStreamer_initialize(&m_dnnOutputStreamers[outputIdx],
                                                          &outputProps[outputIdx],
                                                          hostProps.tensorType, m_sdk));
        }

        // Get coverage and bounding box blob indices
        //const char* coverageBlobName    = "coverage";
        //const char* boundingBoxBlobName = "bboxes";
        //CHECK_DW_ERROR(dwDNN_getOutputIndex(&m_cvgIdx, coverageBlobName, m_dnn));
        //CHECK_DW_ERROR(dwDNN_getOutputIndex(&m_bboxIdx, boundingBoxBlobName, m_dnn));

        // Get metadata from the DNN module.
        // DNN loads metadata automatically from a .json file stored next to the
        // DNN model, with the same name plus an additional .json extension, if
        // present. Otherwise, the metadata is filled with default values and the
        // data conditioner parameters should be filled manually (see the sketch
        // after this block).
        dwDNNMetaData metadata;
        CHECK_DW_ERROR(dwDNN_getMetaData(&metadata, m_dnn));

        for (uint32_t i = 0U; i < NUM_INPUT_TENSORS; ++i)
        {
            // Initialize data conditioner
            CHECK_DW_ERROR(dwDataConditioner_initializeFromTensorProperties(&m_dataConditioner, &inputProps[i], 1U,
                                                                            &metadata.dataConditionerParams, m_cudaStream,
                                                                            m_sdk));

            // Detection region
            m_detectionRegion[i].width = static_cast<uint32_t>(inputProps[i].dimensionSize[0]);
            m_detectionRegion[i].height = static_cast<uint32_t>(inputProps[i].dimensionSize[1]);
            m_detectionRegion[i].x = 0;//(m_imageWidth - m_detectionRegion.width) / 2;
            m_detectionRegion[i].y = 0;//(m_imageHeight - m_detectionRegion.height) / 2;
        }

        // Compute a pixel (cell) in output in relation to a pixel in input of the network
        uint32_t gridW = outputProps[0].dimensionSize[0];
        m_cellSize     = inputProps[0].dimensionSize[0] / gridW;
    }
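
For context, the manual path mentioned in the comment above would look roughly like this (a sketch only; the field names are my reading of dwDataConditionerParams, and the values are placeholders rather than our real settings):

    // Sketch: if no model32.bin.json is found next to the model, fill the
    // data conditioner parameters by hand (placeholder values below).
    dwDNNMetaData metadata{};
    CHECK_DW_ERROR(dwDNN_getMetaData(&metadata, m_dnn));
    metadata.dataConditionerParams.scaleCoefficient            = 1.0f;  // no pixel scaling
    metadata.dataConditionerParams.splitPlanes                 = true;  // planar network input
    metadata.dataConditionerParams.ignoreAspectRatio           = false;
    metadata.dataConditionerParams.doPerPlaneMeanNormalization = false;
    for (uint32_t p = 0U; p < 3U; ++p)
    {
        metadata.dataConditionerParams.meanValue[p] = 0.0f;             // zero mean per plane
    }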

Process:

    dwImageCUDA* currframecuda = nullptr;
    dwImageCUDA* encoder0cuda = nullptr;
    dwImageCUDA* encoder1cuda = nullptr;
    dwImageCUDA* encoder2cuda = nullptr;
    dwImageCUDA* encoder3cuda = nullptr;
    dwImageCUDA* encoder4cuda = nullptr;

    dwImageHandle_t currentFrame[6];
    dwImageProperties currentFrameProps;
    currentFrameProps.type = DW_IMAGE_CUDA;
    currentFrameProps.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    currentFrameProps.format = DW_IMAGE_FORMAT_RGB_FLOAT32;
    currentFrameProps.height = 224;
    currentFrameProps.width = 512;

    dwImageProperties encoder0Props;
    encoder0Props.type = DW_IMAGE_CUDA;
    encoder0Props.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    encoder0Props.format = DW_IMAGE_FORMAT_R_FLOAT32;
    encoder0Props.height = 112;
    encoder0Props.width = 256;

    dwImageProperties encoder1Props;
    encoder1Props.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    encoder1Props.type = DW_IMAGE_CUDA;
    encoder1Props.format = DW_IMAGE_FORMAT_R_FLOAT32;
    encoder1Props.height = 56;
    encoder1Props.width = 128;

    dwImageProperties encoder2Props;
    encoder2Props.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    encoder2Props.type = DW_IMAGE_CUDA;
    encoder2Props.format = DW_IMAGE_FORMAT_R_FLOAT32;
    encoder2Props.height = 28;
    encoder2Props.width = 64;

    dwImageProperties encoder3Props;
    encoder3Props.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    encoder3Props.type = DW_IMAGE_CUDA;
    encoder3Props.format = DW_IMAGE_FORMAT_R_FLOAT32;
    encoder3Props.height = 14;
    encoder3Props.width = 32;

    dwImageProperties encoder4Props;
    encoder4Props.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    encoder4Props.type = DW_IMAGE_CUDA;
    encoder4Props.format = DW_IMAGE_FORMAT_R_FLOAT32;
    encoder4Props.height = 7;
    encoder4Props.width = 16;

    float* hostDataBuffer = (float*)malloc(512 * 224 * 3 * sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/current_frame.raw", (uint8_t *)hostDataBuffer, 512*224*3*4);

    float* hostDataBuffer1 = (float*)malloc(256*112*64 * sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/i_encoder_0.raw", (uint8_t *)hostDataBuffer1, 256*112*64*4);

    float* hostDataBuffer2 = (float*)malloc(128*56*64 * sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/i_encoder_1.raw", (uint8_t *)hostDataBuffer2, 128*56*64*4);

    float* hostDataBuffer3 = (float*)malloc(64*28*128 * sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/i_encoder_2.raw", (uint8_t *)hostDataBuffer3, 64*28*128*4);

    float* hostDataBuffer4 = (float*)malloc(32*14*256* sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/i_encoder_3.raw", (uint8_t *)hostDataBuffer4, 32*14*256*4);

    float* hostDataBuffer5 = (float*)malloc(16*7*512 * sizeof(float));
    loadFileToBuffer("/home/nvidia/input_files/i_encoder_4.raw", (uint8_t *)hostDataBuffer5, 16*7*512*4);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[0],currentFrameProps,m_sdk));

    CHECK_DW_ERROR(dwImage_getCUDA(&currframecuda,currentFrame[0]));
    cudaMemcpy2D(currframecuda->dptr[0], currframecuda->pitch[0], hostDataBuffer,
                 sizeof(uint8_t) * 512 * 3,
                 sizeof(uint8_t) * 512 * 3,
                 224, cudaMemcpyHostToDevice);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[1],encoder0Props,m_sdk));
    CHECK_DW_ERROR(dwImage_getCUDA(&encoder0cuda,currentFrame[1]));

    cudaMemcpy2D(encoder0cuda->dptr[0], encoder0cuda->pitch[0], hostDataBuffer1,
                 sizeof(uint8_t) * 256 * 64,
                 sizeof(uint8_t) * 256 * 64,
                 112, cudaMemcpyHostToDevice);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[2],encoder1Props,m_sdk));
    CHECK_DW_ERROR(dwImage_getCUDA(&encoder1cuda,currentFrame[2]));

    cudaMemcpy2D(encoder1cuda->dptr[0], encoder1cuda->pitch[0], hostDataBuffer2,
                 sizeof(uint8_t) * 128 * 64,
                 sizeof(uint8_t) * 128 * 64,
                 56, cudaMemcpyHostToDevice);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[3],encoder2Props,m_sdk));

    CHECK_DW_ERROR(dwImage_getCUDA(&encoder2cuda,currentFrame[3]));

    cudaMemcpy2D(encoder2cuda->dptr[0], encoder2cuda->pitch[0], hostDataBuffer3,
                 sizeof(uint8_t) * 64 * 128,
                 sizeof(uint8_t) * 64 * 128,
                 28, cudaMemcpyHostToDevice);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[4],encoder3Props,m_sdk));
    CHECK_DW_ERROR(dwImage_getCUDA(&encoder3cuda,currentFrame[4]));
    cudaMemcpy2D(encoder3cuda->dptr[0], encoder3cuda->pitch[0], hostDataBuffer4,
                 sizeof(uint8_t) * 32 * 256,
                 sizeof(uint8_t) * 32 * 256,
                 14, cudaMemcpyHostToDevice);

    CHECK_DW_ERROR(dwImage_create(&currentFrame[5],encoder4Props,m_sdk));
    CHECK_DW_ERROR(dwImage_getCUDA(&encoder4cuda,currentFrame[5]));
    cudaMemcpy2D(encoder4cuda->dptr[0], encoder4cuda->pitch[0], hostDataBuffer5,
                 sizeof(uint8_t) * 16 * 512,
                 sizeof(uint8_t) * 16 * 512,
                 7, cudaMemcpyHostToDevice);

    auto inputImgs = static_cast<dwImageHandle_t  const *>(currentFrame);
    CHECK_DW_ERROR(dwDataConditioner_prepareData(m_dnnInput[0], inputImgs, 6U, &m_detectionRegion[0],
                                                 cudaAddressModeClamp, m_dataConditioner));
    dwConstDNNTensorHandle_t inputs[1U] = {m_dnnInput[0]};
    CHECK_DW_ERROR(dwDNN_infer(m_dnnOutputsDevice, NUM_OUTPUT_TENSORS, inputs, 6U, m_dnn));

Please let me know what I am missing. Do we need 6 input tensors and 6 data conditioner handles, or is a single one enough?
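
For reference, the per-input arrangement I am asking about would look roughly like this (a sketch only; m_dataConditioners would be a hypothetical array of handles, one per input, each initialized from the matching inputProps):

    // Sketch: one data conditioner and one prepared tensor per network input,
    // then a single inference call with all six tensors.
    dwConstDNNTensorHandle_t inputs[NUM_INPUT_TENSORS];
    for (uint32_t i = 0U; i < NUM_INPUT_TENSORS; ++i)
    {
        CHECK_DW_ERROR(dwDataConditioner_prepareData(m_dnnInput[i], &currentFrame[i], 1U,
                                                     &m_detectionRegion[i], cudaAddressModeClamp,
                                                     m_dataConditioners[i]));
        inputs[i] = m_dnnInput[i];
    }
    CHECK_DW_ERROR(dwDNN_infer(m_dnnOutputsDevice, NUM_OUTPUT_TENSORS,
                               inputs, NUM_INPUT_TENSORS, m_dnn));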

Dear @mukilan.vijayakumar,
Firstly, could you confirm whether model32.bin was generated using the TensorRT_Optimization tool?

Dear Siva,

Yes, it was generated using the TensorRT_Optimization tool, and I think the model loads fine.

Please find the attached log image; I got one warning:


DNN: Missing or incompatible parameter in metadata (tonemapType). Parameter is set to default value. See dwDataConditioner for default values.

Dear @mukilan.vijayakumar,
Have you filled in the right entries for your model in the corresponding .json file? If so, could you share the .json file? If possible, please also share the complete code and model so I can reproduce this on my host for more clarity.

Dear Siva,

Please find attached the JSON file, the code, and the model's input and output information.
model32.bin.json (321 Bytes)
main.cpp (32.5 KB)

Dear Siva,

Can you please support here? I am kind of blocked.

Dear @mukilan.vijayakumar,
Could you share the model (ONNX or Caffe) and the command used to generate the optimized model?

Dear Siva,

You mean the command I used to generate model32.bin with the TensorRT_Optimization tool?

Yes.

./tensorRT_optimization --modelType=onnx --onnxFile /home/nvidia/model.onnx --out /home/nvidia/test.bin


WARNING: Using default Logger, most probably DriveWorks library was linked more than once.

DefaultLogger: [03-09-2021 08:46:26] DefaultLogger: WARNING: ExplicitBatch is enabled by default for ONNX models.
Initializing TensorRT generation on model /home/nvidia/model.onnx.

Input filename: /home/nvidia/model.onnx
ONNX IR version: 0.0.6
Opset version: 12
Producer name: pytorch
Producer version: 1.6
Domain:
Model version: 0
Doc string:

onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
onnx2trt_utils.cpp:194: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Tensor DataType is determined at build time for tensors not marked as input or output.
Tensor DataType is determined at build time for tensors not marked as input or output.
Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
Input "current_frame": 1x3x224x512
Input "i_encoder_0": 1x64x112x256
Input "i_encoder_1": 1x64x56x128
Input "i_encoder_2": 1x128x28x64
Input "i_encoder_3": 1x256x14x32
Input "i_encoder_4": 1x512x7x16
Output "o_encoder_0": 1x64x112x256
Output "o_encoder_1": 1x64x56x128
Output "o_encoder_2": 1x128x28x64
Output "o_encoder_3": 1x256x14x32
Output "o_encoder_4": 1x512x7x16
Output "depth": -1x-1x-1x-1
Output "sem": 1x224x512
Output "motion": 1x224x512
Output "det1": 1x30x7x16
Output "det2": 1x30x14x32
Output "det3": 1x30x28x64
Building Engine…
../builder/virtualMachineSyntaxDAG.cpp (906) - Cuda Error in operator(): 0 (Number of registers exceeding the maximum limit, total 5 are needed, but only 4 are available)
../builder/virtualMachineSyntaxDAG.cpp (906) - Cuda Error in operator(): 0 (Number of registers exceeding the maximum limit, total 5 are needed, but only 4 are available)
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 0: 44.9525 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 1: 45.2897 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 2: 45.1352 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 3: 45.1212 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 4: 45.1438 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 5: 45.2138 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 6: 45.0432 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 7: 45.0903 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 8: 45.1281 ms.
Explicit batch network detected and batch size specified, use enqueue without batch size instead.
Iteration 9: 45.1161 ms.
CUDA graph OFF, Average over 10 runs is 45.1234 ms.

The depth output is reported as -1x-1x-1x-1 (dynamic dimensions), so I think the depth part of the model is not loaded properly.

Dear @mukilan.vijayakumar,
Could you share your ONNX model?

Dear Siva,

Sorry, I am not supposed to share the model.
Is there any way we can debug this without sharing the model?
The same model runs successfully with the TensorRT sample; everything works fine there.
Only with the DNN module do we face this issue.

Dear Siva,

Please let me know: does DNN support input of type float32 RGB image?
I always get Driveworks exception thrown: CudaTextureObjectCache::getOrCreateTextureObject. cudaErrorInvalidChannelDescriptor: invalid channel descriptor

Dear @mukilan.vijayakumar,
As I don't have sufficient data to reproduce this, could you try preprocessing the other 'R' buffers and confirm there are no issues with dwDataConditioner_prepareData()?

"DNN support input of type float32 RGB image"

Do you mean dwDataConditioner_prepareData accepts RGB buffers? Note that if the input buffer contains RGBA pixels, channel interleaving is allowed, but if the input buffer is RGB only, interleaving is not allowed. So I think you need to use DW_IMAGE_FORMAT_RGB_UINT8_PLANAR, create the R, G and B buffers separately, and use cudaMemcpy2D to fill those buffers. Also, set doPerPlaneMeanNormalization to true.
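
A minimal sketch of that approach (assuming a pitched planar CUDA image exposes one dptr/pitch pair per plane, and hostPlaneR/hostPlaneG/hostPlaneB are hypothetical host buffers each holding one 512x224 channel, one byte per pixel):

    // Sketch: create a planar RGB uint8 image and fill each plane separately.
    dwImageProperties planarProps{};
    planarProps.type         = DW_IMAGE_CUDA;
    planarProps.memoryLayout = DW_IMAGE_MEMORY_TYPE_PITCH;
    planarProps.format       = DW_IMAGE_FORMAT_RGB_UINT8_PLANAR;
    planarProps.width        = 512;
    planarProps.height       = 224;

    dwImageHandle_t planarImage = DW_NULL_HANDLE;
    CHECK_DW_ERROR(dwImage_create(&planarImage, planarProps, m_sdk));

    dwImageCUDA* planarCuda = nullptr;
    CHECK_DW_ERROR(dwImage_getCUDA(&planarCuda, planarImage));

    const uint8_t* hostPlanes[3] = {hostPlaneR, hostPlaneG, hostPlaneB};
    for (uint32_t p = 0U; p < 3U; ++p)
    {
        cudaMemcpy2D(planarCuda->dptr[p], planarCuda->pitch[p],
                     hostPlanes[p], 512 * sizeof(uint8_t),  // source pitch in bytes
                     512 * sizeof(uint8_t), 224,            // copy width in bytes, height in rows
                     cudaMemcpyHostToDevice);
    }

    // Plus doPerPlaneMeanNormalization = true in the data conditioner parameters.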

Dear Siva,

I need to pass the input as fp32 RGB. Is DW_IMAGE_FORMAT_RGB_FLOAT32 supported by DNN?

Is that the reason I keep getting Driveworks exception thrown: CudaTextureObjectCache::getOrCreateTextureObject. cudaErrorInvalidChannelDescriptor: invalid channel descriptor?

Dear @mukilan.vijayakumar,
No. Please check using the DW_IMAGE_FORMAT_RGB_FLOAT32_PLANAR type.

Dear Siva,

I have tried DW_IMAGE_FORMAT_RGB_FLOAT32_PLANAR as well; still the same issue.

Since I am facing this issue with DNN, I am trying to link the DriveWorks SFM sample with our working TensorRT CNN code.
Will this approach work for the time being?

That is, the camera capture video pipeline (NVIDIA framework) plus the CNN from our TensorRT sample.
Also, in our TensorRT sample we handled pre- and post-processing using OpenCV; does DriveWorks support OpenCV?

Dear @mukilan.vijayakumar,
Did you check using the TensorRT APIs with DriveWorks? Also, I am wondering why you don't use the DNN raw buffers instead of tensor buffers. Any reason?
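
A sketch of that raw-buffer path, assuming the dwDNN_inferRaw() entry point from the DriveWorks headers (batch size 1); inputElemCount/outputElemCount and hostBuffers are hypothetical arrays describing your blobs:

    // Sketch: bypass dwDNNTensor and feed raw float32 device buffers directly.
    float32_t* devIn[NUM_INPUT_TENSORS];
    float32_t* devOut[NUM_OUTPUT_TENSORS];
    for (uint32_t i = 0U; i < NUM_INPUT_TENSORS; ++i)
    {
        cudaMalloc(&devIn[i], inputElemCount[i] * sizeof(float32_t));
        cudaMemcpy(devIn[i], hostBuffers[i],
                   inputElemCount[i] * sizeof(float32_t), cudaMemcpyHostToDevice);
    }
    for (uint32_t o = 0U; o < NUM_OUTPUT_TENSORS; ++o)
    {
        cudaMalloc(&devOut[o], outputElemCount[o] * sizeof(float32_t));
    }
    CHECK_DW_ERROR(dwDNN_inferRaw(devOut, devIn, 1U, m_dnn));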