DW Error DW_CUDA_ERROR executing DW function --> dwImageStreamer_initialize

Hello,
I’m trying to use Drivenet but I have a problem when I try to run dwImageStreamer_initialize.

I get this error:
DW Error DW_CUDA_ERROR executing DW function: dwImageStreamer_initialize(&m_streamerCPU2CUDA, &cpu_prop, DW_IMAGE_CUDA, m_context)

this is my streamer initialization:

void DriveNetInference::ImageStreamer(ImageData image_){

    cudaStreamCreateWithFlags(&this->cudaStream, cudaStreamNonBlocking);

    // CPU image with RGB format
    cpu_prop.type= DW_IMAGE_CPU;
    cpu_prop.width = image_.width;
    cpu_prop.height = image_.height;
    cpu_prop.format = DW_IMAGE_FORMAT_RGB_UINT8;

    CHECK_DW_ERROR(dwImage_create(&m_rgbCPU, cpu_prop, m_context));

    dwImage_getCPU(&this->imageCPU, m_rgbCPU);

    RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "Get CPU");

    this->imageCPU->data[0] = &image_.data[0];
    this->imageCPU->prop = cpu_prop;

    CHECK_DW_ERROR(dwImageStreamer_initialize(&m_streamerCPU2CUDA, &cpu_prop, DW_IMAGE_CUDA, m_context));
 
}

Can someone help me to know the meaning of the error?

Thanks in advance

Software Version
NVIDIA DRIVE™ Software 10.0 (Linux)

Target Operating System
Linux

Hardware Platform
NVIDIA DRIVE™ AGX Xavier DevKit (E3550)

SDK Manager Version
1.5.0.7774

Host Machine Version
native Ubuntu 18.04

Dear @daniel.yustegalvez,
That error indicates issue with GPU when initializing image streamer. Either the usage is wrong or GPU is not working. If you are able to run the shipped Drivenet sample on target, then the issue is in the Image Streamer usage. Please check Simple Image Streamer sample for CPU ->CUDA- > GL use case.

I think my problem is that I can’t transform my CPU image to a CUDA image.

I’m creating manually the CPU image because I have the data in a ROS message.

    dwImage_getCPU(&this->imageCPU, m_rgbCPU);

    RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "Get CPU");

    this->imageCPU->data[0] = &image_.data[0];
    this->imageCPU->prop = cpu_prop;

I’m not sure how to pass from dwImageCPU to dwImageCUDA. And I don’t know how to use this dwImageCPU using the image streamer. I can’t see the relation.

There is a method to transform directly from Image_CPU to Image_CUDA?

Dear @daniel.yustegalvez,
Using ImageStreamer would remove unneccessary memory copies. But, If you want to quickly check the data, you may manually copy CPU data to GPU buffer in dwImageCUDA. In that case, you need to use cudaMemcpy* function from CUDA Runtime API :: CUDA Toolkit Documentation.

Otherwise, fill the CPU buffer manually instead of via ROS and check using Imagestreamer for verification? Try to leverage the code from Simple ImageStreamer sample. You may replace the CPU data fill with ROS later once the pipeline is setup

To implement the first solution I have the following:

CPU image

    dwImageCPU* imageCPU;
    CHECK_DW_ERROR(dwImage_getCPU(&imageCPU, m_rgbCPU));
    imageCPU->data[0] = &image_.data[0];
    imageCPU->prop = cpu_prop;

CUDA image

    dwImageCUDA* imageCUDA;
    dwImageProperties cudaProp = cpu_prop;
    cudaProp.type = DW_IMAGE_CUDA;
    cudaProp.format = DW_IMAGE_FORMAT_RGBA_UINT8;
    dwImage_create(&m_rgbaCUDA, cudaProp, sdk);
    dwImage_getCUDA(&imageCUDA, this->m_rgbaCUDA);
    imageCUDA->prop = cudaProp;

CPU to CUDA

    cudaMemcpy(imageCUDA->array[0], imageCPU->data[0], image_.width*image_.height*3, cudaMemcpyHostToDevice);
    dwObjectDetector_detectObjects(&this->detections, imageCUDA, this->detectorHandle); 

image_ has the following structure:

typedef struct{
    uint8_t* data;
    int width;
    int height;
    std::string encoding;
    bool endianage;
    int Bpp; //Bytes per pixel
} ImageData;

Is this a good use of the function cudaMemcpy? The detection result is always 0 and I think that I’m using incorrectly the cudaMemcpy function.

Dear @daniel.yustegalvez,
I notice cudaProp.format= DW_IMAGE_FORMAT_RGBA_UINT8. But the buffer size in cudamemcpy is image_.width*image_.height*3. Also, double check the data arrangement(if it is RGBARGBA or RRR…GGG…BBB…AAA.) in the CPU buffer?

Is the input CPU buffer is RGB or RGBA?

This is the cpu configuration:

    cpu_prop.type = DW_IMAGE_CPU;
    cpu_prop.width = 480;
    cpu_prop.height = 302;
    cpu_prop.format = DW_IMAGE_FORMAT_RGB_UINT8;

    dwImageCPU* imageCPU;
    CHECK_DW_ERROR(dwImage_create(&m_rgbCPU, cpu_prop, sdk));
    CHECK_DW_ERROR(dwImage_getCPU(&imageCPU, m_rgbCPU));
    imageCPU->data[0] = &image_.data[0];
    imageCPU->prop = cpu_prop;

And the composition is RGBRGBRGB…

do you see something strange in the code?

Dear @SivaRamaKrishnaNV. Is it mandatory to inference the DriveNet Network with a planar image and FP16?
And about the last code, do you see something strange?

Thanks in advance.

Hi @daniel.yustegalvez ,

Please refer to sample_image_streamer_simple source (/usr/local/driveworks/samples/src/image/image_streamer_simple/main.cpp) and check first if the error in dwImageStreamer_initialize() is caused by below.

    imageCPU->data[0] = &image_.data[0];
    imageCPU->prop = cpu_prop;

Dear @daniel.yustegalvez ,
Is it mandatory to inference the DriveNet Network with a planar image and FP16?

Hello @SivaRamaKrishnaNV, @VickNV, thanks for your answers.

I’ve done some changes in the code but it isn’t still working. Here my code:

    //CPU image 480x302
    cpu_prop.type = DW_IMAGE_CPU;
    cpu_prop.width = 480;
    cpu_prop.height = 302;
    cpu_prop.format = DW_IMAGE_FORMAT_RGB_UINT16_PLANAR;

    //CUDA image 480x302
    cudaProp16.type = DW_IMAGE_CUDA;
    cudaProp16.width = 480;
    cudaProp16.height = 302;
    cudaProp16.format = DW_IMAGE_FORMAT_RGB_UINT16_PLANAR;

    //CUDA image 960x480 --> Type and resolution needed to use DriveNet
    cudaProp16_big.type = DW_IMAGE_CUDA;
    cudaProp16_big.width = 960;
    cudaProp16_big.height = 480;
    cudaProp16_big.format = DW_IMAGE_FORMAT_RGB_UINT16_PLANAR;

    CHECK_DW_ERROR(dwImage_create(&m_rgbCPU, cpu_prop, sdk));
    CHECK_DW_ERROR(dwImage_create(&rgbCUDA16, cudaProp16, sdk));
    CHECK_DW_ERROR(dwImage_create(&rgbCUDA16_big, cudaProp16_big, sdk));

Adding data manually to image CPU. (I have the correct data in ImageCPU, by this I think this part is correct)

dwImageCPU* imageCPU = nullptr;
    CHECK_DW_ERROR(dwImage_getCPU(&imageCPU, m_rgbCPU));
    uint32_t index_s;
    std::vector<uint8_t> R;
    std::vector<uint8_t> G;
    std::vector<uint8_t> B;
    for (uint32_t row = 0 ; row < image_.height ; row++) {
        for (uint32_t col = 0 ; col < image_.width ; col++) {
            index_s = 3*(row*image_.width+col);
            R.push_back(image_.data[index_s+0]);
            G.push_back(image_.data[index_s+1]);
            B.push_back(image_.data[index_s+2]);
        }
    }
    imageCPU->prop = cpu_prop;
    imageCPU->data[0] = &R[0];
    imageCPU->data[1] = &G[0];
    imageCPU->data[2] = &B[0];
    imageCPU->pitch[0] = image_.width; 
    imageCPU->pitch[1] = image_.width; 
    imageCPU->pitch[2] = image_.width;

Copy the data to CUDA image and resize to get an image CUDA 960x480

dwImageCUDA* imageCUDA16 = nullptr;
    dwImage_getCUDA(&imageCUDA16, rgbCUDA16);
    imageCUDA16->prop = cudaProp16;

    cudaMemcpy2DToArray ( imageCUDA16->array[0], 0, 0, &imageCPU->data[0], 480, 480, 302,  cudaMemcpyHostToDevice) ;
    cudaMemcpy2DToArray ( imageCUDA16->array[1], 0, 0, &imageCPU->data[1], 480, 480, 302,  cudaMemcpyHostToDevice) ;
    cudaMemcpy2DToArray ( imageCUDA16->array[2], 0, 0, &imageCPU->data[2], 480, 480, 302,  cudaMemcpyHostToDevice) ;

    params.ignoreAspectRatio = false;
    CHECK_DW_ERROR(dwImageTransformation_initialize(&this->imageTransformationEngine, params, sdk));
    CHECK_DW_ERROR(dwImageTransformation_setBorderMode(DW_IMAGEPROCESSING_BORDER_MODE_REPEAT, this->imageTransformationEngine));

    CHECK_DW_ERROR(dwImageTransformation_copyFullImage(rgbCUDA16_big, rgbCUDA16, this->imageTransformationEngine));
    dwImageCUDA* imageCUDA16_big = nullptr;
    CHECK_DW_ERROR(dwImage_getCUDA(&imageCUDA16_big, rgbCUDA16_big));

If I return to copy the data to a CPU image 960x480, the data is always 0. I think that I have an error during the resize but I can’t find it.

Can you help me with it?

Thanks in advance

And… How can I see the data in dwImageCUDA?

Hello @SivaRamaKrishnaNV,

How can I copy an image RGB_UINT8_PLANAR to RGB_FLOAT16_PLANAR?

dwImage_copyConvert()
&
dwImageTransformation_copyFullImage()

Thanks in advance

Well, here a possible solution for the problems commented in this post.

construct CPU image from ROS2 image RGBRGBRGB…(sensors:msgs::msg:image)
cpu_prop.type = DW_IMAGE_CPU;

cpu_prop.width = 480;
    

cpu_prop.height = 302;
cpu_prop.format = DW_IMAGE_FORMAT_RGB_UINT8_PLANAR;
CHECK_DW_ERROR(dwImage_create(&m_rgbCPU, cpu_prop, sdk));

std::vector<uint8_t> R;
std::vector<uint8_t> G;
std::vector<uint8_t> B;
for (uint32_t row = 0 ; row < image_.height ; row++) {
        for (uint32_t col = 0 ; col < image_.width ; col++) {
            index_s = 3*(row*image_.width+col);
            R.push_back(image_.data[index_s+0]);
            G.push_back(image_.data[index_s+1]);
            B.push_back(image_.data[index_s+2]);
        }
    }
imageCPU->prop = cpu_prop;
imageCPU->data[0] = &R[0];
imageCPU->data[1] = &G[0];
imageCPU->data[2] = &B[0];
imageCPU->pitch[0] = image_.width; 
imageCPU->pitch[1] = image_.width; 
imageCPU->pitch[2] = image_.width;

Transform to CUDA Image with the same properties RGB UINT8 PLANAR

dwImageCUDA* imageCUDA16 = nullptr;
CHECK_CUDA_ERROR(dwImage_getCUDA(&imageCUDA16, rgbCUDA16));
imageCUDA16->prop = cudaProp16;

CHECK_DW_ERROR(dwImageStreamer_producerSend(m_rgbCPU, m_streamerCPU2CUDA));
RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "Produced");
CHECK_DW_ERROR(dwImageStreamer_consumerReceive(&rgbCUDA16, 5000, m_streamerCPU2CUDA));

Convert CUDA image from RGB UINT8 PLANAR to RGBA UINT8 to RGB FLOAT16 PLANAR (This is the DriveNet format needed)

CHECK_DW_ERROR(dwImage_copyConvert(rgbCUDA16p, rgbCUDA16, sdk));
CHECK_DW_ERROR(dwImage_copyConvert(rgbCUDA16pp, rgbCUDA16p, sdk));

Resize the image to the size needed for DriveNet

CHECK_DW_ERROR(dwImageTransformation_copyFullImage(rgbCUDA16_big, rgbCUDA16pp, this->imageTransformationEngine));

I’ve needed all this transformation to get the image in the correct format to use it on the DriveNet. (Maybe there is another way easier and faster but I couldn’t found it)

And now… I have a new problem @SivaRamaKrishnaNV, After some iterations, the process eds because every iterations increase the CUDA memory used. When this memory arrive to the limit, the process ends.
Do you know which can be the problem? I’m cleaning the streamer and destroying the image after use it.

CHECK_DW_ERROR(dwImageStreamer_consumerReturn(&rgbCUDA16, m_streamerCPU2CUDA));
CHECK_DW_ERROR(dwImageStreamer_producerReturn(&m_rgbCPU, 5000, m_streamerCPU2CU    if (rgbCUDA16p){
    if (rgbCUDA16p){
        CHECK_DW_ERROR(dwImage_destroy(rgbCUDA16p));
    }
    if (rgbCUDA16pp){
        CHECK_DW_ERROR(dwImage_destroy(rgbCUDA16pp));
    }
...
...

Thanks in advance!

Dear @daniel.yustegalvez,
Glad to hear that you made progress. Image Streamer does not create extra memory.
It looks like some CPU/GPU memory creation inside a loop and requires clean up.

As you say @SivaRamaKrishnaNV we need some “requires clean up” but we don’t know where. Then I tell you the steps done and the errors that I receive every time I run it
At the beginning:
CHECK_DW_ERROR(dwImage_create(&m_rgbCPU, cpu_prop, sdk));
CHECK_DW_ERROR(dwImage_create(&rgbCUDA16, cudaProp16, sdk));
dwImageCPU* imageCPU = nullptr;
CHECK_DW_ERROR(dwImage_getCPU(&imageCPU, m_rgbCPU));
uint32_t index_s;

std::vector<uint8_t> R;
std::vector<uint8_t> G;
std::vector<uint8_t> B;
for (uint32_t row = 0 ; row < image_.height ; row++) {
    for (uint32_t col = 0 ; col < image_.width ; col++) {
        index_s = 3*(row*image_.width+col);
        R.push_back(image_.data[index_s+0]);
        G.push_back(image_.data[index_s+1]);
        B.push_back(image_.data[index_s+2]);
    }
}
imageCPU->prop = cpu_prop;
imageCPU->data[0] = &R[0];
imageCPU->data[1] = &G[0];
imageCPU->data[2] = &B[0];
imageCPU->pitch[0] = image_.width; 
imageCPU->pitch[1] = image_.width; 
imageCPU->pitch[2] = image_.width;

dwImageCUDA* imageCUDA16 = nullptr;
CHECK_CUDA_ERROR(dwImage_getCUDA(&imageCUDA16, rgbCUDA16));
imageCUDA16->prop = cudaProp16;

And in the clean function:
CHECK_DW_ERROR(dwImage_destroy(m_rgbCPU));
CHECK_DW_ERROR(dwImage_destroy(rgbCUDA16));

But when I added

CHECK_DW_ERROR(dwImageStreamer_producerSend(m_rgbCPU, m_streamerCPU2CUDA));
CHECK_DW_ERROR(dwImageStreamer_consumerReceive(&rgbCUDA16, 2000, m_streamerCPU2CUDA));

The execution got have an error
free(): invalid pointer

So I added in the same funtion of clean:
CHECK_DW_ERROR(dwImageStreamer_consumerReturn(&rgbCUDA16, m_streamerCPU2CUDA));
CHECK_DW_ERROR(dwImageStreamer_producerReturn(&m_rgbCPU, 2000, m_streamerCPU2CUDA));
CHECK_DW_ERROR(dwImage_destroy(m_rgbCPU));
CHECK_DW_ERROR(dwImage_destroy(rgbCUDA16));

And I got that error:
terminate called after throwing an instance of ‘std::runtime_error’
what(): [2021-08-09 20:11:06] DW Error DW_INVALID_HANDLE executing DW function:
dwImage_destroy(rgbCUDA16)

Then, I change by:
CHECK_DW_ERROR(dwImageStreamer_consumerReturn(&rgbCUDA16, m_streamerCPU2CUDA));
CHECK_DW_ERROR(dwImageStreamer_producerReturn(&m_rgbCPU, 2000, m_streamerCPU2CUDA));
(dwImage_destroy(m_rgbCPU));
(dwImage_destroy(rgbCUDA16));

And it runned during 7 minute then the memory is full and I got these error:
terminate called after throwing an instance of ‘std::runtime_error’
what(): [2021-08-09 20:21:22] DW Error DW_FAILURE executing DW function:
dwImage_create(&rgbCUDA16, cudaProp16, sdk)

Right before returning that error, using nvidia-smi, I see Volatile GPU-Util between 5 - 7 % and the initial Memory-Usage was 678MiB / 3911MiB

±----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------±---------------------±---------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro T2000 Off | 00000000:01:00.0 On | N/A |
| N/A 55C P0 14W / N/A | 3744MiB / 3911MiB | 5% Default |
| | | N/A |
±------------------------------±---------------------±---------------------+

±----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1342 G /usr/lib/xorg/Xorg 39MiB |
| 0 N/A N/A 1578 G /usr/bin/gnome-shell 73MiB |
| 0 N/A N/A 2209 G /usr/lib/xorg/Xorg 371MiB |
| 0 N/A N/A 2343 G /usr/bin/gnome-shell 63MiB |
| 0 N/A N/A 3327 G …AAAAAAAAA= --shared-files 22MiB |
| 0 N/A N/A 4144 G …AAAAAAAAA= --shared-files 77MiB |
| 0 N/A N/A 6069 G …AAAAAAAAA= --shared-files 25MiB |
| 0 N/A N/A 14560 C …ction/detection 3063MiB |
±----------------------------------------------------------------------------+

do you have any idea @VickNV or @SivaRamaKrishnaNV ?