vpiSubmitTemporalNoiseReduction fails with VPI_ERROR_INVALID_ARGUMENT on buffer created by vpiImageCreateWrapper/VPI_IMAGE_BUFFER_CUDA_PITCH_LINEAR

jhnlmn · December 21, 2024, 1:55am

Hi,
I already mentioned this failure in my previous post How to prevent vpiSubmitConvertImageFormat from calling cudaGraphicsEGLRegisterImage, which kills performance?
I understand that VPI algos on images created with vpiImageCreateWrapper may be somewhat slower.
But I was not able to make vpiSubmitTemporalNoiseReduction work on VpiImage, which wraps a preallocated CUDA memory at all.

See the attached sample code. You can run it to use regular vpiImageCreate and call vpiSubmitTemporalNoiseReduction, which succeeds:

width=256 height=256 ./tnr_wrap # Use vpiImageCreate

And then run:

useWrapper=1 width=256 height=256 ./tnr_wrap # Use vpiImageCreateWrapper
# Prints:
terminate called after throwing an instance of 'std::runtime_error'
  what():  vpiSubmitTemporalNoiseReduction(stream, backend, tnr, imgPrevious, imgInput, imgOutput, &params)
VPI_ERROR_INVALID_ARGUMENT: Previous frame must have the same format configured during payload creation
Aborted

Note that I printed detailed information about created images and the information is identical in both cases (except for CUDA addresses):

imgInput vpiImageGetFormat ret 0xa2a10d1 NV12 Image Size 256x256 numPlanes 2
plane #0: type 0xffff1001 VPI_PIXEL_TYPE_U8 size 256x256  0xffff1001 256 256 pitch 256 0x205807000
plane #1: type 0xffff1011 VPI_PIXEL_TYPE_2U8 size 128x128  0xffff1011 128 128 pitch 256 0x205817000
imgPrevious vpiImageGetFormat ret 0xa2a10d1 NV12 Image Size 256x256 numPlanes 2
plane #0: type 0xffff1001 VPI_PIXEL_TYPE_U8 size 256x256  0xffff1001 256 256 pitch 256 0x205867000
plane #1: type 0xffff1011 VPI_PIXEL_TYPE_2U8 size 128x128  0xffff1011 128 128 pitch 256 0x205877000
imgOutput vpiImageGetFormat ret 0xa2a10d1 NV12 Image Size 256x256 numPlanes 2
plane #0: type 0xffff1001 VPI_PIXEL_TYPE_U8 size 256x256  0xffff1001 256 256 pitch 256 0x205837000
plane #1: type 0xffff1011 VPI_PIXEL_TYPE_2U8 size 128x128  0xffff1011 128 128 pitch 256 0x205847000

So, why vpiSubmitTemporalNoiseReduction would fail?
And how to correct my code to make vpiSubmitTemporalNoiseReduction work on preallocated CUDA memory?

Thank you

/*
Usage:
Save this file to tnr_wrap.cpp

# Compile
g++  -o tnr_wrap -I/usr/local/cuda-12.6/targets/aarch64-linux/include \
    ./tnr_wrap.cpp -L/usr/local/cuda-12.6/targets/aarch64-linux/lib/  -lnvvpi -lcudart

width=256 height=256 ./tnr_wrap # Use vpiImageCreate
# Prints ok

useWrapper=1 width=256 height=256 ./tnr_wrap # Use vpiImageCreateWrapper
# Prints:
terminate called after throwing an instance of 'std::runtime_error'
  what():  vpiSubmitTemporalNoiseReduction(stream, backend, tnr, imgPrevious, imgInput, imgOutput, &params)
VPI_ERROR_INVALID_ARGUMENT: Previous frame must have the same format configured during payload creation
Aborted

*/

#include <vpi/Event.h>
#include <vpi/Image.h>
#include <vpi/Status.h>
#include <vpi/Stream.h>
#include <vpi/algo/TemporalNoiseReduction.h>

#include <cuda_runtime.h>

#include <fstream>
#include <iostream>
#include <sstream>

#define CHECK_STATUS(STMT)                                    \
    do                                                        \
    {                                                         \
        VPIStatus status = (STMT);                            \
        if (status != VPI_SUCCESS)                            \
        {                                                     \
            char buffer[VPI_MAX_STATUS_MESSAGE_LENGTH];       \
            vpiGetLastStatusMessage(buffer, sizeof(buffer));  \
            std::ostringstream ss;                            \
            ss << "" #STMT "\n";                              \
            ss << vpiStatusGetName(status) << ": " << buffer; \
            throw std::runtime_error(ss.str());               \
        }                                                     \
    } while (0);

void CreateImageWrapper(int width, int height, VPIImageFormat imgFormat, uint64_t backend, VPIImage * vpiImage)
{
    assert(imgFormat == VPI_IMAGE_FORMAT_NV12_ER);

    int pitch  = ((width + 255)/256)*256;
    void * cudaPtrY {}, * cudaPtrUV;
    assert(cudaSuccess == cudaMalloc(&cudaPtrY, pitch * height));
    assert(cudaSuccess == cudaMalloc(&cudaPtrUV, pitch * height/2));

    VPIImageData vpiImageData;
    vpiImageData.bufferType = VPI_IMAGE_BUFFER_CUDA_PITCH_LINEAR;
    vpiImageData.buffer.pitch.format = VPI_IMAGE_FORMAT_NV12_ER;
    vpiImageData.buffer.pitch.numPlanes = 2;
    vpiImageData.buffer.pitch.planes[0].pixelType = VPI_PIXEL_TYPE_U8;
    vpiImageData.buffer.pitch.planes[0].width = width;
    vpiImageData.buffer.pitch.planes[0].height = height;
    vpiImageData.buffer.pitch.planes[0].pitchBytes = pitch;
    vpiImageData.buffer.pitch.planes[0].data = cudaPtrY;

    vpiImageData.buffer.pitch.planes[1].pixelType = VPI_PIXEL_TYPE_2U8;
    vpiImageData.buffer.pitch.planes[1].width = width / 2;
    vpiImageData.buffer.pitch.planes[1].height = height / 2;
    vpiImageData.buffer.pitch.planes[1].pitchBytes = pitch;
    vpiImageData.buffer.pitch.planes[1].data = cudaPtrUV;

    VPIImageWrapperParams wrapperParams;
    wrapperParams.colorSpec = VPI_COLOR_SPEC_DEFAULT;

    CHECK_STATUS(vpiImageCreateWrapper(&vpiImageData, &wrapperParams, backend, vpiImage));
}

bool VpiImagePrintFormat(VPIImage image, const char * comment)
{
    VPIImageFormat format {};
    CHECK_STATUS(vpiImageGetFormat(image, &format));

    int32_t imageWidth {}, imageHeight {};
    CHECK_STATUS(vpiImageGetSize(image, &imageWidth, &imageHeight));

    uint32_t fourCC = vpiImageFormatGetFourCC(format);
    int numPlanes = vpiImageFormatGetPlaneCount(format);

    printf("%s vpiImageGetFormat ret 0x%x %.4s Image Size %dx%d numPlanes %d\n",
        comment, (int)format, (char*)&fourCC,
        (int)imageWidth, (int)imageHeight,
        numPlanes);

    VPIImageData imgdata;
    CHECK_STATUS(vpiImageLockData(image, VPI_LOCK_READ, VPI_IMAGE_BUFFER_CUDA_PITCH_LINEAR, &imgdata));

    for(int planeIdx = 0; planeIdx < numPlanes; planeIdx++)
    {
        printf("plane #%d: type 0x%x %s size %dx%d  0x%x %d %d pitch %d %p\n",
            planeIdx,
            (int)vpiImageFormatGetPlanePixelType(format, planeIdx),
            vpiPixelTypeGetName(vpiImageFormatGetPlanePixelType(format, planeIdx)),
            (int)vpiImageFormatGetPlaneWidth(format, imageWidth, planeIdx),
            (int)vpiImageFormatGetPlaneHeight(format, imageHeight, planeIdx),

            (int)imgdata.buffer.pitch.planes[planeIdx].pixelType,
            imgdata.buffer.pitch.planes[planeIdx].width,
            imgdata.buffer.pitch.planes[planeIdx].height,
            imgdata.buffer.pitch.planes[planeIdx].pitchBytes,
            imgdata.buffer.pitch.planes[planeIdx].data

        );
    }
    CHECK_STATUS(vpiImageUnlock(image));

    //Sample output for NV12
    //imgInput vpiImageGetFormat ret 0xa2a10d1 NV12 Image Size 256x256 numPlanes 2
    //plane #0: type 0xffff1001 VPI_PIXEL_TYPE_U8 size 256x256  0xffff1001 256 256 pitch 256 0x205807000
    //plane #1: type 0xffff1011 VPI_PIXEL_TYPE_2U8 size 128x128  0xffff1011 128 128 pitch 256 0x205817000
    return true;
}//bool VpiImagePrintFormat


int main(int argc, char *argv[])
{
    bool useVic = getenv("VPI_BACKEND_VIC") != nullptr;
    uint64_t backend = useVic? VPI_BACKEND_VIC : VPI_BACKEND_CUDA;

    bool useWrapper = getenv("useWrapper") != nullptr;

    const char * temp = getenv("width");
    int width = temp? strtol(temp, nullptr, 10) : 256;

    temp = getenv("height");
    int height = temp? strtol(temp, nullptr, 10) : 256;

    VPIImageFormat imgFormat = VPI_IMAGE_FORMAT_NV12_ER;

    VPIStream stream     = NULL;
    CHECK_STATUS(vpiStreamCreate(backend, &stream));

    VPIPayload tnr    = NULL;
    CHECK_STATUS(vpiCreateTemporalNoiseReduction(backend, width, height, imgFormat, VPI_TNR_DEFAULT, &tnr));

    VPITNRParams params;
    CHECK_STATUS(vpiInitTemporalNoiseReductionParams(&params));

    VPIImage imgInput, imgOutput, imgPrevious;
    if(useWrapper)
    {
        CreateImageWrapper(width, height, imgFormat, backend, &imgInput);
        CreateImageWrapper(width, height, imgFormat, backend, &imgOutput);
        CreateImageWrapper(width, height, imgFormat, backend, &imgPrevious);
    }
    else
    {
        CHECK_STATUS(vpiImageCreate(width, height, imgFormat, backend, &imgInput));
        CHECK_STATUS(vpiImageCreate(width, height, imgFormat, backend, &imgOutput));
        CHECK_STATUS(vpiImageCreate(width, height, imgFormat, backend, &imgPrevious));
    }

    VpiImagePrintFormat(imgInput, "imgInput");
    VpiImagePrintFormat(imgOutput, "imgOutput");
    VpiImagePrintFormat(imgPrevious, "imgPrevious");

    CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, backend, tnr, imgPrevious, imgInput, imgOutput, &params));
    printf("ok\n");
}

AastaLLL · December 23, 2024, 4:19am

Hi,

We can reproduce the same behavior locally.
Will check it further and provide more info to you later.

Thanks.

AastaLLL · December 25, 2024, 2:00am

Hi,

Thanks for your patience. Please try the below changes:

diff --git a/main.cpp b/main.cpp
index e97c2db..e7fe140 100644
--- a/main.cpp
+++ b/main.cpp
@@ -73,7 +73,7 @@ void CreateImageWrapper(int width, int height, VPIImageFormat imgFormat, uint64_
     VPIImageWrapperParams wrapperParams;
     wrapperParams.colorSpec = VPI_COLOR_SPEC_DEFAULT;
 
-    CHECK_STATUS(vpiImageCreateWrapper(&vpiImageData, &wrapperParams, backend, vpiImage));
+    CHECK_STATUS(vpiImageCreateWrapper(&vpiImageData, NULL, backend, vpiImage));
 }
 
 bool VpiImagePrintFormat(VPIImage image, const char * comment)

We can create the wrapper successfully without passing wrapperParams.

$ useWrapper=1 width=256 height=256 ./tnr_wrap
imgInput vpiImageGetFormat ret 0xa2a10d1 NV12 Image Size 256x256 numPlanes 2
plane #0: type 0xffff1001 VPI_PIXEL_TYPE_U8 size 256x256  0xffff1001 256 256 pitch 256 0x205807000
plane #1: type 0xffff1011 VPI_PIXEL_TYPE_2U8 size 128x128  0xffff1011 128 128 pitch 256 0x205817000
imgOutput vpiImageGetFormat ret 0xa2a10d1 NV12 Image Size 256x256 numPlanes 2
plane #0: type 0xffff1001 VPI_PIXEL_TYPE_U8 size 256x256  0xffff1001 256 256 pitch 256 0x205837000
plane #1: type 0xffff1011 VPI_PIXEL_TYPE_2U8 size 128x128  0xffff1011 128 128 pitch 256 0x205847000
imgPrevious vpiImageGetFormat ret 0xa2a10d1 NV12 Image Size 256x256 numPlanes 2
plane #0: type 0xffff1001 VPI_PIXEL_TYPE_U8 size 256x256  0xffff1001 256 256 pitch 256 0x205867000
plane #1: type 0xffff1011 VPI_PIXEL_TYPE_2U8 size 128x128  0xffff1011 128 128 pitch 256 0x205877000
ok

Thanks.

jhnlmn · December 25, 2024, 9:49am

Yes, this works, thank you.
But the performance is terrible: more than 5 ms for my 2624x1944 image if it was wrapped using vpiImageCreateWrapper vs 2 ms if image was created using vpiImageCreate. So, looks like there is no way to efficiently pass NvBuffer to VIC and other VPI APIs.
So, I wonder whether I should abandon NvBuffer (created by NvBufSurfaceAllocate) and instead switch by entire pipeline to VPIImage (created by vpiImageCreate)?
Will I still be able to run custom CUDA kernels on VPIImage (after vpiImageLockData) as efficiently as on NvBuffer/EGLImageKHR ?
Is there any functionality, which is only available with NvBuffer and not VPIImage?
Thank you

AastaLLL · December 30, 2024, 2:03am

Hi,

Have you maximized the VIC clocks as well?
You can find our script in the document below:

https://docs.nvidia.com/vpi/algo_performance.html#maxout_clocks

In some use cases, the VPIImages using vpiImageCreateWrapper might impact the performance.
That’s because some VPI mechanisms need to handle rather than just buffer data itself.

It’s possible to use VPI and CUDA together but you will need to make sure all the VPI tasks are done before calling CUDA kernels.
Below is a relevant discussion for your reference:

Thanks.

system · January 29, 2025, 12:57am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
How to prevent vpiSubmitConvertImageFormat from calling cudaGraphicsEGLRegisterImage, which kills performance? Jetson AGX Orin cuda	9	93	December 5, 2024
VPI CUDA interop with managed memory Jetson AGX Xavier cuda , vpi	16	1734	October 18, 2021
VPI image pointing to managed memory Jetson AGX Xavier vpi	15	960	December 5, 2023
Using VPI in GStreamer Jetson AGX Orin camera , gstreamer , documentation , vpi	51	4993	March 8, 2023
Using VPI in a Custom DeepStream Plugin DeepStream SDK gstreamer , vpi	4	1781	October 12, 2021
Why Jetson vic has a significant performance drop? Jetson Xavier NX vpi	8	42	December 19, 2024
VPI Image wraper around NvBufSurface on Jetson AGX Xavier Jetson AGX Xavier vpi	10	820	December 26, 2023
NvBuffer VPI Interoperability Jetson Nano vpi	6	1844	September 12, 2021
Why vpiSubmitTemporalNoiseReduction in 09-tnr takes 5 ms instead of 1 ms as advertized? Jetson AGX Orin vpi	14	54	October 23, 2024
VPI "Fatal assertion error" when trying to use CUDA-wrapped unified memory in remap on VIC Jetson AGX Xavier vpi	6	1286	February 9, 2022

vpiSubmitTemporalNoiseReduction fails with VPI_ERROR_INVALID_ARGUMENT on buffer created by vpiImageCreateWrapper/VPI_IMAGE_BUFFER_CUDA_PITCH_LINEAR

Related topics