Why does vpiSubmitTemporalNoiseReduction in 09-tnr take 5 ms instead of the advertised 1 ms?

Hi,
VPI - Vision Programming Interface: Temporal Noise Reduction lists numbers around 0.93 ms for TNR denoising of 1920x1080 images using the VIC on Orin.
I made a copy of /opt/nvidia/vpi3/samples/09-tnr/main.cpp and added some simple time measurements around vpiSubmitTemporalNoiseReduction.
I tried calling vpiStreamSync before and after vpiSubmitTemporalNoiseReduction and measuring the time with clock_gettime.
I also tried using vpiEventRecord/vpiEventSync/vpiEventElapsedTimeMillis.
The results are the same either way: 5.1 - 5.2 ms, more than 5 times longer than the benchmark.
How can I speed it up?
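For reference, the timing was done roughly like this (a sketch only; the stream, TNR payload and images are created exactly as in the sample, and getTimeNS() is a thin clock_gettime(CLOCK_MONOTONIC) wrapper, see the full source later in the thread):

// Wall-clock variant: sync before and after, so only the submit + VIC execution is timed.
CHECK_STATUS(vpiStreamSync(stream));                  // drain any previously queued work
uint64_t t0 = getTimeNS();
CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, 0, tnr, imgPrevious, imgCurrent, imgOutput, &params));
CHECK_STATUS(vpiStreamSync(stream));                  // wait for the VIC to finish this frame
double wallMs = (getTimeNS() - t0) * 1e-6;

// Event variant: record events around the submission and read back the elapsed time.
VPIEvent evStart = NULL, evEnd = NULL;
CHECK_STATUS(vpiEventCreate(VPI_BACKEND_VIC, &evStart));
CHECK_STATUS(vpiEventCreate(VPI_BACKEND_VIC, &evEnd));
CHECK_STATUS(vpiEventRecord(evStart, stream));
CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, 0, tnr, imgPrevious, imgCurrent, imgOutput, &params));
CHECK_STATUS(vpiEventRecord(evEnd, stream));
CHECK_STATUS(vpiEventSync(evEnd));
float eventMs = 0;
CHECK_STATUS(vpiEventElapsedTimeMillis(evStart, evEnd, &eventMs));

Both variants report the same 5.1 - 5.2 ms.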

How I tested:
First resize noisy.mp4 to 1920x1080.

gst-launch-1.0 filesrc location=/opt/nvidia/vpi3/samples/assets/noisy.mp4 ! \
    qtdemux ! queue ! h264parse ! nvv4l2decoder ! nvvidconv ! \
    'video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080' ! \
    nvv4l2h265enc bitrate=20000000 ! h265parse ! queue ! \
    qtmux name=mux ! filesink location=/mnt/tmpfs/1920x1080.mp4 -e

Then compile the attached 09_tnr_main.cpp:

g++  -I"/usr/include/opencv4" -o 09_tnr_main ./09_tnr_main.cpp  -lopencv_core -lnvvpi -lopencv_videoio

Then increase clocks to max according to VPI - Vision Programming Interface: Performance Benchmark:

sudo ./clocks.sh --max

Then run:

jlMeasureTnrOnly=1 skipFrames=50 ./09_tnr_main vic /mnt/tmpfs/1920x1080.mp4

Result on my Orin is pretty consistent:
statPerFrameTnr (ms) count 100 av 5.207846 min 5.118753 max 5.286271

I tried various tweaks from VPI - Vision Programming Interface: Benchmarking, such as
uint64_t streamFlags = (uint64_t)backend | VPI_REQUIRE_BACKENDS;
and
uint64_t memFlags = (uint64_t)backend | VPI_EXCLUSIVE_STREAM_ACCESS;
but they made no difference.
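For reference, this is roughly how those flags were applied (a sketch using the sample's variable names):

uint64_t streamFlags = (uint64_t)backend | VPI_REQUIRE_BACKENDS;      // fail instead of silently falling back to another backend
CHECK_STATUS(vpiStreamCreate(streamFlags, &stream));

uint64_t memFlags = (uint64_t)backend | VPI_EXCLUSIVE_STREAM_ACCESS;  // images accessed by a single stream only
CHECK_STATUS(vpiImageCreate(width, height, imgFormat, memFlags, &imgInput));
CHECK_STATUS(vpiImageCreate(width, height, imgFormat, memFlags, &imgPrevious));
CHECK_STATUS(vpiImageCreate(width, height, imgFormat, memFlags, &imgOutput));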

09_tnr_main.txt (11.2 KB)

Thank you

Hi,

CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, 0, tnr, curFrame == 1 ? NULL : imgPrevious, imgCurrent,
                                             imgOutput, &params));

It looks like you submit one TNR task at a time.
Could you submit the tasks in batches, as described in the document below?

https://docs.nvidia.com/vpi/algo_performance.html#benchmark

4. Run the algorithm in batches and measure its average running time within each batch. The number of calls in a batch varies with the approximate running time (faster algorithms, larger batch, max 100 calls). This is done to exclude the time spent performing the measurement itself from the algorithm runtime.
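In rough terms, a batched measurement looks like the sketch below (stream, payload, and images created as in your sample; std::chrono is used here just for the timer):

// Submit a whole batch of TNR calls, sync once, and divide by the batch size.
const int batchSize = 100;
CHECK_STATUS(vpiStreamSync(stream));                  // start from an idle stream
auto t0 = std::chrono::steady_clock::now();           // requires <chrono>
for (int i = 0; i < batchSize; ++i)
{
    CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, 0, tnr, imgPrevious, imgCurrent,
                                                 imgOutput, &params));
}
CHECK_STATUS(vpiStreamSync(stream));                  // one sync for the whole batch
double msPerCall =
    std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count() / batchSize;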

Thanks.

First, I just copied the only available example of vpiSubmitTemporalNoiseReduction from /opt/nvidia/vpi3/samples/09-tnr/main.cpp

Second, I need de-noising for a camera, so batch processing is not an option.

My camera is 5 MP (pretty small by modern standards), and I was hoping the TNR time would be around 2.5 ms, based on the 1920x1080 benchmark showing less than 1 ms.
But in practice 1920x1080 takes 5 ms and 5 MP takes about 12 ms, which is over budget.
The question is whether it is possible to use VIC TNR for a real-time camera instead of the batch processing you suggested.

Hi,

Our benchmark data is generated in batch mode.
If you run it frame by frame, the time is expected to be longer.

To verify this, we looped vpiSubmitTemporalNoiseReduction for 1000 iterations and got ~1.13 ms per frame on average.

statPerFrameTnr (ms) count 100 av 1.129722 min 1.097480 max 1.159875

If your latency budget is only 1 ms, you can try a more lightweight algorithm to meet the requirement,
for example VPI_TNR_V1 or VPI_TNR_V2 in VPI 2.3 (JetPack 5).

https://docs.nvidia.com/vpi/2.3/group__VPI__TNR.html#gaea4bf2f9a345df3bd802b552dfb5559a
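The TNR version is selected when creating the payload; a minimal sketch of the call (width, height, and format as in the sample):

// Sketch: pick a lighter TNR version when creating the payload.
CHECK_STATUS(vpiCreateTemporalNoiseReduction(VPI_BACKEND_VIC, width, height,
                                             VPI_IMAGE_FORMAT_NV12_ER,
                                             VPI_TNR_V2 /* or VPI_TNR_V1 */, &tnr));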

Thanks.

Running vpiSubmitTemporalNoiseReduction in a loop is not a very realistic example, since it processes the same frame again and again (there is nothing temporal about it).
The question is how to use it in a real-time camera ISP.

The benchmarking page that you quoted above says about batch processing: "This is done to exclude the time spent performing the measurement itself from the algorithm runtime."
But time spent on measurement cannot explain a 5x difference.
Something else is going on with that VIC.
Why is it so expensive to submit new frames compared to repeating the same one? Is it a caching issue? Does it go to sleep between frames? Or is it used by some other system component, like the display? Or the MPEG decoder?

Do you know?

I am really impressed by the quality of the VIC de-noising and would hate to switch to a cheaper and poorer version (I tried the CUDA version and it is much weaker).

Thank you

Hi,

We can check it further with our internal team.

Usually, there is some overhead when submitting a task.
We will provide more details to you later.

Thanks.

Hi,

Please gather more info with Nsight Systems and optimize accordingly.
For example, you might tweak the backends enabled on the images to make sure no implicit copies are being made.

Regarding this:

Running vpiSubmitTemporalNoiseReduction in a loop is not a very realistic example, since it processes the same frame again and again (there is nothing temporal about it).
The question is how to use it in a real-time camera ISP.

For the algorithm, it doesn’t matter (much) if it’s fed the same image at every call or not, so batching is still a possible way for optimization.

The main goal of an optimized pipeline is to minimize calls to vpiStreamSync.
Some approaches increase latency but don't affect the frame rate (no frame dropping); others don't affect latency, and so on.
Providing examples of exactly what to do is outside VPI's current scope, because it depends heavily on the entire processing pipeline.
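As an illustration only (not a complete recipe; the helper names below are hypothetical), one common pattern is to overlap CPU work with the VIC, so the per-frame sync no longer shows up as pure waiting:

VPIImage in[2];                                // ping-pong input buffers (hypothetical setup)
int cur = 0;
readFrameInto(in[cur]);                        // hypothetical: fill the first input buffer
for (int n = 0; n < numFrames; ++n)
{
    CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, 0, tnr,
                 n == 0 ? nullptr : imgPrevious, in[cur], imgOutput, &params));

    readFrameInto(in[cur ^ 1]);                // CPU reads frame n+1 while the VIC denoises frame n
    CHECK_STATUS(vpiStreamSync(stream));       // block only after the CPU-side work is done
    consumeOutput(imgOutput);                  // hypothetical: encode/display frame n
    std::swap(imgPrevious, imgOutput);         // TNR feedback, as in the sample
    cur ^= 1;
}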

Thanks.

A common mistake when benchmarking is to call the function in question in a loop on constant input data, or to never use its results.
In the first case an optimizing compiler or runtime will call the function only once and reuse the result.
In the second case it may skip the call entirely because the results are never used.

I wrote the shortest possible example, which reads real raw YUV frames from a file (in tmpfs), calls vpiSubmitTemporalNoiseReduction and vpiStreamSync, and saves the result to another YUV file.

If I run it on 1000 frames I get:
statPerFrameTnr (ms) count 998 av 4.350390 min 1.130550 max 4.923569
As you can see, it is still almost 5 times slower than in your benchmark.
But if I run with repeatOneFrame=1, I get:
statPerFrameTnr (ms) count 998 av 2.753826 min 1.106198 max 2.947012
now it is about 30% faster.
If I run with dropResults=1, then:
statPerFrameTnr (ms) count 998 av 1.939635 min 1.049718 max 2.067214
If I run with repeatOneFrame=1 dropResults=1, then:
statPerFrameTnr (ms) count 998 av 0.992015 min 0.976600 max 1.092695
Now the results are close to your published numbers, but they are completely fake,
because neither the input nor the output images are ever accessed.

I tried Nsight Systems, as you suggested, but it only confirmed my measurements that
TemporalNoiseReduction takes 5 ms instead of 1:
Time(%)  Total Time(ns)  Instances  Avg(ns)      Med(ns)      Min(ns)    Max(ns)    StdDev(ns)  Style    Range
18.9     5,547,686,816   1,000      5,547,686.8  5,603,200.0  1,326,816  6,224,288  466,479.6   PushPop  VPI:vpiStreamSync
18.1     5,311,714,592   1,000      5,311,714.6  5,363,840.0  1,193,536  5,996,288  453,219.4   PushPop  VPI:TemporalNoiseReduction
16.5     4,819,926,560   1,000      4,819,926.6  4,875,072.0  788,416    4,968,224  443,469.7   PushPop  VPI:sync tegra

Please help me optimize this minimal code, or provide another example that uses real images, applies TNR, uses the results, and runs in a time somewhat close to your benchmarks.

Thank you

/*
Usage:
g++  -o tnr_file ./tnr_file.cpp -lnvvpi

sudo mkdir /mnt/tmpfs
sudo chown $USER:$USER /mnt/tmpfs
sudo mount -t tmpfs -o size=16g tmpfs /mnt/tmpfs

gst-launch-1.0 filesrc location=/opt/nvidia/vpi3/samples/assets/noisy.mp4 ! qtdemux ! queue ! h264parse ! avdec_h264 ! \
    nvvidconv ! 'video/x-raw, format=YUY2, width=1920, height=1080' ! \
    filesink location=/mnt/tmpfs/out.yuv -e
    
sudo ./clocks.sh --max

VPI_IMAGE_FORMAT_YUYV_ER=1 strength=1 inFile=/mnt/tmpfs/out.yuv outFile=/mnt/tmpfs/out2.yuv \
    width=1920 height=1080 numFrames=1000 ./tnr_file
Result:
    statPerFrameTnr (ms) count 98 av 3.967568 min 1.138996 max 4.501203
    statPerFrameTnr (ms) count 998 av 4.350390 min 1.130550 max 4.923569
    
DISPLAY=:0 ffplay -v info -f rawvideo -pixel_format yuyv422 -video_size 1920x1080 /mnt/tmpfs/out2.yuv
    
repeatOneFrame=1 VPI_IMAGE_FORMAT_YUYV_ER=1 strength=1 inFile=/mnt/tmpfs/out.yuv outFile=/mnt/tmpfs/out2.yuv \
    width=1920 height=1080 numFrames=1000 ./tnr_file
Result:
    statPerFrameTnr (ms) count 98 av 2.559179 min 1.067350 max 2.887748
    statPerFrameTnr (ms) count 998 av 2.753826 min 1.106198 max 2.947012
    
dropResults=1 VPI_IMAGE_FORMAT_YUYV_ER=1 strength=1 inFile=/mnt/tmpfs/out.yuv outFile=/mnt/tmpfs/out2.yuv \
    width=1920 height=1080 numFrames=1000 ./tnr_file
Result:
    statPerFrameTnr (ms) count 98 av 1.824454 min 1.064086 max 1.985774
    statPerFrameTnr (ms) count 998 av 1.939635 min 1.049718 max 2.067214
    
repeatOneFrame=1 dropResults=1 VPI_IMAGE_FORMAT_YUYV_ER=1 strength=1 inFile=/mnt/tmpfs/out.yuv outFile=/mnt/tmpfs/out2.yuv \
    width=1920 height=1080 numFrames=1000 ./tnr_file
    statPerFrameTnr (ms) count 998 av 0.992015 min 0.976600 max 1.092695
*/

#include <vpi/Event.h>
#include <vpi/Image.h>
#include <vpi/Status.h>
#include <vpi/Stream.h>
#include <vpi/algo/ConvertImageFormat.h>
#include <vpi/algo/TemporalNoiseReduction.h>

#include <algorithm>
#include <cstdio>   // printf
#include <cstdlib>  // getenv, strtol, strtod
#include <cstring>
#include <ctime>    // clock_gettime
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <vector>

uint64_t getTimeNS() 
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

struct Stat
{
    void Add(int value)
    {
        if(!this->count || this->min > value)
        {
            this->min = value;
        }
        if(!this->count || this->max < value)
        {
            this->max = value;
        }
        this->count++;
        this->total += value;
    }
    
    void Print(const char * name, double ratio)
    {
        printf("%s count %d av %lf min %lf max %lf\n", 
               name, this->count, this->count? (this->total * ratio / this->count) : 0.0, ratio * this->min, ratio * this->max);
    }
    
    int min {-1};
    int max {-1};
    int count {0};
    long long total {0};
};

Stat statPerFrameTnr;

#define CHECK_STATUS(STMT)                                    \
    do                                                        \
    {                                                         \
        VPIStatus status = (STMT);                            \
        if (status != VPI_SUCCESS)                            \
        {                                                     \
            char buffer[VPI_MAX_STATUS_MESSAGE_LENGTH];       \
            vpiGetLastStatusMessage(buffer, sizeof(buffer));  \
            std::ostringstream ss;                            \
            ss << "" #STMT "\n";                              \
            ss << vpiStatusGetName(status) << ": " << buffer; \
            throw std::runtime_error(ss.str());               \
        }                                                     \
    } while (0);

int main(int argc, char *argv[])
{
    VPIStream stream     = NULL;
    VPIImage imgPrevious = NULL, imgInput = NULL, imgOutput = NULL;
    VPIImage imageCvWrapper = NULL;
    VPIPayload tnr    = NULL;

    // main return value
    int retval = 0;

    VPIBackend backend {VPI_BACKEND_VIC};
    
    const char * inFileName = getenv("inFile");
    std::ifstream inFile(inFileName ? inFileName : "", std::ios::binary);
    if(!inFile)
    {
        printf("Cannot open %s for reading\n", inFileName ? inFileName : "(env var inFile not set)");
        return -1;
    }
    inFile.seekg (0, inFile.end);
    int inFileSize = inFile.tellg();
    inFile.seekg (0, inFile.beg);
    
    const char * outFileName = getenv("outFile");
    std::ofstream outFile(outFileName ? outFileName : "", std::ios::binary);
    if(!outFile)
    {
        printf("Cannot open %s for writing\n", outFileName ? outFileName : "(env var outFile not set)");
        return -1;
    }
    
    const char * temp = getenv("wifth");
    int width = temp? strtol(temp, nullptr, 10) : 1920;
    
    temp = getenv("height");
    int height = temp? strtol(temp, nullptr, 10) : 1080;
    
    temp = getenv("numFrames");
    int numFrames = temp? strtol(temp, nullptr, 10) : 1000;
        
    temp = getenv("printFormat");
    int printFormat = temp && *temp == '1';

    CHECK_STATUS(vpiStreamCreate(backend, &stream));
    
    uint64_t memFlags {backend};
    memFlags |= VPI_BACKEND_CPU;//Need this to lock images
    
    temp = getenv("VPI_EXCLUSIVE_STREAM_ACCESS");
    if(temp && *temp == '1')
    {
        memFlags |= VPI_EXCLUSIVE_STREAM_ACCESS;
    }
    temp = getenv("VPI_BACKEND_CUDA");
    if(temp && *temp == '1')
    {
        memFlags |= VPI_BACKEND_CUDA;
    }
    VPIImageFormat imgFormat = VPI_IMAGE_FORMAT_NV12_ER;
    temp = getenv("VPI_IMAGE_FORMAT_YUYV_ER");
    if(temp && *temp == '1')
    {
        imgFormat = VPI_IMAGE_FORMAT_YUYV_ER;
    }
    temp = getenv("VPI_IMAGE_FORMAT_UYVY_ER");
    if(temp && *temp == '1')
    {
        imgFormat = VPI_IMAGE_FORMAT_UYVY_ER;
    }
    
    CHECK_STATUS(vpiImageCreate(width, height, imgFormat, memFlags, &imgInput));
    CHECK_STATUS(vpiImageCreate(width, height, imgFormat, memFlags, &imgPrevious));
    CHECK_STATUS(vpiImageCreate(width, height, imgFormat, memFlags, &imgOutput));
    
    CHECK_STATUS(vpiCreateTemporalNoiseReduction(backend, width, height, imgFormat, VPI_TNR_DEFAULT, &tnr));

    VPITNRParams params;
    CHECK_STATUS(vpiInitTemporalNoiseReductionParams(&params));
    
    temp = getenv("preset");
    if(temp)
    {
        params.preset = (VPITNRPreset)strtol(temp, nullptr, 10);
    }
    temp = getenv("strength");
    if(temp)
    {
        params.strength = strtod(temp, nullptr);
    }
    printf("tnr params preset: %d strength: %lf\n", (int)params.preset, (double)params.strength);
    
    temp = getenv("repeatOneFrame");
    bool repeatOneFrame = temp && *temp == '1';
    
    temp = getenv("dropResults");
    bool dropResults = temp && *temp == '1';

    VPIEvent evStart = NULL;
    VPIEvent evEnd = NULL;
    CHECK_STATUS(vpiEventCreate(backend, &evStart));
    CHECK_STATUS(vpiEventCreate(backend, &evEnd));
    
    printf("Run loop\n");
    
    int filePos = 0;
    for(int frameOrdinal = 0; frameOrdinal < numFrames; frameOrdinal++)
    {
        if(!repeatOneFrame || frameOrdinal == 0)
        {
            //This is one way to read file: lock VPI image and read directly to it.
            VPIImageData imgdata;
            CHECK_STATUS(vpiImageLockData(imgInput, VPI_LOCK_WRITE, VPI_IMAGE_BUFFER_HOST_PITCH_LINEAR, &imgdata));
            int frameSize {};
            for(int planeIdx = 0; planeIdx < imgdata.buffer.pitch.numPlanes; planeIdx++)
            {
                int size = imgdata.buffer.pitch.planes[planeIdx].pitchBytes * imgdata.buffer.pitch.planes[planeIdx].height;
                inFile.read((char*)imgdata.buffer.pitch.planes[planeIdx].data, size);
                frameSize += size;
                if(printFormat && frameOrdinal == 0)
                {
                    printf("planeIdx %d width %d height %d pitchBytes %d\n", planeIdx,
                        imgdata.buffer.pitch.planes[planeIdx].width,
                        imgdata.buffer.pitch.planes[planeIdx].height,
                        imgdata.buffer.pitch.planes[planeIdx].pitchBytes);
                }
            }
            CHECK_STATUS(vpiImageUnlock(imgInput));
            if(!inFile)
            {
                printf("Failed to read frame of size %d at pos %d\n", frameSize, filePos);
                return -1;
            }
            filePos += frameSize;
            //printf("frameOrdinal %d frameSize %d filePos %d\n", frameOrdinal, frameSize, filePos);
            if(filePos == inFileSize)
            {
                filePos = 0;
                //printf("seekg 0\n");
                inFile.seekg (0, inFile.beg);
            }
        }
        uint64_t timeStart = getTimeNS();
        
        CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, 0, tnr, 
            frameOrdinal == 0 ? nullptr: imgPrevious, imgInput, imgOutput, &params));
        CHECK_STATUS(vpiStreamSync(stream));
        
        if(frameOrdinal >= 2)//Do not count first few frames
        {
            statPerFrameTnr.Add( (int)(getTimeNS() - timeStart) );
        }
        
        if(!dropResults)
        {
            VPIImageData imgdata;
            CHECK_STATUS(vpiImageLockData(imgOutput, VPI_LOCK_READ, VPI_IMAGE_BUFFER_HOST_PITCH_LINEAR, &imgdata));
            for(int planeIdx = 0; planeIdx < imgdata.buffer.pitch.numPlanes; planeIdx++)
            {
                int size = imgdata.buffer.pitch.planes[planeIdx].pitchBytes * imgdata.buffer.pitch.planes[planeIdx].height;
                outFile.write((const char*)imgdata.buffer.pitch.planes[planeIdx].data, size);
            }
            CHECK_STATUS(vpiImageUnlock(imgOutput));
        }
        
        std::swap(imgPrevious, imgOutput);
    }//for(int frameOrdinal = 0; frameOrdinal < numFrames; frameOrdinal++)
    
    printf("repeatOneFrame=%d dropResults=%d\n", repeatOneFrame, dropResults);
    statPerFrameTnr.Print("statPerFrameTnr (ms)", 1E-6);
    
    vpiStreamDestroy(stream);
    vpiPayloadDestroy(tnr);
    vpiImageDestroy(imgPrevious);
    vpiImageDestroy(imgInput);
    vpiImageDestroy(imgOutput);
    vpiImageDestroy(imageCvWrapper);

    return 0;
}

Hi,

Thanks for the feedback.

We will check whether the same behavior (fake results) can be reproduced in our environment,
and then follow up with our internal team for further suggestions.

Thanks.

Hi,

We tested your source and the results are different from what you shared.

In the general case, we got:

$ VPI_IMAGE_FORMAT_YUYV_ER=1 strength=1 inFile=./out.yuv outFile=./out2.yuv     width=1920 height=1080 numFrames=1000 ./tnr_file
tnr params preset: 0 strength: 1.000000
Run loop
repeatOneFrame=0 dropResults=0
statPerFrameTnr (ms) count 998 av 1.058258 min 0.986588 max 1.191355

Repeating the same input frame, we got:

$ repeatOneFrame=1 VPI_IMAGE_FORMAT_YUYV_ER=1 strength=1 inFile=./out.yuv outFile=./out2.yuv     width=1920 height=1080 numFrames=1000 ./tnr_file
tnr params preset: 0 strength: 1.000000
Run loop
repeatOneFrame=1 dropResults=0
statPerFrameTnr (ms) count 998 av 0.999805 min 0.982971 max 1.057563

Dropping the output as well, we got:

$ repeatOneFrame=1 dropResults=1 VPI_IMAGE_FORMAT_YUYV_ER=1 strength=1 inFile=./out.yuv outFile=./out2.yuv     width=1920 height=1080 numFrames=1000 ./tnr_file
tnr params preset: 0 strength: 1.000000
Run loop
repeatOneFrame=1 dropResults=1
statPerFrameTnr (ms) count 998 av 0.987270 min 0.969499 max 1.115483

As you can see, all three cases are close to the benchmark table.

Thanks.

Hi,

Just for your reference, we also tested reading/writing the YUV from /mnt/tmpfs/, but the results are similar.

repeatOneFrame=0 dropResults=0

$ VPI_IMAGE_FORMAT_YUYV_ER=1 strength=1 inFile=/mnt/tmpfs/out.yuv outFile=/mnt/tmpfs/out2.yuv     width=1920 height=1080 numFrames=1000 ./tnr_file
tnr params preset: 0 strength: 1.000000
Run loop
repeatOneFrame=0 dropResults=0
statPerFrameTnr (ms) count 998 av 0.997778 min 0.984299 max 1.078697

repeatOneFrame=1 dropResults=1

$ repeatOneFrame=1 dropResults=1 VPI_IMAGE_FORMAT_YUYV_ER=1 strength=1 inFile=/mnt/tmpfs/out.yuv outFile=/mnt/tmpfs/out2.yuv     width=1920 height=1080 numFrames=1000 ./tnr_file
tnr params preset: 0 strength: 1.000000
Run loop
repeatOneFrame=1 dropResults=1
statPerFrameTnr (ms) count 998 av 0.985354 min 0.969035 max 1.062569

Thanks.

Hi,
You forgot to mention that in the past week or two you modified clocks.sh on VPI - Vision Programming Interface: Performance Benchmark,
in particular, replaced
vicctrl=/sys/devices/platform/13e40000.host1x/15340000.vic
vicfreqctrl=$vicctrl/devfreq/15340000.vic
with
vicctrl=/sys/devices/platform/bus@0/13e00000.host1x/15340000.vic
vicfreqctrl=$vicctrl/devfreq/15340000.vic
With the new clocks.sh I get the same results as you.
Thank you

Hi,

Sorry for the inconvenience.
The original clocks.sh is for JetPack 5; we recently updated it for JetPack 6 (device node changes).

Thanks.
