Hi, the VPI - Vision Programming Interface: Temporal Noise Reduction page lists figures around 0.93 ms for TNR denoising of 1920x1080 images using VIC on Orin.
I made a copy of /opt/nvidia/vpi3/samples/09-tnr/main.cpp and added some simple time measurements around vpiSubmitTemporalNoiseReduction.
I tried calling vpiStreamSync before and after vpiSubmitTemporalNoiseReduction and measuring the time with clock_gettime.
I also tried using vpiEventRecord/vpiEventSync/vpiEventElapsedTimeMillis.
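For reference, here is roughly what both measurements looked like (a sketch only; stream, the tnr payload, and the imgPrev/imgCurrent/imgOutput handle names are assumptions standing in for what the sample creates, and CHECK_STATUS is the sample's error-checking macro):

```cpp
#include <time.h>
#include <vpi/Event.h>
#include <vpi/Stream.h>
#include <vpi/algo/TemporalNoiseReduction.h>

// (a) Wall-clock timing: drain the stream, submit once, wait, measure.
struct timespec t0, t1;
CHECK_STATUS(vpiStreamSync(stream));
clock_gettime(CLOCK_MONOTONIC, &t0);
CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, VPI_BACKEND_VIC, tnr,
                                             imgPrev, imgCurrent, imgOutput));
CHECK_STATUS(vpiStreamSync(stream));
clock_gettime(CLOCK_MONOTONIC, &t1);
double wallMs = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;

// (b) VPI events bracketing the same submission on the stream.
VPIEvent evStart, evStop;
CHECK_STATUS(vpiEventCreate(0, &evStart));
CHECK_STATUS(vpiEventCreate(0, &evStop));
CHECK_STATUS(vpiEventRecord(evStart, stream));
CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, VPI_BACKEND_VIC, tnr,
                                             imgPrev, imgCurrent, imgOutput));
CHECK_STATUS(vpiEventRecord(evStop, stream));
CHECK_STATUS(vpiEventSync(evStop));
float evMs;
CHECK_STATUS(vpiEventElapsedTimeMillis(evStart, evStop, &evMs));
```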
Both methods give the same result: 5.1-5.2 ms, more than 5 times longer than the benchmark.
How can I speed it up?
How I tested:
First I resized noisy.mp4 to 1920x1080.
The result on my Orin is quite consistent:
statPerFrameTnr (ms) count 100 av 5.207846 min 5.118753 max 5.286271
I tried various tweaks from the VPI - Vision Programming Interface: Benchmarking page,
like
uint64_t streamFlags = (uint64_t)backend | VPI_REQUIRE_BACKENDS;
and
uint64_t memFlags = (uint64_t)backend | VPI_EXCLUSIVE_STREAM_ACCESS;
But they make no difference.
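For reference, this is where those flags plug in (a sketch; width, height, imgFormat and the image handle names are placeholders for what the sample computes, with backend = VPI_BACKEND_VIC):

```cpp
// Stream restricted to the chosen backend, per the benchmarking guide.
uint64_t streamFlags = (uint64_t)backend | VPI_REQUIRE_BACKENDS;
CHECK_STATUS(vpiStreamCreate(streamFlags, &stream));

// Images enabled only for that backend, with exclusive stream access.
uint64_t memFlags = (uint64_t)backend | VPI_EXCLUSIVE_STREAM_ACCESS;
CHECK_STATUS(vpiImageCreate(width, height, imgFormat, memFlags, &imgCurrent));
CHECK_STATUS(vpiImageCreate(width, height, imgFormat, memFlags, &imgOutput));
```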
4. Run the algorithm in batches and measure its average running time within each batch. The number of calls in a batch varies with the approximate running time (faster algorithms, larger batch, max 100 calls). This is done to exclude the time spent performing the measurement itself from the algorithm runtime.
My camera is 5 MP (pretty small by modern standards), and based on the 1920x1080 benchmark showing under 1 ms, I was hoping TNR would take around 2.5 ms (5 MP is roughly 2.5x the pixels of 1920x1080).
But in reality 1920x1080 takes 5 ms and 5 MP takes about 12 ms, which is over my budget.
The question is whether it is somehow possible to use VIC TNR for a real-time camera, rather than for batch processing as you suggested.
Our benchmark data is generated in a batched manner.
If you need to run it frame by frame, it is expected to take longer.
To verify this, we looped vpiSubmitTemporalNoiseReduction for 1000 iterations and got ~1.13 ms per frame on average:
statPerFrameTnr (ms) count 100 av 1.129722 min 1.097480 max 1.159875
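A minimal sketch of that style of batched measurement, assuming the same handles as above (the key point being a single vpiStreamSync for the whole batch, so per-call submission overhead is amortized):

```cpp
constexpr int N = 1000;
struct timespec t0, t1;
CHECK_STATUS(vpiStreamSync(stream));   // start from an idle stream
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int i = 0; i < N; ++i)
{
    CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, VPI_BACKEND_VIC, tnr,
                                                 imgPrev, imgCurrent, imgOutput));
}
CHECK_STATUS(vpiStreamSync(stream));   // one sync for all N submissions
clock_gettime(CLOCK_MONOTONIC, &t1);
double avgMs = ((t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6) / N;
```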
If you only have about 1 ms of latency budget, you can try a more lightweight algorithm to meet the requirement.
For example, VPI_TNR_V1 or VPI_TNR_V2 in VPI 2.3 (JetPack 5).
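A sketch of where the version is chosen, using the VPI 2.x payload-creation signature (width, height and imgFormat as above):

```cpp
// Create the TNR payload with a lighter algorithm version instead of the
// default; VPI_TNR_V1 is the cheapest, VPI_TNR_V3 the highest quality.
VPIPayload tnr;
CHECK_STATUS(vpiCreateTemporalNoiseReduction(VPI_BACKEND_VIC, width, height,
                                             imgFormat, VPI_TNR_V2, &tnr));
```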
Running vpiSubmitTemporalNoiseReduction in a loop is not a very realistic example, since it processes the same frame again and again (there is nothing temporal about that).
The question is how to use it in a real-time camera ISP.
The benchmarking page you quoted above says about batch processing: "This is done to exclude the time spent performing the measurement itself from the algorithm runtime."
But measurement overhead cannot explain a 5x difference.
Something else is going on with that VIC.
Why is it so expensive to submit new frames compared to repeating the same one? Is it a caching issue? Does the VIC go to sleep between frames? Or is it shared with some other system component, like the display or the MPEG decoder?
Do you know?
I am really impressed by the quality of VIC denoising and would hate to switch to a cheaper, lower-quality version (I tried the CUDA version and it is much weaker).
Please gather more info with Nsight Systems and optimize accordingly.
For example, you might tweak the image’s enabled backends to make sure no implicit copies are being done.
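One quick way to sanity-check that (a sketch; imgCurrent as above) is to confirm that the backend running the algorithm is among the image's enabled backends:

```cpp
#include <cstdio>

// If the VIC backend isn't enabled on the image, every submission forces an
// implicit copy/conversion into a VIC-accessible buffer.
uint64_t imgFlags;
CHECK_STATUS(vpiImageGetFlags(imgCurrent, &imgFlags));
if (!(imgFlags & VPI_BACKEND_VIC))
{
    printf("imgCurrent is not VIC-enabled; implicit copies will happen\n");
}
```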
Regarding this:
Running vpiSubmitTemporalNoiseReduction in a loop is not a very realistic example, since it processes the same frame again and again (there is nothing temporal about that).
The question is how to use it in a real-time camera ISP.
For the algorithm it doesn't matter (much) whether it's fed the same image at every call or not, so batching is still a viable optimization.
The main goal of an optimized pipeline is to minimize calls to vpiStreamSync.
Some approaches increase latency but don't affect the frame rate (no frames dropped); others don't affect latency, and so on.
Providing examples of exactly what to do is currently outside VPI's scope, because it depends a lot on the entire processing pipeline.
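As a generic illustration only (not an official VPI recipe; DEPTH, numFrames and the per-slot prev/cur/out ring buffers are placeholders to adapt to your pipeline), here is one way to minimize syncs:

```cpp
// Let submissions run ahead of the CPU and block only on the frame that is
// DEPTH submissions behind: latency grows by DEPTH frames, throughput doesn't.
constexpr int DEPTH = 2;
VPIEvent done[DEPTH];
for (int i = 0; i < DEPTH; ++i)
    CHECK_STATUS(vpiEventCreate(0, &done[i]));

for (int f = 0; f < numFrames; ++f)
{
    int slot = f % DEPTH;
    if (f >= DEPTH)
        CHECK_STATUS(vpiEventSync(done[slot])); // frame f-DEPTH is now done
    CHECK_STATUS(vpiSubmitTemporalNoiseReduction(stream, VPI_BACKEND_VIC, tnr,
                                                 prev[slot], cur[slot], out[slot]));
    CHECK_STATUS(vpiEventRecord(done[slot], stream));
    // ...hand the frame f-DEPTH output downstream here...
}
CHECK_STATUS(vpiStreamSync(stream)); // drain the tail
```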