Could someone give one code example - how to use nppiSqrDistanceValid_Norm_8u32f_C1R_Ctx?

Could someone give a code example showing how to use nppiSqrDistanceValid_Norm_8u32f_C1R_Ctx (C++, Windows 10, CUDA 10.2)?
The function has been running very slowly with the default stream since I created multiple streams on multiple threads.

Hi Jemma,
In CUDA Toolkit 10.2, NPP supports only the default stream; it has an issue with handling multiple streams.
This multiple-stream issue is fixed in CUDA Toolkit 11.2, so I recommend upgrading the CUDA Toolkit.

I’ve attached a sample for the nppiSqrDistanceValid_Norm_8u32f_C1R_Ctx API use case. SampleTestNPP.7z (3.9 KB)


Thanks, I will download CUDA Toolkit 11.2 and try to build it.

Hi mkhadatare,
I downloaded your SampleTestNPP.7z, found it was a 10.2 sample, and modified it to use multiple streams with multiple threads. It seems to work with 10.2, as shown below.

But I found some “instrumentation” sections (red rectangles) in the attached picture. I am not familiar with the NVIDIA profiler yet. What does “instrumentation” mean, and which main functions are located in the “instrumentation” section?
I’ve attached the NVVP file. NewSession22.nvvp (1.7 MB)

What is the formula for nppiSqrDistanceValid_Norm_8u32f_C1R_Ctx? Thanks in advance.

This was run with CUDA 11.2, but it seems a little slower than 10.2.
Hi mkhadatare, why is the newer version 11.2 slower than 10.2 (for example, in stream creation) in the Performance Primitives library?
My coworker in Israel also mentioned remembering strange behavior with 11…
Would you recommend a tool for analyzing the details?

Hi Jemma,
I highly recommend you use the latest tools for profiling: Nsight Systems and Nsight Compute.

Hi Jemma,

Here is the mathematical formulation for the normalized valid Euclidean distance that the function computes between an image and a template.

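The squared Euclidean distance Stx(r,c) between a template and an image for the pixel in row r and column c is given by the equation:

$$S_{tx}(r,c) = \sum_{j=0}^{tplRows-1} \sum_{i=0}^{tplCols-1} \left[ t(j,i) - x\left(r + j - \tfrac{tplRows}{2},\; c + i - \tfrac{tplCols}{2}\right) \right]^2$$

where x(r,c) is the image pixel value in row r and column c, and t(j,i) is the template pixel value in row j and column i; the template size is tplCols by tplRows and its center is positioned at (r,c).

Normalized SSD σtx(r,c):

$$\sigma_{tx}(r,c) = \frac{S_{tx}(r,c)}{\sqrt{R_{xx}(r,c)\, R_{tt}\left(\tfrac{tplRows}{2}, \tfrac{tplCols}{2}\right)}}$$

Here Rxx and Rtt denote the auto-correlation of the image and the template, respectively:

$$R_{xx}(r,c) = \sum_{j=0}^{tplRows-1} \sum_{i=0}^{tplCols-1} x\left(r + j - \tfrac{tplRows}{2},\; c + i - \tfrac{tplCols}{2}\right)^2, \qquad R_{tt}\left(\tfrac{tplRows}{2}, \tfrac{tplCols}{2}\right) = \sum_{j=0}^{tplRows-1} \sum_{i=0}^{tplCols-1} t(j,i)^2$$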

You can refer to this link for the formula -
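To make the formulation concrete, here is a minimal CPU reference sketch in plain C++ that can be used to sanity-check GPU results. It assumes the normalized form σ = S / sqrt(Rxx · Rtt) and assumes that in "valid" mode each output pixel corresponds to one placement where the template lies fully inside the image. The function name `sqrDistanceValidNormRef` is hypothetical and not part of NPP, and the zero-denominator handling is an assumption rather than documented NPP behavior.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference (not an NPP function): normalized squared
// distance over the "valid" region, where the template fits fully inside
// the image. Output is (srcW - tplW + 1) x (srcH - tplH + 1), row-major.
std::vector<float> sqrDistanceValidNormRef(const std::vector<std::uint8_t>& src,
                                           int srcW, int srcH,
                                           const std::vector<std::uint8_t>& tpl,
                                           int tplW, int tplH)
{
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH);
    assert(tpl.size() == static_cast<std::size_t>(tplW) * tplH);

    const int dstW = srcW - tplW + 1;
    const int dstH = srcH - tplH + 1;
    if (dstW <= 0 || dstH <= 0) return {};  // template larger than image

    // Rtt: sum of squared template pixels, the same for every output pixel.
    double rtt = 0.0;
    for (std::uint8_t t : tpl) rtt += static_cast<double>(t) * t;

    std::vector<float> dst(static_cast<std::size_t>(dstW) * dstH);
    for (int r = 0; r < dstH; ++r) {
        for (int c = 0; c < dstW; ++c) {
            double stx = 0.0;  // squared distance for this placement
            double rxx = 0.0;  // sum of squared image pixels under the template
            for (int j = 0; j < tplH; ++j) {
                for (int i = 0; i < tplW; ++i) {
                    const double x = src[static_cast<std::size_t>(r + j) * srcW + (c + i)];
                    const double t = tpl[static_cast<std::size_t>(j) * tplW + i];
                    stx += (t - x) * (t - x);
                    rxx += x * x;
                }
            }
            const double denom = std::sqrt(rxx * rtt);
            // Assumption: emit 0 when the denominator is 0.
            dst[static_cast<std::size_t>(r) * dstW + c] =
                denom > 0.0 ? static_cast<float>(stx / denom) : 0.0f;
        }
    }
    return dst;
}
```

For a W x H image and a w x h template, the result has (W - w + 1) x (H - h + 1) elements, so it can be compared element by element against the float buffer copied back from the device.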


Hi Jemma,
Use the attached code, which additionally passes the stream to the data transfers (H2D and D2H) as an improvement.

SampleTestNPP.cpp (5.8 KB)

An additional tip to improve performance: if the application is built with static linking to the specific NPP library it needs (nppist.a - Linux only), the CUDA runtime avoids unnecessarily loading the other libraries.

Dear mkhadatare,
If the image has only one pixel value, as in your attached SampleTestNPP.cpp, the Stx(r,c) value should be (2-1)^2 = 1 based on your equation, but running the modified SampleTestNPP.cpp with CUDA 11.2 prints “0.5” in the console window. Would you check and verify your formula again?
int host_input[input_size] = { 2};
int host_template[template_size] = { 1};
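As a worked check of this one-pixel case, assuming the function returns the normalized value σ = S / sqrt(Rxx · Rtt) rather than the raw squared distance, with image pixel x = 2 and template pixel t = 1:

```latex
S_{tx} = (1 - 2)^2 = 1, \qquad R_{xx} = 2^2 = 4, \qquad R_{tt} = 1^2 = 1

\sigma_{tx} = \frac{S_{tx}}{\sqrt{R_{xx}\, R_{tt}}} = \frac{1}{\sqrt{4 \cdot 1}} = 0.5
```

Under that assumption, 1 is the un-normalized squared distance and the printed 0.5 is consistent with the normalized form.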

Dear mkhadatare,
Thanks, it seems nppiSqrDistanceValid_Norm_8u32f_C1R_Ctx computes the normalized SSD, but calculating the square root is generally quite time consuming. Do you have a function that only calculates the squared Euclidean distance?

Dear Mnicely,
I downloaded Nsight Systems version 2020.3.2.6-87e152c Windows-x64, and it shows more details, thanks again.
Let me download and try Nsight Compute.

Sorry, I found I made the mistake of running 11.2 and 10.2 on different computers with different GPUs, so 11.2 on the 6 GB GPU computed more slowly than 10.2 on the 24 GB GPU, hahaha.


Please note that Nsight Compute is for analyzing performance of an individual kernel, specifically custom kernels. Don’t go down the rabbit-hole of profiling an NPP, or any NVIDIA library kernel.