CUDA parallel execution of NPP functions

Hi all,

Is it possible to execute multiple NPP function calls on the GPU in parallel?

I’m using the function

NppStatus nppiNormDiff_L2_8u_C1R(const Npp8u *pSrc1,
                                 int nSrcStep1,
                                 const Npp8u *pSrc2,
                                 int nSrcStep2,
                                 NppiSize oSizeROI,
                                 Npp64f *pRetVal)

on an image and an image patch. The problem is that the matrices are very small (16x16), so I can't make much use of the GPU's performance (in fact, it's slower than on the CPU). The function gets called a lot, because I need to compare the norms of about 200x3200 patches. So I would like the nppiNormDiff function to be called multiple times in parallel. Is that possible?
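To make the workload concrete, here is a minimal CPU sketch of what I understand each call to compute (the square root of the summed squared pixel differences over the ROI), plus the loop over candidate patches that I would like to run in parallel. The names norm_diff_l2_8u and norm_diff_l2_batch and the arrays of top-left offsets are made up for illustration; they are not part of NPP, and the layout is only an assumption about how the patches are addressed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical CPU reference (not an NPP function): L2 norm of the
// difference of two 8-bit single-channel regions, sqrt(sum((a - b)^2)).
// step1 / step2 are row pitches in bytes, like nSrcStep1 / nSrcStep2.
double norm_diff_l2_8u(const std::uint8_t *src1, int step1,
                       const std::uint8_t *src2, int step2,
                       int width, int height)
{
    assert(width > 0 && height > 0);
    double sum = 0.0;
    for (int y = 0; y < height; ++y) {
        const std::uint8_t *row1 = src1 + static_cast<std::size_t>(y) * step1;
        const std::uint8_t *row2 = src2 + static_cast<std::size_t>(y) * step2;
        for (int x = 0; x < width; ++x) {
            const double d = static_cast<double>(row1[x]) -
                             static_cast<double>(row2[x]);
            sum += d * d;
        }
    }
    return std::sqrt(sum);
}

// Hypothetical batch loop: compares one reference patch against `count`
// candidate patches whose top-left corners are (xs[i], ys[i]) in `image`.
// Every iteration is independent, so this is the loop to parallelise.
void norm_diff_l2_batch(const std::uint8_t *image, int image_step,
                        const int *xs, const int *ys, int count,
                        const std::uint8_t *patch, int patch_step,
                        int width, int height, double *out)
{
    for (int i = 0; i < count; ++i) {
        const std::uint8_t *candidate =
            image + static_cast<std::size_t>(ys[i]) * image_step + xs[i];
        out[i] = norm_diff_l2_8u(candidate, image_step,
                                 patch, patch_step, width, height);
    }
}
```

In my case width and height would both be 16 and count would be very large. Since no iteration depends on another, I am hoping the individual calls can either overlap on the GPU or be replaced by something batched.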

Cheers, Andreas