NPP WarpPerspective with 3 channels problem?


I am facing a problem with the NPP function "nppiWarpPerspective_8u_C3R":
NppStatus nppiWarpPerspective_8u_C3R (const Npp8u * pSrc, NppiSize srcSize, int nSrcStep, NppiRect srcRoi,
                                      Npp8u * pDst, int nDstStep, NppiRect dstRoi,
                                      const double coeffs[3][3], int interpolation)

I am using CUDA Toolkit 4.0 on Windows 7 32-bit.
My development environment is Visual Studio 2008.

The idea is to use the nppiWarpPerspective_8u_C3R function to apply perspective transformations to color images (RGB, 3 channels).
The same function for greyscale images works very well.
I compute the coefficients with OpenCV and pass them directly to the NPP function.
The code is attached below.
My question is:
Is there a bug in the 3-channel version of WarpPerspective, or is something wrong with my memory allocation?

Thank you very much for your help!

ImageOpen.cpp (3.25 KB)

WarpPerspectiveGPU.cpp (2.05 KB)

Can you provide some information on how the output differs from what you expect? An image showing the wrong output would be helpful.


Thank you very much for your fast reply.

In the end I discovered where the problem was.
The problem was not with the NPP function. It was my data copying.
I didn't know that the width passed to “cudaMemcpy2D” must be given in bytes, not as a number of pixels (the height is still given in rows). This matters especially for color images, where each row holds 3 bytes per pixel.
I only found this out after reading the description of the “cudaMemcpy2D” function in the Programming Guide.
Now everything works OK.
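
For anyone hitting the same mistake, here is a minimal host-only sketch of the pitched-copy semantics described above. It does not call CUDA at all: copy2D, rowBytes and bytesCopied are hypothetical helpers written for illustration, and copy2D only mirrors the argument order of cudaMemcpy2D (pitches and width in bytes, height in rows). It shows that passing the width in pixels for a 3-channel 8-bit image transfers only a third of each row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only stand-in for a pitched 2D copy (hypothetical helper, not a CUDA call).
// Argument order mirrors cudaMemcpy2D: pitches and width in BYTES, height in ROWS.
void copy2D(void* dst, std::size_t dstPitch,
            const void* src, std::size_t srcPitch,
            std::size_t widthBytes, std::size_t heightRows)
{
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < heightRows; ++row) {
        std::memcpy(d + row * dstPitch, s + row * srcPitch, widthBytes);
    }
}

// Bytes per row of an interleaved 8-bit image (no padding assumed).
std::size_t rowBytes(std::size_t widthPixels, std::size_t channels)
{
    return widthPixels * channels * sizeof(unsigned char);
}

// Counts how many bytes reach the destination when widthArg is passed as the width.
std::size_t bytesCopied(std::size_t widthPixels, std::size_t heightRows,
                        std::size_t channels, std::size_t widthArg)
{
    const std::size_t pitch = rowBytes(widthPixels, channels);
    std::vector<unsigned char> src(pitch * heightRows, 255);  // source filled with 255
    std::vector<unsigned char> dst(pitch * heightRows, 0);    // destination zeroed
    copy2D(dst.data(), pitch, src.data(), pitch, widthArg, heightRows);
    std::size_t n = 0;
    for (std::size_t i = 0; i < dst.size(); ++i) {
        if (dst[i] == 255) ++n;
    }
    return n;
}
```

With the real API the corresponding call would look like cudaMemcpy2D(dst, dstPitch, src, srcPitch, width * 3 * sizeof(Npp8u), height, cudaMemcpyHostToDevice), where the variable names are placeholders for your own buffers and sizes.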
This test also showed me that copying memory to and from the host is not very fast.
It takes almost 5 ms (4.6 ms) each way for one large image such as 1280x1024 (3 channels).
The “WarpPerspective” function itself is very fast, around 2 ms. So in total: copy to device + WarpPerspective + copy to host = about 12 ms.
Compared with the OpenCV library's version of the same function, the GPU (GeForce GTX 560 Ti) is only 3 times faster.
The bottleneck is (I think) the host memory.
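
To put the copy time in perspective, the figures above can be turned into an effective transfer rate. This is only arithmetic on the numbers in this post (1280x1024, 3 channels, 4.6 ms per copy); imageBytes and throughputMBps are hypothetical helper names, and 1 MB is taken as 10^6 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of an interleaved 8-bit image (no row padding assumed).
std::size_t imageBytes(std::size_t width, std::size_t height, std::size_t channels)
{
    return width * height * channels;
}

// Effective throughput in MB/s (1 MB = 1e6 bytes) for a copy that takes `ms` milliseconds.
double throughputMBps(std::size_t bytes, double ms)
{
    return (static_cast<double>(bytes) / 1.0e6) / (ms / 1000.0);
}
```

For example, imageBytes(1280, 1024, 3) gives 3932160 bytes, and throughputMBps(3932160, 4.6) comes out at roughly 855 MB/s in each direction.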

Maybe someone will need this information in the future :).

I still have some questions. Is it normal for the data copying to take so much time?
Is copying data between host and device faster on other GPU boards, such as Tesla?

I ask because in the future we will need higher image resolutions.
If copying the data takes too much time, the GPU implementation of an image processing algorithm will not be faster than an optimized CPU implementation.


Yes, the cost for copying is substantial and there is little point in trying to replace a single CPU function like Warp with a sequence like:


Copy Host to Device

Warp on Device

Copy Device to Host

There are generally two approaches that achieve useful speedups in practice:


Perform many operations on the device (i.e. the GPU) without transferring data back and forth. That amortizes the host-to-device and device-to-host data-transfer costs.

For streaming data (e.g. video), hide the data-transfer cost via asynchronous memory copies. If the GPU processing time is about the same as the data-transfer time, the data-transfer cost can be hidden.
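
As a rough illustration of the second approach, the sketch below models the per-frame cost with and without overlap. It is a simplified timing model, not CUDA code: the function names are hypothetical, and it assumes perfect overlap in steady state. In real code the overlap would typically be built from cudaMemcpyAsync, streams and page-locked host memory, and whether uploads and downloads can run at the same time depends on the hardware, which is what the dualCopyEngines flag stands for.

```cpp
#include <algorithm>
#include <cassert>

// Per-frame cost when upload, kernel and download run back to back.
double serialMsPerFrame(double h2dMs, double kernelMs, double d2hMs)
{
    return h2dMs + kernelMs + d2hMs;
}

// Steady-state per-frame cost when stages of different frames overlap perfectly.
// dualCopyEngines == true : uploads and downloads can run concurrently.
// dualCopyEngines == false: both copy directions share one engine and serialize.
double overlappedMsPerFrame(double h2dMs, double kernelMs, double d2hMs,
                            bool dualCopyEngines)
{
    const double copyMs = dualCopyEngines ? std::max(h2dMs, d2hMs) : h2dMs + d2hMs;
    return std::max(copyMs, kernelMs);
}
```

With the timings reported earlier in this thread (4.6 ms per copy, about 2 ms for the warp), the model gives 11.2 ms per frame serially, 9.2 ms with one shared copy engine, and 4.6 ms if both copy directions can overlap. In every case the transfers, not the kernel, set the frame rate, which is why the first approach (doing more work per transfer) matters as well.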