nppiRotate_8u_C1R and NPP_STEP_ERROR

Hello,

I am trying to use nppiRotate_8u_C1R, but I am always getting NPP_STEP_ERROR.
I am 100% sure that the step I am using is not 0 and is not less than the width.
I first used cudaMalloc, and the step was equal to the width, then I tried nppiMalloc_8u_C1 and used the pitch returned from it, but I got the same NPP_STEP_ERROR error.
The destination image width and height is calculated to contain the whole rotated image, I mean the destination dimensions is not equal to the source dimensions, could it be the problem?

Thanks for advance.

Hi, can you please provide the following information so we can investigate your problem:

  1. What version on NPP?
  2. What OS?
  3. What are the exact parameters you’re passing? (For the pointer parameters I’d like to know what parameters were used to allocate and if there was pointer arithmetic performed on those pointers prior to passing them to the primitive.)

Thanks,

–Frank

Thank you for your reply.

I am using cuda toolkit 4.1, and compiling 64bit application.

OS: Windows 7 64 bit.

dev_src = nppiMalloc_8u_C1(srcImageDevice.width=8984, srcImageDevice.height=6732, &srcImageDevice.pitch);

resulting pitch = 9216

dev_dst = nppiMalloc_8u_C1(dstImageDevice.width=11205, dstImageDevice.height=10940, &dstImageDevice.pitch);

resulting pitch = 11264

I know the following is inefficient, but I wanted to know where is the problem, so I used nppiMalloc. Since I have different pitch for device and host I had to make this loop.

for(int i = 0; i < srcImageDevice.height; i++)

{

	cudaStatus = cudaMemcpy(dev_src+srcImageDevice.pitch*i, srcImage.data+srcImage.pitch*i, srcImage.width, cudaMemcpyHostToDevice);

}
angle = 2.4403193601384716;

nppStatus = nppiRotate_8u_C1R ((Npp8u*) dev_src, srcSize=(8984,6732), srcImageDevice.pitch=9216, srcRect=(0,0,8984,6732),

 (Npp8u*) dev_dst, dstImageDevice.pitch=11264, dstRect=(0,0,11208,10940), 180*angle/M_PI, shiftx=6356.38, shifty=13264.847, NPPI_INTER_NN);

nppStatus returns -7.

Any help is much appreciated, thanks for advance.

Hi, I’ve added a reproducer test for your problem to our 4.1 test bench but I’m not able to reproduce the behavior you’re describing. With the paramters as given by you I do get a NPP_WRONG_INTERSECTION_QUAD_WARNING as the return value. That is because the transformed bounding-box of our source rectangle is:

[6356.3800000000001, 12882.362253856812] x [15617.830890862922, 19989.742754279952]

The bottom of the BBox is at 12882 but the top of your destination image is at 10940, so there is no overlap and thus running the primitive wouldn’t change the result image at all.

I did tests with a destination image width of 11205 and 11208. I wasn’t sure if maybe you had a typo in the allocation size, since your specifying the larger 11208 as the destination rectangle’s width.

Is there any chance that you’re not passing the values you think you’re passing? I did step through the code and the only places I found that would return the NPP_STEP_ERROR are checking if the stride is smaller than one ROI line worth of data or if the stride is negative.

Regarding the memcpy loop, the CUDA runtime has a 2D memcopy function that allows specification of two different strides: http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/online/group__CUDART__MEMORY_g17f3a55e8c9aef5f90b67cdf22851375.html

Thank you for your reply.

11205 was a typo.

Thank you for pointing to the intersection problem, I corrected how to choose the angle and shift.

Also thank you for telling me about the 2d memcpy function, it is much much faster.

Actually the problem is still there, but if I made the destination the same size as source, I get correctly rotated and shifted image without errors or warnings.

So, this works:

nppStatus = nppiRotate_8u_C1R ((Npp8u*) srcImageDevice.data, srcSize=(8984, 6732), srcImageDevice.pitch=9216, srcRect=(0,0,8984, 6732),

										 (Npp8u*) dstImageDevice.data, dstImageDevice.pitch=9216, dstRect=(0,0,8984, 6732),

										 angleDegrees=-139.82, shiftx=10095.696649759473, shifty=3039.4975383332549, NPPI_INTER_NN);

But, this returns the usual -7:

nppStatus = nppiRotate_8u_C1R ((Npp8u*) srcImageDevice.data, srcSize=(8984, 6732), srcImageDevice.pitch=9216, srcRect=(0,0,8984, 6732),

										 (Npp8u*) dstImageDevice.data, dstImageDevice.pitch=11264, dstRect=(0,0,11208, 10940),

										 angleDegrees=-139.82, shiftx=13414.770257552327, shifty=6033.5459020665076, NPPI_INTER_NN);

Note that in the later example I choose the width and height that should contain the whole image rotated around the center.

I made a break point and copied these values so I am pretty sure of these values.

In the output of the debug window, I get this dll loaded:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\bin\npp64_41_28.dll

Is not this the same version that you made your tests on?

Hi,

I tested this in our 4.1 tree. This tree should have the exact same code as what was shipped. There are some minor differences in the build process on my local machine where I conducted the test vs the machines the build the final releases. I’d say the chances of this being the reason for not replicating are small.

What are the new angle and offsets your using? I would like to try to reproduce using those.

Also: Have you been using other NPP functions without problems or is Rotate the only NPP function you’re using so far? I’m asking because that could give us an indication of maybe something else being the issue.

Thank you for your reply.

The angle and offsets I used are as in my previous post.

This works:

nppStatus = nppiRotate_8u_C1R ((Npp8u*) srcImageDevice.data, srcSize=(8984, 6732), srcImageDevice.pitch=9216, srcRect=(0,0,8984, 6732),

                                                                                 (Npp8u*) dstImageDevice.data, dstImageDevice.pitch=9216, dstRect=(0,0,8984, 6732),

                                                                                 angleDegrees=-139.82, shiftx=10095.696649759473, shifty=3039.4975383332549, NPPI_INTER_NN);

But, this returns -7:

nppStatus = nppiRotate_8u_C1R ((Npp8u*) srcImageDevice.data, srcSize=(8984, 6732), srcImageDevice.pitch=9216, srcRect=(0,0,8984, 6732),

                                                                                 (Npp8u*) dstImageDevice.data, dstImageDevice.pitch=11264, dstRect=(0,0,11208, 10940),

                                                                                 angleDegrees=-139.82, shiftx=13414.770257552327, shifty=6033.5459020665076, NPPI_INTER_NN);

The only other nppi function I used beside nppiRotate_8u_C1R is nppiSet_8u_C1R like this:

nppStatus = nppiSet_8u_C1R (0, (Npp8u*) dstImageDevice.data, dstImageDevice.pitch, dstSize);

This function succeeded in both mentioned cases.

Also I forgot to mention I am using Visual Studio 2010.

There is an important update, that when compiling the same exact code using cuda toolkit 4.0, the problem disappears and it works perfectly, while compiling using cuda toolkit 4.1 always gives the return code -7.

Any suggestions will be much appreciated, thanks for advance.

The different behavior between 4.0 and 4.1 does not surprise me. The Rotate primitives were essentially rewritten between those two releases. The 4.0 implementation (though it may not have had the issue you’re experiencing now) had all kinds of problems.

Regarding repro of your issue: I overlooked the angle you provided previously. After plugging that into the reproducer test I’m now seeing the -7 return. In other words your problem reproduces. I’m investigating now.

This is indeed a bug in the 4.1 implementation. We’re erroneously checking if the line stride of the source image could accommodate the destination ROI. This slipped by us because we were reusing the error checking code from point-wise filters (e.g. AddC).

This is a simple fix and I’m currently investigating if we’ll be able to get it into the next release, but it might be too late for that. The code for that release has been frozen for a while now. It is definitely going to be fixed in the release after that.

If the fix does not make it into the next release, you would have to work around this issue. The workaround is essentially what you have already discovered, i.e. to use a destination image that is not larger than the source. Probably the easiest to implement work-around would be to allocate the source image at the same width as the destination image. It will not matter, if only a smaller part of the lines is being used. So it’s basically that you want a larger line-stride (step) but you don’t have to really make use of the additional storage.

Thank you very much, the work around is working.