I am trying to use nppiRotate_8u_C1R, but I am always getting NPP_STEP_ERROR.
I am 100% sure that the step I am using is not 0 and is not less than the width.
I first used cudaMalloc, and the step was equal to the width, then I tried nppiMalloc_8u_C1 and used the pitch returned from it, but I got the same NPP_STEP_ERROR error.
The destination image width and height are calculated to contain the whole rotated image, so the destination dimensions are not equal to the source dimensions. Could that be the problem?
Hi, can you please provide the following information so we can investigate your problem:
What version of NPP?
What OS?
What are the exact parameters you’re passing? (For the pointer parameters I’d like to know what parameters were used to allocate and if there was pointer arithmetic performed on those pointers prior to passing them to the primitive.)
I know the following is inefficient, but I wanted to find out where the problem is, so I used nppiMalloc. Since the pitch differs between device and host, I had to write this loop.
for (int i = 0; i < srcImageDevice.height; i++)
{
    // Copy one row at a time because the host and device pitches differ.
    cudaStatus = cudaMemcpy(dev_src + srcImageDevice.pitch * i, srcImage.data + srcImage.pitch * i, srcImage.width, cudaMemcpyHostToDevice);
}
Hi, I’ve added a reproducer test for your problem to our 4.1 test bench, but I’m not able to reproduce the behavior you’re describing. With the parameters you gave, I do get NPP_WRONG_INTERSECTION_QUAD_WARNING as the return value. That is because the transformed bounding box of our source rectangle is:
[6356.3800000000001, 12882.362253856812] x [15617.830890862922, 19989.742754279952]
The bottom of the BBox is at 12882 but the top of your destination image is at 10940, so there is no overlap and thus running the primitive wouldn’t change the result image at all.
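The overlap test described above can be sketched in plain C. This is only an illustration, not NPP's code: the `Box` type and `boxes_overlap` function are hypothetical names, the bounding box is read as two corners in the [x0, y0] x [x1, y1] layout, and the destination is assumed to start at (0, 0) with the 11208 x 10940 size mentioned in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Axis-aligned box given by two corners: (x0, y0) and (x1, y1). */
typedef struct { double x0, y0, x1, y1; } Box;

/* Two boxes overlap only if their x ranges and their y ranges both overlap. */
bool boxes_overlap(Box a, Box b)
{
    /* Precondition: corners are ordered. */
    assert(a.x0 <= a.x1 && a.y0 <= a.y1);
    assert(b.x0 <= b.x1 && b.y0 <= b.y1);
    return a.x0 < b.x1 && b.x0 < a.x1 &&
           a.y0 < b.y1 && b.y0 < a.y1;
}
```

With the bounding box quoted above, the y range starts at about 12882 while the assumed destination ends at 10940, so the function reports no overlap.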
I did tests with a destination image width of 11205 and 11208. I wasn’t sure if maybe you had a typo in the allocation size, since you’re specifying the larger 11208 as the destination rectangle’s width.
Is there any chance that you’re not passing the values you think you’re passing? I did step through the code and the only places I found that would return the NPP_STEP_ERROR are checking if the stride is smaller than one ROI line worth of data or if the stride is negative.
Regarding the memcpy loop, the CUDA runtime has a 2D memcpy function, cudaMemcpy2D, that lets you specify two different strides (see the CUDA Toolkit Documentation).
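To make the two-stride idea concrete, here is a host-only sketch of what a pitched 2D copy does. `copy_pitched` is a hypothetical helper written for illustration; it mirrors the argument order of `cudaMemcpy2D` (dst, dpitch, src, spitch, width in bytes, height) but runs entirely on the CPU and takes no copy-kind argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `height` rows of `width_bytes` bytes each. Source rows start
   `spitch` bytes apart, destination rows start `dpitch` bytes apart. */
void copy_pitched(unsigned char *dst, size_t dpitch,
                  const unsigned char *src, size_t spitch,
                  size_t width_bytes, size_t height)
{
    /* A row must fit inside both pitches. */
    assert(width_bytes <= dpitch && width_bytes <= spitch);
    for (size_t row = 0; row < height; ++row)
        memcpy(dst + row * dpitch, src + row * spitch, width_bytes);
}
```

On the device side, the row-by-row loop from the earlier post would presumably collapse into a single call along the lines of cudaMemcpy2D(dev_src, srcImageDevice.pitch, srcImage.data, srcImage.pitch, srcImage.width, srcImageDevice.height, cudaMemcpyHostToDevice), using the variable names from that post.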
Thank you for pointing out the intersection problem. I corrected how I choose the angle and shift.
Also thank you for telling me about the 2d memcpy function, it is much much faster.
Actually, the problem is still there, but if I make the destination the same size as the source, I get a correctly rotated and shifted image without errors or warnings.
I tested this in our 4.1 tree. This tree should have the exact same code as what was shipped. There are some minor differences in the build process on my local machine, where I conducted the test, vs the machines that build the final releases. I’d say the chances of this being the reason for not replicating are small.
What are the new angle and offsets you’re using? I would like to try to reproduce using those.
Also: Have you been using other NPP functions without problems or is Rotate the only NPP function you’re using so far? I’m asking because that could give us an indication of maybe something else being the issue.
There is an important update: when I compile the exact same code using CUDA toolkit 4.0, the problem disappears and it works perfectly, while compiling with CUDA toolkit 4.1 always gives the return code -7.
Any suggestions will be much appreciated. Thanks in advance.
The different behavior between 4.0 and 4.1 does not surprise me. The Rotate primitives were essentially rewritten between those two releases. The 4.0 implementation (though it may not have had the issue you’re experiencing now) had all kinds of problems.
Regarding repro of your issue: I overlooked the angle you provided previously. After plugging that into the reproducer test I’m now seeing the -7 return. In other words your problem reproduces. I’m investigating now.
This is indeed a bug in the 4.1 implementation. We’re erroneously checking if the line stride of the source image could accommodate the destination ROI. This slipped by us because we were reusing the error checking code from point-wise filters (e.g. AddC).
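The mistake can be modeled in a few lines of C. This is a hypothetical sketch of the described behavior, not the actual NPP source: the function names and status constants are made up for illustration (-7 is the return value reported in this thread), and one byte per pixel is assumed as for 8u_C1.

```c
#include <assert.h>

/* Hypothetical status codes for this sketch only. */
enum { STATUS_OK = 0, STATUS_STEP_ERROR = -7 };

/* A step is rejected if it is negative or shorter than one ROI line. */
int check_step(int step_bytes, int roi_width, int bytes_per_pixel)
{
    assert(roi_width >= 0 && bytes_per_pixel > 0);
    if (step_bytes < 0 || step_bytes < roi_width * bytes_per_pixel)
        return STATUS_STEP_ERROR;
    return STATUS_OK;
}

/* Behavior as described: the source step is tested against the
   destination ROI width, which is only right for point-wise filters. */
int check_rotate_steps_buggy(int src_step, int dst_step,
                             int src_roi_width, int dst_roi_width)
{
    (void)src_roi_width; /* never consulted, which is the bug */
    if (check_step(src_step, dst_roi_width, 1) != STATUS_OK)
        return STATUS_STEP_ERROR;
    return check_step(dst_step, dst_roi_width, 1);
}

/* Intended behavior: each step is tested against its own ROI width. */
int check_rotate_steps_fixed(int src_step, int dst_step,
                             int src_roi_width, int dst_roi_width)
{
    if (check_step(src_step, src_roi_width, 1) != STATUS_OK)
        return STATUS_STEP_ERROR;
    return check_step(dst_step, dst_roi_width, 1);
}
```

In this model, a source that is narrower than the destination fails the buggy check even though both steps are valid for their own images, which matches the symptom that same-size images work.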
This is a simple fix and I’m currently investigating if we’ll be able to get it into the next release, but it might be too late for that. The code for that release has been frozen for a while now. It is definitely going to be fixed in the release after that.
If the fix does not make it into the next release, you would have to work around this issue. The workaround is essentially what you have already discovered, i.e. to use a destination image that is not larger than the source. Probably the easiest workaround to implement would be to allocate the source image at the same width as the destination image. It will not matter if only a smaller part of each line is being used. So basically you want a larger line stride (step), but you don’t have to really make use of the additional storage.
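A host-only sketch of that workaround follows. The `Image8u` type and `alloc_padded` function are hypothetical and use `calloc` in place of a device allocation. On the GPU, the equivalent would presumably be to pass the destination width to nppiMalloc_8u_C1 when allocating the source and to use the step it returns.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* 8-bit single-channel image with an explicit line step in bytes. */
typedef struct { unsigned char *data; int step; int width; int height; } Image8u;

/* Allocate an image whose lines are at least `min_line_bytes` long,
   even though only `width` pixels of each line hold image data. */
Image8u alloc_padded(int width, int height, int min_line_bytes)
{
    assert(width > 0 && height > 0);
    Image8u img;
    img.width = width;
    img.height = height;
    img.step = width > min_line_bytes ? width : min_line_bytes;
    img.data = (unsigned char *)calloc((size_t)img.step * (size_t)height, 1);
    return img; /* caller checks img.data for NULL and frees it */
}
```

The source ROI passed to the rotate call would still use the real image width. Only the step grows, so the extra bytes at the end of each line are never read as image content.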