Error when using NPP with OpenCV


I am loading and manipulating images using OpenCV and want to perform morphological erosion on them using Nvidia NPP, but I always get the following error:

error: (-217:Gpu API call) an illegal memory access was encountered in function ‘download’

The error states that it fails in function download, but when setting breakpoints it actually seems to happen when calling nppiErode_8u_C1R.
This is the error in the debugger:

No source available for “void ForEachPixelByte<unsigned char, 1, ErodeReplicateBorderFunctor<unsigned char, 1> >() at 0x55555f777928”

Also, the first call does not throw an error, but the second call does (in the second iteration of the loop).

I am loading the image in a cv::Mat and upload it into a cv::cuda::GpuMat.
Then I perform operations on it using some of OpenCV’s CUDA-functions and store the resulting image in another GpuMat. This runs in a loop (the part after the first four declarations):

stream_left = cv::cuda::Stream::Null();
cv::cuda::GpuMat src = cv::cuda::GpuMat(1242, 2208, CV_8UC1);
cv::cuda::GpuMat out = cv::cuda::GpuMat(1242, 2208, CV_8UC1);
cv::Mat src_host;

// load image into src_host and perform some OpenCV-CUDA-functions ...
// ...

src = mask.createMaskUsingCUDA(image_left_gpu, stream_left);

// using NPP
cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(5, 5));

cv::cuda::GpuMat kernel_gpu;

NppiSize roi;
roi.width = kernel_gpu.cols;
roi.height = kernel_gpu.rows;

NppiSize mask;
mask.width = kernel_gpu.cols;
mask.height = kernel_gpu.rows;

NppiPoint anchor;
anchor.x = ((kernel.cols - 1) / 2) + 1;
anchor.y = ((kernel.rows - 1) / 2) + 1;

nppiErode_8u_C1R(src.ptr<Npp8u>(), static_cast<int>(src.step), out.ptr<Npp8u>(), static_cast<int>(out.step), roi, kernel.ptr<Npp8u>(), mask, anchor);, stream_left);

I already tried different values for the step, but that does not seem to change anything.

I really appreciate any help.
Thank you!

I was able to solve the problem.
The reason for the error was the anchor point, which I thought was the center of the erosion-kernel but really is

X and Y offsets of the mask origin frame of reference w.r.t the source pixel

So I changed

NppiPoint anchor;
anchor.x = ((kernel.cols - 1) / 2) + 1;
anchor.y = ((kernel.rows - 1) / 2) + 1;


NppiPoint anchor;
anchor.x = 0;
anchor.y = 0;

Also, roi is not the erosion-kernel but the region of the image that will be eroded, so I changed that to match the dimensions of the whole image.
The static_cast for the step-argument did not work either, so I changed that too:
src.cols * sizeof(uint8_t)

This works now, but I have encountered another problem:
I wanted to replace the Erosion by Opening, using nppiMorphOpenBorder_8u_C1R.
This requires a buffer, and to get its size, one should call nppiMorphGetBufferSize_8u_C1R.
In my case this does not work and I get a Segmentation Fault when using the result of that function.
Manually setting the size-argument of cudaMalloc to a high number (e.g. 2000*2000) works, but the resulting image does not look as expected (as with OpenCV morphologyEx) and it is also much slower than on my CPU (i3-4150 vs. GTX 960 2GB).
My code is:

int* bufferSize;

// get buffer size for morphological operation
nppiMorphGetBufferSize_8u_C1R(roi, bufferSize);

// allocate device memory for buffer
unsigned char *buffer;
cudaMalloc((void **) &buffer, *bufferSize);

nppiMorphOpenBorder_8u_C1R(src.ptr<Npp8u>(), src.cols * sizeof(uint8_t), roi, offset, out.ptr<Npp8u>(), out.cols * sizeof(uint8_t),
		roi, kernel.ptr<Npp8u>(), mask, anchor, buffer, NPP_BORDER_REPLICATE);


Can someone tell me how to use the nppiMorphGetBufferSize_8u_C1R function?

In case someone with the same problem stumbles upon this thread, this seems to be the correct code:

		int bufferSize;

		// get buffer size for morphological operation
		nppiMorphGetBufferSize_8u_C1R(roi, &bufferSize);
		// allocate device memory for buffer
		unsigned char *buffer;
		cudaMalloc((void **) &buffer, bufferSize);