Passing data already on the GPU in a cv::cuda::GpuMat to a CUDA kernel


I have an OpenCV GpuMat created as:

cv::cuda::GpuMat d_im(h_im.size().height, h_im.size().width, CV_8UC1);

After running some image processing operations on it with the OpenCV CUDA module (cv::cuda), I tried, following the OpenCV documentation, to pass it
directly to my kernel function:

Kernel_func<<<grid_size, block_size, 0, stream>>>( d_im.ptr<uint8_t>(), output);

but I got wrong results.

However, the results are correct if I download the processed d_im from the GPU to the CPU and then copy it back to the GPU with cudaMemcpyAsync, as in the snippet below. I know this round trip is not a good way to do it.

CUDA_CHECK_RETURN(cudaMemcpyAsync(input, h_im_new.ptr<uint8_t>(), sizeof(uint8_t)*size, cudaMemcpyHostToDevice,stream1));

My __global__ kernel prototype is:

__global__ void Kernel_func(const uint8_t *input, const uint8_t *output);
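
One thing that may be relevant here (an assumption on my part, not something the code above confirms): a GpuMat is normally allocated with padded rows, so consecutive rows are d_im.step bytes apart rather than d_im.cols bytes, while the buffer filled by the cudaMemcpyAsync path is tightly packed. A kernel that takes only a raw pointer has no way to know the row stride. The host-only sketch below illustrates the difference between the two indexing schemes; pixel_at and make_pitched are hypothetical helpers written for this illustration, not OpenCV or CUDA functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: address of pixel (row, col) in a buffer whose rows are
// `step` bytes apart. For a GpuMat, `step` would correspond to d_im.step.
inline const uint8_t* pixel_at(const uint8_t* base, std::size_t step,
                               int row, int col) {
    return base + static_cast<std::size_t>(row) * step + col;
}

// Hypothetical helper: build a padded image on the host to mimic a pitched
// allocation. Each row holds `width` valid bytes followed by padding (0xFF).
// Pixel (r, c) stores the value r * width + c.
std::vector<uint8_t> make_pitched(int width, int height, std::size_t step) {
    std::vector<uint8_t> buf(step * static_cast<std::size_t>(height), 0xFF);
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            buf[static_cast<std::size_t>(r) * step + c] =
                static_cast<uint8_t>(r * width + c);
        }
    }
    return buf;
}
```

If this is the cause, the same arithmetic would apply inside the kernel: pass d_im.step as an extra argument and index with row * step + col instead of row * width + col. Checking d_im.isContinuous() before the launch would show whether the rows are padded at all.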

I am not sure what is wrong here. Has anyone had a similar issue, or does anyone have any suggestions? Thanks for your help.