How to access dwImageCuda pixel value and create 2D picture matrix?


I’m trying to access dwImageCuda pixel data to create a 2D matrix of the image (similar to the cv::Mat format) so that I can send it to v4l2loopback and stream it over the network.

First I tried to convert the dwImageCuda image to cv::Mat, but all I get is a strange black-and-white image. Example code:

cv::cuda::GpuMat d_src(img_cuda->prop.height, img_cuda->prop.width, CV_8UC4, img_cuda->dptr[0]);
cv::cuda::GpuMat d_dst;
cv::cuda::bilateralFilter(d_src, d_dst, -1, 50, 7);
cv::Mat dst;
d_dst.download(dst);
cv::imshow("test", dst); cv::waitKey(0);

Then I tried to access the pixel data through the dwImageCuda dptr and construct the matrix myself, but I couldn’t find a way to do that.
Am I doing something wrong? What is the correct way of doing this?

Thanks in advance!

Hi anja,
Just for your information, CUDA can store image data in pitch-linear format (dptr) or in block-linear format (cudaArray).
Can you please copy the data from the cudaArray, rather than from dptr, to host memory and check whether it actually contains data? Please refer to the appropriate CUDA memcpy function at

Have you solved this issue?

In the question, img_cuda->dptr[0] points to the first image plane only. All image planes are required to construct the color image properly.
It is also possible that the pixel values are of a floating-point type (between 0 and 1), in which case interpreting them directly as CV_8U will not yield a usable image.
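The point about floating-point pixel values can be sketched on the host side as follows. This is only an illustration, assuming the pixels are normalized floats in the range 0 to 1; the helper names `toU8` and `toU8Buffer` are made up for this example and are not part of DriveWorks or OpenCV.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: map a normalized float pixel in [0, 1] to 0..255.
// Out-of-range values are clamped instead of wrapping around.
inline std::uint8_t toU8(float v) {
    float const clamped = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(clamped * 255.0f + 0.5f); // round to nearest
}

// Convert a whole buffer of normalized floats to 8-bit values.
inline std::vector<std::uint8_t> toU8Buffer(std::vector<float> const &src) {
    std::vector<std::uint8_t> dst(src.size());
    std::transform(src.begin(), src.end(), dst.begin(),
                   [](float v) { return toU8(v); });
    return dst;
}
```

Wrapping the raw float buffer in a CV_8U matrix skips this scaling step, so the bytes of each float are shown as if they were separate pixels.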

Here is a simple example to copy data from dwImageCUDA into a cv::Mat or cv::cuda::GpuMat - pixel by pixel.

The format of the input dwImageCUDA must be known as the memory mapping depends on format.
The format for dwImageCUDA *cudaImage; can be found at cudaImage->prop.format.

The copy can be performed in the following way.
As an example, the format of the dwImageCUDA is assumed to be DW_IMAGE_FORMAT_RGB_FLOAT16_PLANAR, that is, a 3-channel RGB image with float16 pixel values in the range 0 to 1.

__device__ unsigned char __uc(__half const x){
	// scale [0, 1] to [0, 255] and clamp so out-of-range values do not wrap
	float const f = __half2float(x) * 255.0f;
	return (unsigned char) ((f > 255.0f) ? 255.0f : ((f < 0.0f) ? 0.0f : f));
}

__global__ void cudaRgb2MatBgr(
	void * const red, // input
	void * const green, // input
	void * const blue, // input
	size_t const pitch, // input: row pitch of each plane, assumed to be in bytes
	unsigned int const rows, // input
	unsigned int const cols, // input
	unsigned char * const data, // output
	size_t const step // input: row pitch of the output Mat in bytes
){
	// global thread ID (one thread per byte of the output Mat)
	size_t const id = (size_t)blockIdx.x * blockDim.x + threadIdx.x;

	// indices for mat
	size_t const rm = id / step; // row in Mat
	size_t const cm = id - (rm*step); // byte offset within the Mat row
	int const ch = cm % 3; // color channel in Mat (3 channels for BGR image)

	// index for cudaImage: same row as the Mat, one column per 3 Mat bytes
	size_t const cc = cm / 3; // column in cuda image

	// pointers to color planes
	__half* r16 = (__half*) red; // pointer to red image plane
	__half* g16 = (__half*) green; // pointer to green image plane
	__half* b16 = (__half*) blue; // pointer to blue image plane
	__half* bgr[3] = { b16, g16, r16 };

	// check if indices are within bounds (this also skips the row padding)
	if( (rm < rows) && (cm < cols*3) ){
		// step to row rm using the byte pitch, then index by column
		__half const * const row = (__half const *)((char const *)bgr[ch] + rm*pitch);
		data[rm*step + cm] = __uc(row[cc]); // copy data
	}
}

// -------------------------------------------------

int cudaImage2Mat(dwImageCUDA *cudaImage, cv::Mat &im){
	// dwImageCUDA * is input
	// cv::Mat is output
	auto const &imProp = cudaImage->prop; // image properties (width, height, format)

	// Create an empty cv::cuda::GpuMat with the dimensions of input dwImageCUDA
	cv::cuda::GpuMat gim(imProp.height, imProp.width, CV_8UC3, cv::Scalar(0));

	int blockSize, gridSize;

	// Number of threads in each thread block
	blockSize = 1024;

	// Number of thread blocks in grid (one thread per byte of the GpuMat)
	size_t n = gim.rows * gim.step;
	gridSize = (int)((n + blockSize - 1) / blockSize); // integer ceiling division

	// call the kernel
	cudaRgb2MatBgr<<<gridSize, blockSize>>>(
		cudaImage->dptr[0], // red plane
		cudaImage->dptr[1], // green plane
		cudaImage->dptr[2], // blue plane
		cudaImage->pitch[0], // pitch is usually same for all planes (but verify that)
		imProp.height, // image rows
		imProp.width, // image columns
		gim.data, // receive pixels values here
		gim.step // step is pitch for cv::cuda::GpuMat
	);

	cudaDeviceSynchronize();
	gim.download(im); // copy the data into cpu Mat (cv::Mat)

	return 0;
}

Note: float16 (half precision) support for CUDA is available through #include <cuda_fp16.h>
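To check the kernel's output, it can help to compare it against a plain C++ reference of the same planar-to-interleaved copy. The following is only a sketch under assumptions that are not stated in the thread: the planes hold `float` instead of `__half` (standard C++ has no half type), both pitches are in bytes, and the names `PlaneView`, `at`, `toByte` and `planarRgbToBgr8` are invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical view of one pitched image plane in host memory.
struct PlaneView {
    unsigned char const *base; // first byte of row 0
    std::size_t pitch;         // bytes between consecutive rows (may include padding)
};

// Read the float at (row, col) of a pitched plane.
inline float at(PlaneView const &p, std::size_t row, std::size_t col) {
    float v;
    std::memcpy(&v, p.base + row * p.pitch + col * sizeof(float), sizeof(float));
    return v;
}

// Scale a normalized value to 0..255 with clamping (truncates like the kernel).
inline std::uint8_t toByte(float v) {
    if (v <= 0.0f) return 0;
    if (v >= 1.0f) return 255;
    return static_cast<std::uint8_t>(v * 255.0f);
}

// Interleave three planes into a BGR byte image whose rows are `step` bytes apart.
inline void planarRgbToBgr8(PlaneView r, PlaneView g, PlaneView b,
                            std::size_t rows, std::size_t cols,
                            std::uint8_t *data, std::size_t step) {
    for (std::size_t y = 0; y < rows; ++y) {
        std::uint8_t *out = data + y * step; // start of output row y
        for (std::size_t x = 0; x < cols; ++x) {
            out[3 * x + 0] = toByte(at(b, y, x));
            out[3 * x + 1] = toByte(at(g, y, x));
            out[3 * x + 2] = toByte(at(r, y, x));
        }
    }
}
```

Rows are always addressed through the pitch and the step, never through `cols`, so padding at the end of a source or destination row is skipped rather than read as pixel data.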

Hi sumit.dey,

Many thanks for sharing!