Pointer to pointer in device memory?


I would like to do some image processing on N images.

My plan is to run N threads that share the image data.

So I tried to use a pointer to pointers as the image array to manage the N images.

cudaArray ** imgs_device;

img_mem_size = width * height * sizeof(unsigned char);

// Memory allocation... 

CUDA_SAFE_CALL(cudaMalloc((void**)&imgs_device, sizeof(cudaArray*)*N));

chanDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);

for(i=0; i<N; i++) 

	CUDA_SAFE_CALL(cudaMallocArray(&imgs_device[i], &chanDesc, width, height) );

// Do Texture binding..

Then I pass the pointer to pointers to my kernel.


The code compiles without errors.

But I’m not sure this is the right way to use a pointer to pointers.

Does anybody know how to use a pointer to pointers?

Thanks in advance!

Your code will segfault or, worse, run without crashing and produce seemingly random results. You allocated imgs_device on the device and then attempt to write through that device pointer from the host: cudaMallocArray() stores its result at &imgs_device[i], which is a device address the host cannot dereference. If you truly want pointers to pointers, you need to allocate the list of pointers on both the host and the device, allocate each element into the host list, and then copy the host list to the device. Search the forums for some example code.
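A minimal sketch of that host-list plus device-list pattern, assuming plain linear buffers from cudaMalloc() rather than cudaArrays. The names invertImages, invertOnDevice and CHECK are hypothetical, chosen only for this illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t e_ = (call);                                         \
        if (e_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Hypothetical kernel: block b inverts image b, found through the pointer table.
__global__ void invertImages(unsigned char **imgs, int pixelsPerImage)
{
    unsigned char *img = imgs[blockIdx.x];  // the table is read on the device
    for (int p = threadIdx.x; p < pixelsPerImage; p += blockDim.x)
        img[p] = 255 - img[p];
}

// Inverts n images of pixelsPerImage bytes each, stored back to back in data.
std::vector<unsigned char> invertOnDevice(const std::vector<unsigned char> &data,
                                          int n, int pixelsPerImage)
{
    // Host list: cudaMalloc writes each device address into host memory.
    std::vector<unsigned char *> hostTable(n, nullptr);
    for (int i = 0; i < n; ++i) {
        CHECK(cudaMalloc((void **)&hostTable[i], pixelsPerImage));
        CHECK(cudaMemcpy(hostTable[i], data.data() + (size_t)i * pixelsPerImage,
                         pixelsPerImage, cudaMemcpyHostToDevice));
    }

    // Device list: same addresses, copied over so kernels can index them.
    unsigned char **deviceTable = nullptr;
    CHECK(cudaMalloc((void **)&deviceTable, n * sizeof(unsigned char *)));
    CHECK(cudaMemcpy(deviceTable, hostTable.data(), n * sizeof(unsigned char *),
                     cudaMemcpyHostToDevice));

    invertImages<<<n, 128>>>(deviceTable, pixelsPerImage);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // The host list is still needed to copy back and free each buffer.
    std::vector<unsigned char> result(data.size());
    for (int i = 0; i < n; ++i) {
        CHECK(cudaMemcpy(result.data() + (size_t)i * pixelsPerImage, hostTable[i],
                         pixelsPerImage, cudaMemcpyDeviceToHost));
        CHECK(cudaFree(hostTable[i]));
    }
    CHECK(cudaFree(deviceTable));
    return result;
}

int main()
{
    const int n = 4, pixels = 64 * 48;
    std::vector<unsigned char> data((size_t)n * pixels, 10);
    std::vector<unsigned char> out = invertOnDevice(data, n, pixels);
    printf("first pixel after inversion: %d\n", out[0]);
    return 0;
}
```

The host keeps its own copy of the pointer list because the device-side list cannot be dereferenced from host code, which is exactly what went wrong in the original snippet.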

As a side issue, what do you expect to do with a list of cudaArrays on the device? They can only be read when bound to a texture and only written to by host calls, so I don’t see any need to keep a list of them on the device.

Hi, thanks for your reply.

I figured out how to use the pointer to pointers.

The code should look like this:

cudaArray ** imgs_device;

img_mem_size = width * height * sizeof(unsigned char);

// Memory allocation... 

CUDA_SAFE_CALL(cudaMalloc((void**)&imgs_device, sizeof(cudaArray*)*N));

chanDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);

// Store the pointers in a host array.

cudaArray* pointers[N];

for(i=0; i<N; i++) 

	CUDA_SAFE_CALL(cudaMallocArray(&pointers[i], &chanDesc, width, height) );

// Then copy the host array of pointers, which refer to memory on the device, into the device-side list.

CUDA_SAFE_CALL(cudaMemcpy(imgs_device, pointers, sizeof(cudaArray*)*N, cudaMemcpyHostToDevice));

This way, the code seems to work fine.

I’d like to do some image processing on multiple images at once by using multiple threads. So I need to store the images in cudaArrays and access them through textures in my kernel.

Is there a better way to do this?

Textures are bound by host calls, so I repeat my point that there is no purpose in storing a list of cudaArray objects in device memory.

If you are running multiple host threads, be aware that each has a separate CUDA context and you cannot share a cudaArray allocated in one host thread with another.

I would write the handling of the image lists in C++. Just use your favorite template container, e.g. std::vector or std::list. Pass a single cudaArray into the kernel driver function defined in the .cu file.
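A rough sketch of that layout: a std::vector of cudaArray handles lives on the host, and a driver function receives one cudaArray at a time. Note the assumption: this uses the texture object API (cudaCreateTextureObject), which requires CUDA 5.0 or later and compute capability 3.0 or later and is newer than the bind-a-texture-reference calls discussed in this thread. The names copyFromTexture, processImage, firstPixels and CHECK are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t e_ = (call);                                         \
        if (e_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Hypothetical kernel: reads one image through a texture, writes it to linear memory.
__global__ void copyFromTexture(cudaTextureObject_t tex, unsigned char *out,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D<unsigned char>(tex, x + 0.5f, y + 0.5f);
}

// Kernel driver: takes a single cudaArray, as suggested above.
void processImage(cudaArray_t image, unsigned char *d_out, int width, int height)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = image;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;      // no interpolation
    td.readMode = cudaReadModeElementType;    // return raw unsigned char values

    cudaTextureObject_t tex = 0;
    CHECK(cudaCreateTextureObject(&tex, &res, &td, nullptr));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    copyFromTexture<<<grid, block>>>(tex, d_out, width, height);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaDestroyTextureObject(tex));
}

// Host-side image list: image i is filled with the value i + 1, processed,
// and the first pixel read back from the output buffer is collected.
std::vector<int> firstPixels(int n, int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    std::vector<cudaArray_t> images(n);
    for (int i = 0; i < n; ++i) {
        std::vector<unsigned char> h((size_t)width * height, (unsigned char)(i + 1));
        CHECK(cudaMallocArray(&images[i], &desc, width, height));
        CHECK(cudaMemcpy2DToArray(images[i], 0, 0, h.data(), width, width, height,
                                  cudaMemcpyHostToDevice));
    }

    unsigned char *d_out = nullptr;
    CHECK(cudaMalloc((void **)&d_out, (size_t)width * height));

    std::vector<int> firsts;
    for (cudaArray_t img : images) {
        processImage(img, d_out, width, height);
        unsigned char first = 0;
        CHECK(cudaMemcpy(&first, d_out, 1, cudaMemcpyDeviceToHost));
        firsts.push_back(first);
        CHECK(cudaFreeArray(img));
    }
    CHECK(cudaFree(d_out));
    return firsts;
}

int main()
{
    std::vector<int> firsts = firstPixels(3, 64, 48);
    for (size_t i = 0; i < firsts.size(); ++i)
        printf("image %zu: first pixel = %d\n", i, firsts[i]);
    return 0;
}
```

The list of images never leaves host memory here; only one texture handle and one output pointer are passed to each kernel launch.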

Hi, MisterAnderson42.

OK, let me explain a little more about what I want to do.

Suppose I have N images and want to detect the contours in each of them.

What I have in mind is detecting contours in the N images simultaneously, using N threads on the device. I wrote two __global__ kernels: kernel_1 does the contour detection and kernel_2 does further work with the detected contours.

My program will look something like this:

cudaArray** contours;

// memory allocation


// contour detection using CUDA

kernel_1 <<<grid_D, block_D>>> (contours); 

// Do another job with the detected contours

kernel_2 <<<grid_D2, block_D2>>>(contours);

To run kernel_1, I set the grid dimension to 1 and the block dimension to N (the number of images).

So N threads run on the device and each thread accesses its own image.

Since the detected contours will be used in kernel_2, I decided to store the contour information in global memory (for now) so that the second kernel can reuse it. I think that detecting the contours on the device and keeping them there saves the time otherwise needed to copy contour information from host to device.

The list of cudaArrays is for storing the contour information, and I need it so that I can pass the contours to kernel_2.

Since I do not know the size of the contour information in advance, I allocate a fixed amount of memory for each contour (the maximum size, I mean).

Do you think this is the right way?

Your grid and block dimensions will vastly underutilize the GPU hardware. A block runs on a single multiprocessor, so a grid size of 1 will only use 1 of the multiprocessors on the GPU (the 8800 GTX has 16 and the GTX 280 has 30 multiprocessors). In addition, a multiprocessor executes threads in groups of 32, so you want the number of threads per block to be a multiple of 32.

I’m not familiar with contour detection algorithms, but if the number of images is greater than 30, I would suggest you give each block one image to work with. The threads in each block can then cooperate to find the contours, possibly communicating intermediate results through shared memory. The final results can be written out to global memory, as you suggest.

Storing the contour information in global memory and using it in a second kernel is a good idea. Generally, pushing as much calculation onto the GPU as is practical helps reduce slow copies between GPU and system memory.
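A minimal sketch of that structure, with a deliberately simple stand-in for contour detection (counting pixels above a threshold, which is an assumption made only for illustration). One block handles one image, the block size is a multiple of 32, threads cooperate through shared memory, and the per-image result stays in global memory for a second kernel. The names countBright, toFraction, brightFractions and CHECK are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 128  // multiple of 32 and a power of two (needed by the reduction)

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t e_ = (call);                                         \
        if (e_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// First kernel: block b counts the pixels of image b that exceed threshold.
__global__ void countBright(const unsigned char *imgs, int pixelsPerImage,
                            unsigned char threshold, int *counts)
{
    __shared__ int partial[THREADS];
    const unsigned char *img = imgs + (size_t)blockIdx.x * pixelsPerImage;

    int local = 0;
    for (int p = threadIdx.x; p < pixelsPerImage; p += blockDim.x)
        local += (img[p] > threshold);
    partial[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction of the per-thread counts in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        counts[blockIdx.x] = partial[0];  // result stays in global memory
}

// Second kernel: reuses the counts without a round trip through the host.
__global__ void toFraction(const int *counts, int pixelsPerImage,
                           float *fractions, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        fractions[i] = (float)counts[i] / pixelsPerImage;
}

// imgs holds n images of pixelsPerImage bytes each, stored back to back.
std::vector<float> brightFractions(const std::vector<unsigned char> &imgs, int n,
                                   int pixelsPerImage, unsigned char threshold)
{
    unsigned char *d_imgs = nullptr;
    int *d_counts = nullptr;
    float *d_fractions = nullptr;
    CHECK(cudaMalloc((void **)&d_imgs, imgs.size()));
    CHECK(cudaMalloc((void **)&d_counts, n * sizeof(int)));
    CHECK(cudaMalloc((void **)&d_fractions, n * sizeof(float)));
    CHECK(cudaMemcpy(d_imgs, imgs.data(), imgs.size(), cudaMemcpyHostToDevice));

    countBright<<<n, THREADS>>>(d_imgs, pixelsPerImage, threshold, d_counts);
    CHECK(cudaGetLastError());
    toFraction<<<(n + THREADS - 1) / THREADS, THREADS>>>(d_counts, pixelsPerImage,
                                                        d_fractions, n);
    CHECK(cudaGetLastError());

    std::vector<float> fractions(n);
    CHECK(cudaMemcpy(fractions.data(), d_fractions, n * sizeof(float),
                     cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_imgs));
    CHECK(cudaFree(d_counts));
    CHECK(cudaFree(d_fractions));
    return fractions;
}

int main()
{
    const int n = 4, pixels = 64 * 48;
    std::vector<unsigned char> imgs((size_t)n * pixels);
    for (int i = 0; i < n; ++i)  // image i is uniformly filled with the value 80 * i
        for (int p = 0; p < pixels; ++p)
            imgs[(size_t)i * pixels + p] = (unsigned char)(80 * i);

    std::vector<float> f = brightFractions(imgs, n, pixels, 100);
    for (int i = 0; i < n; ++i)
        printf("image %d: fraction above threshold = %.2f\n", i, f[i]);
    return 0;
}
```

Only the final fractions are copied back to the host; the intermediate counts never leave the device.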

Hi, thanks for your comment.

I know that my grid and block settings do not fit the GPU computing model.

However, as far as I know, contour detection does not benefit much from parallel processing.

Maybe there is a way to make it faster with parallel processing.

Actually, contour detection is not very heavy work; in my experience it runs in real time on a single 640x480 image.

If I run it on a CPU, all N images have to be processed sequentially, and it can no longer run in real time as the number of images increases. Thus, I decided to try it on the GPU with the N images in parallel.

I’ll look into my contour detection code to see whether it can be parallelized.