Memory allocation from Device onto Device (Global) Memory

Hi All,

I am working on an image processing code on the GPU. The input is a large image. My application needs multiple copies of this large image on the GPU. Needless to say I am looking to optimize the memory usage.

We know that the overhead of a device-to-device transfer (within device memory) is much lower than the overhead of a transfer between device memory and host memory. So my thought was to transfer only a single copy of the image from host to device, and then make multiple copies of it on the device. But I am unable to perform an “equivalent of malloc” from kernel code on the device.

Suppose I need a “copy_of_image” of size “numPixels” on the GPU. I can declare the pointer “float *copy_of_image” either as a device variable outside all functions or as a local variable inside the kernel function. In either case, I cannot allocate “numPixels * sizeof(float)” bytes for it from within the kernel.

What functions or CUDA calls can I use to do that? I am not using the CUDA Driver API, since I am using the CUDA Runtime API and I understand that only one of them can be used in an application.

I need some ideas on device-to-device memory interaction, especially memory allocation.

Thanks & regards,


I was looking at the “Bandwidth Test” code in the SDK. From the “DeviceToDeviceTransfer” function, I gather that we do a cudaMalloc and then a cudaMemcpy with cudaMemcpyDeviceToDevice. But all of this happens before the kernel is even invoked. How does it have less overhead then? What is the sequence of operations in this case?
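For reference, the pattern used by the SDK bandwidth test boils down to the following sketch (buffer size is an assumption here; both allocations and the copy are issued from host code, but the copy itself happens entirely within device memory):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t numPixels = 1024 * 1024;           // example size (assumption)
    const size_t bytes = numPixels * sizeof(float);

    float *d_src = NULL, *d_dst = NULL;
    cudaMalloc((void**)&d_src, bytes);              // both buffers live in device global memory
    cudaMalloc((void**)&d_dst, bytes);

    // Device-to-device copy: the data never crosses the PCIe bus,
    // which is why it runs at device-memory bandwidth rather than
    // host-transfer bandwidth, even though the call is made from the host.
    cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}
```

The host only enqueues the copy; the actual data movement is performed by the GPU between two regions of its own memory.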

Also, the reported DeviceToDevice bandwidth is around 57948.4 MB/s. If I have three consecutive “cudaMemcpyDeviceToDevice” calls (with a far smaller total amount of data to be transferred), will they all happen in one access? If not, how do I exploit this large bandwidth?

Thanks & regards,


You cannot allocate global memory inside kernels at all. You have to do it on the host side with cudaMalloc, but you can then pass the pointers to your kernel function, along with the array sizes. The kernel itself can then perform the array copy, which, depending on your data, may give the best bandwidth.
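A minimal sketch of that approach, assuming the image has already been uploaded once from the host (the kernel name and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

// Element-wise copy kernel: each thread copies one pixel.
__global__ void copyImage(const float *src, float *dst, size_t numPixels) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels)
        dst[i] = src[i];
}

int main() {
    const size_t numPixels = 1024 * 1024;           // example image size (assumption)
    const size_t bytes = numPixels * sizeof(float);

    float *d_image = NULL, *d_copy = NULL;
    cudaMalloc((void**)&d_image, bytes);            // original image
    cudaMalloc((void**)&d_copy, bytes);             // second copy, allocated on the host side

    // ... one cudaMemcpy(d_image, h_image, bytes, cudaMemcpyHostToDevice) here ...

    // Duplicate the image entirely on the device; no host traffic involved.
    const int threads = 256;
    const int blocks  = (int)((numPixels + threads - 1) / threads);
    copyImage<<<blocks, threads>>>(d_image, d_copy, numPixels);
    cudaDeviceSynchronize();

    cudaFree(d_image);
    cudaFree(d_copy);
    return 0;
}
```

Allocating every copy up front with cudaMalloc on the host, then filling them with a kernel (or cudaMemcpyDeviceToDevice), keeps the single expensive host-to-device transfer to exactly one.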