How to allocate memory dynamically in a CUDA kernel

xiaolaji · February 22, 2009, 9:13am

Hi everyone:
I ran the separable convolution code from the SDK examples and found that kernelradius is fixed regardless of the smoothing level (the sigma) of the Gaussian kernel. I would like kernelradius to depend on the sigma of the Gaussian, for example kernelradius = 4 * sigma. Then kernel_w and kernelradius become variables rather than constants. As we all know, in the C language the size of an array must be a constant value. For example, int a[5]; is allowed, but kernelradius = 4 * sigma; int a[kernelradius]; is not.
In C, I can use the "malloc" function to allocate memory dynamically. Which function can I use to allocate memory dynamically in a CUDA kernel?

The following declarations are from the separable convolution code. If KERNEL_W and KERNEL_RADIUS are variables, these declarations are no longer valid. How can I change them, and which function can I use to allocate memory dynamically in a CUDA kernel?
__device__ __constant__ float d_Kernel[KERNEL_W];
__shared__ float data[KERNEL_RADIUS + ROW_TILE_W + KERNEL_RADIUS];

Thank you and best regards.

e.ping · February 23, 2009, 4:47pm

I would recommend defining the d_Kernel array to be some maximum size - the maximum for your target hardware for example. Then, within your kernel, you can access only the elements of that array that make sense for the kernel size.
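As a rough sketch of the maximum-size approach (not the SDK's code): the constant array is declared once with a compile-time upper bound, and the actual radius is passed to the kernel at run time. MAX_KERNEL_RADIUS, radiusForSigma, uploadGaussian and convolveRow are hypothetical names chosen for illustration, the bound of 32 is an arbitrary assumption, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Assumed upper bound on the radius; pick one that fits your use case.
#define MAX_KERNEL_RADIUS 32
#define MAX_KERNEL_W (2 * MAX_KERNEL_RADIUS + 1)

// Sized for the worst case; only the first 2*radius+1 entries are used.
__constant__ float d_Kernel[MAX_KERNEL_W];

// Naive row convolution: one thread per output element, borders clamped.
__global__ void convolveRow(const float *in, float *out, int n, int radius)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        int j = min(max(i + k, 0), n - 1);
        sum += in[j] * d_Kernel[radius + k];
    }
    out[i] = sum;
}

// Returns ceil(4 * sigma), or -1 if it does not fit the constant array.
int radiusForSigma(float sigma)
{
    int radius = (int)ceilf(4.0f * sigma);
    return (radius < 1 || radius > MAX_KERNEL_RADIUS) ? -1 : radius;
}

// Builds a normalized Gaussian on the host and copies it to constant memory.
int uploadGaussian(float sigma)
{
    int radius = radiusForSigma(sigma);
    if (radius < 0) return -1;
    std::vector<float> h(2 * radius + 1);
    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        h[k + radius] = expf(-(float)(k * k) / (2.0f * sigma * sigma));
        sum += h[k + radius];
    }
    for (size_t t = 0; t < h.size(); ++t) h[t] /= sum;
    cudaMemcpyToSymbol(d_Kernel, h.data(), h.size() * sizeof(float));
    return radius;
}

int main()
{
    const int n = 1024, threads = 128;
    int radius = uploadGaussian(2.0f);   // radius becomes 8 at run time
    if (radius < 0) return 1;

    std::vector<float> hIn(n, 1.0f), hOut(n);
    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    convolveRow<<<(n + threads - 1) / threads, threads>>>(dIn, dOut, n, radius);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Constant input and normalized weights, so the value should be close to 1.
    printf("out[%d] = %f\n", n / 2, hOut[n / 2]);
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

The cost of this approach is only the unused tail of the constant array; the kernel itself never reads past index 2 * radius.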

For shared memory, there are some details in the programming guide about specifying the amount of shared memory as part of the <<< >>> execution configuration syntax. (See Section 4.2.3 in the CUDA 2.0 Programming Guide.)
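A minimal sketch of what this can look like, assuming a 1D row convolution: the shared array is declared as an unsized extern array, and its size in bytes is given as the third argument of the execution configuration. The names convolveRowShared, sharedBytes and MAX_KERNEL_W are hypothetical, the box filter is only a stand-in for real Gaussian weights, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define MAX_KERNEL_W 65   // assumed upper bound for the weights array
__constant__ float d_Kernel[MAX_KERNEL_W];

__global__ void convolveRowShared(const float *in, float *out, int n, int radius)
{
    // Size is not known at compile time; it comes from the launch configuration.
    extern __shared__ float data[];

    int tileStart = blockIdx.x * blockDim.x;
    int tileW = blockDim.x + 2 * radius;   // tile plus left and right halo

    // Cooperative load of tile and halo, clamping reads at the row borders.
    for (int s = threadIdx.x; s < tileW; s += blockDim.x) {
        int j = min(max(tileStart + s - radius, 0), n - 1);
        data[s] = in[j];
    }
    __syncthreads();

    int i = tileStart + threadIdx.x;
    if (i < n) {
        float sum = 0.0f;
        for (int k = -radius; k <= radius; ++k)
            sum += data[threadIdx.x + radius + k] * d_Kernel[radius + k];
        out[i] = sum;
    }
}

// Bytes of dynamic shared memory needed for one block.
size_t sharedBytes(int threadsPerBlock, int radius)
{
    return (size_t)(threadsPerBlock + 2 * radius) * sizeof(float);
}

int main()
{
    const int n = 1024, threads = 128;
    int radius = 4;   // could equally be computed from sigma at run time

    std::vector<float> hK(2 * radius + 1, 1.0f / (2 * radius + 1));
    cudaMemcpyToSymbol(d_Kernel, hK.data(), hK.size() * sizeof(float));

    std::vector<float> hIn(n, 1.0f), hOut(n);
    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: dynamic shared memory size in bytes.
    convolveRowShared<<<(n + threads - 1) / threads, threads,
                        sharedBytes(threads, radius)>>>(dIn, dOut, n, radius);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[0] = %f\n", hOut[0]);
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Note that all extern shared arrays in a kernel start at the same address, so a kernel needing several dynamically sized arrays has to carve them out of one buffer by hand.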

Another technique that might be useful is the use of C++ templates - you can write a template kernel function that takes an unsigned integer template parameter. You can then use that template parameter as a constant within the kernel template. Then, in your host code, you can use a “switch” statement to determine which kernel to invoke. This technique can create long compile times and large .cubin file sizes when you have a LARGE number of template instantiations, since nvcc will generate a separate kernel for each template instantiation.
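The template idea could look roughly like the sketch below. The kernel name convolveRowT, the dispatcher launchConvolveRow and the set of supported radii (2, 4, 8) are all assumptions made for illustration. The template parameter is a signed int here rather than unsigned, purely to keep the loop arithmetic simple.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// RADIUS is a compile-time constant inside the kernel, so the loop can unroll.
template <int RADIUS>
__global__ void convolveRowT(const float *in, const float *weights,
                             float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
#pragma unroll
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int j = min(max(i + k, 0), n - 1);   // clamp at the borders
        sum += in[j] * weights[RADIUS + k];
    }
    out[i] = sum;
}

// Host-side switch that maps a run-time radius to a template instantiation.
// Returns false if no kernel was compiled for the requested radius.
bool launchConvolveRow(int radius, const float *in, const float *weights,
                       float *out, int n)
{
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    switch (radius) {
    case 2: convolveRowT<2><<<blocks, threads>>>(in, weights, out, n); return true;
    case 4: convolveRowT<4><<<blocks, threads>>>(in, weights, out, n); return true;
    case 8: convolveRowT<8><<<blocks, threads>>>(in, weights, out, n); return true;
    default: return false;   // each extra case adds one more compiled kernel
    }
}

int main()
{
    const int n = 1024, radius = 4;
    std::vector<float> hW(2 * radius + 1, 1.0f / (2 * radius + 1));
    std::vector<float> hIn(n, 1.0f), hOut(n);

    float *dIn, *dW, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dW, hW.size() * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, hW.data(), hW.size() * sizeof(float), cudaMemcpyHostToDevice);

    if (!launchConvolveRow(radius, dIn, dW, dOut, n))
        printf("no kernel instantiated for radius %d\n", radius);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", hOut[0]);

    cudaFree(dIn);
    cudaFree(dW);
    cudaFree(dOut);
    return 0;
}
```

With a rule such as radius = 4 * sigma, this only works if sigma is restricted to values that map onto one of the instantiated radii; otherwise the run-time radius argument from the other two approaches is the more flexible choice.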

Hope this helps…

Jeremy Furtek

xiaolaji · February 24, 2009, 2:45am

Thank you very much! Your suggestion is very useful to me!