Assign __device__ __constant__ array size at runtime

pathfinder02 · August 30, 2008, 2:10am

Hi, I have a __device__ __constant__ 1D array declared in the kernel file, and it is loaded with cudaMemcpyToSymbol on the host. It holds a separable convolution kernel, which is first computed on the host and then copied into the already-declared constant array. My problem: if the kernel size on the host changes, how do I change the array size in device constant memory? The change happens at runtime, but the size of the constant array is fixed at compile time.
Is there a way I can do that?

Section 4.2.2.4 of the CUDA Programming Guide says that shared and constant memory have static storage, but just above that section it gives an example of dynamically allocated shared memory that is declared as an extern variable.
I am just a little confused by this.
Any help is appreciated, thanks.

James_Malcolm1 · August 30, 2008, 3:39am

I think that you’re correct in there being no way to resize a constant array.

However, since you’re using it to store the kernel, it’s likely you’re only using a few dozen bytes, correct? You have 64 KB of constant memory available on the GPU in Compute Capability 1.X, so why not statically allocate an array larger than you’d ever need and then only reference the part you actually populate? In the rare case that the needed kernel is larger than what you’ve statically allocated, just fall back to using regular global device memory.
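
A minimal sketch of that idea, assuming a 1D row pass of a separable convolution. The names MAX_KERNEL_LEN, d_Kernel, uploadKernel, convolveRow and runRowPass are made up for illustration, and error checking is mostly omitted. The constant array has a fixed compile-time capacity, but only the first len floats are copied and read, so the effective kernel size is a runtime value.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define MAX_KERNEL_LEN 256   // hypothetical upper bound: 1 KB of the 64 KB available

__constant__ float d_Kernel[MAX_KERNEL_LEN];   // capacity is fixed at compile time

// Copy only the taps actually in use. Returns false if they do not fit,
// in which case the caller would need another strategy (e.g. global memory).
bool uploadKernel(const float *h_kernel, int len) {
    if (len <= 0 || len > MAX_KERNEL_LEN) return false;
    return cudaMemcpyToSymbol(d_Kernel, h_kernel, len * sizeof(float)) == cudaSuccess;
}

// Row pass with clamped borders. len is a runtime value; the taps are applied
// without flipping, which is equivalent to convolution for symmetric kernels.
__global__ void convolveRow(const float *in, float *out, int n, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int r = len / 2;
    float sum = 0.0f;
    for (int k = 0; k < len; ++k) {
        int j = min(max(i + k - r, 0), n - 1);   // clamp to the valid range
        sum += in[j] * d_Kernel[k];
    }
    out[i] = sum;
}

// Host-side demo helper: uploads the taps, runs the pass, returns the result.
// Returns an empty vector if the input is empty or the taps do not fit.
std::vector<float> runRowPass(const std::vector<float> &in,
                              const std::vector<float> &taps) {
    int n = (int)in.size();
    int len = (int)taps.size();
    if (n == 0 || !uploadKernel(taps.data(), len)) return {};

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    convolveRow<<<blocks, threads>>>(d_in, d_out, n, len);

    std::vector<float> out(n, 0.0f);
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling runRowPass again with a different number of taps just overwrites the front of d_Kernel; nothing has to be recompiled or reallocated.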

Alternatively, you can safely push over about 200 bytes directly into shared extern memory before launching your kernels (see more). If your kernels can fit into that, it’s fast since that part of the call stack is transferred over anyway.
-James

pathfinder02 · August 30, 2008, 8:35am

I tried compiling the code with the statement below and the compiler accepted it. What does this statement declare, and is it useful for solving my problem?

extern __device__ __constant__ float *d_Kernel;

I like your idea of just allocating one big kernel; 64 KB is more than I need.

Would you say that it is the solution that would run fastest?

James_Malcolm1 · August 30, 2008, 4:48pm

Did it produce an error message? I did a quick experiment on my laptop: that statement compiles on my machine, but it goes haywire when I run it. I think you always have to declare the size you want. Also, you can drop the “extern” if it’s only declared in one spot, and you can drop the “__device__” since “__constant__” implies it’s out on the device.

__constant__ float d_Kernel[15*15];

Yes, 64 KB is definitely more than you need. I’m guessing you won’t convolve with anything larger than 15x15 (using simple loop-based convolution). So maybe allocate 15*15=225 floats for your kernel, and define a macro for access:

__constant__ float kernel[15*15];

#define KERN(x,y,h)  kernel[(x)*(h) + (y)]  // h is whatever height you're currently using

Then, if your kernel is larger than 15x15, put the kernel in global memory.
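
Here is one way the constant/global fallback could look. This is only a sketch: c_taps, uploadTaps, tapAt, sumTaps and sumOnDevice are hypothetical names, error checking is left out, and the toy sumTaps kernel just adds up the taps so that both paths can be checked from the host. The device code takes a pointer that is null when the taps live in constant memory and points to a global-memory copy otherwise.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define CONST_CAPACITY (15 * 15)   // same 15x15 capacity as suggested in the post

__constant__ float c_taps[CONST_CAPACITY];

// Uploads the taps. Returns nullptr when they fit in constant memory,
// otherwise a cudaMalloc'd global-memory copy that the caller must cudaFree.
float *uploadTaps(const float *h_taps, int count) {
    if (count <= CONST_CAPACITY) {
        cudaMemcpyToSymbol(c_taps, h_taps, count * sizeof(float));
        return nullptr;
    }
    float *d_taps = nullptr;
    cudaMalloc(&d_taps, count * sizeof(float));
    cudaMemcpy(d_taps, h_taps, count * sizeof(float), cudaMemcpyHostToDevice);
    return d_taps;
}

// Reads tap i from whichever memory space currently holds the kernel.
__device__ float tapAt(const float *g_taps, int i) {
    return g_taps ? g_taps[i] : c_taps[i];
}

// Toy kernel: a single thread sums all taps into *result.
__global__ void sumTaps(const float *g_taps, int count, float *result) {
    float s = 0.0f;
    for (int i = 0; i < count; ++i) s += tapAt(g_taps, i);
    *result = s;
}

// Host-side demo helper: uploads the taps and returns their sum as computed
// on the device, using constant or global memory depending on the size.
float sumOnDevice(const std::vector<float> &taps) {
    int count = (int)taps.size();
    float *d_fallback = uploadTaps(taps.data(), count);

    float *d_result = nullptr;
    float h_result = 0.0f;
    cudaMalloc(&d_result, sizeof(float));
    sumTaps<<<1, 1>>>(d_fallback, count, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_result);
    cudaFree(d_fallback);   // cudaFree(nullptr) is a no-op
    return h_result;
}
```

A real convolution kernel would call tapAt (or an equivalent of the KERN macro) in its inner loop; the branch is uniform across the grid, so it should not cause divergence.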

If you’re going to use constant or global device memory like that, you should have one of the threads first pull it all into shared memory for each block. This will avoid bank conflicts and load stalls.

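// Renamed from kernel() so the function does not clash with the __constant__ array kernel[].
__global__ void convolve(/* ..., */ int n) {  // n = number of kernel floats in use
  extern __shared__ float my_kernel[];        // unsized array; size comes from the third <<<...>>> launch argument

  if (threadIdx.x == 0) {
    for (int i = 0; i < n; ++i) my_kernel[i] = kernel[i];  // copy from the __constant__ array
  }
  __syncthreads();  // wait until the copy is visible to the whole block

  // ... use my_kernel
}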
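
For completeness, a self-contained sketch of the same staging pattern together with its launch. The names convolveShared and runShared are hypothetical, the convolution is a simple 1D pass with clamped borders, and error checking is omitted. The point to notice is that an extern __shared__ array is declared without a size, and the number of bytes is supplied at runtime as the third launch-configuration argument.

```cuda
#include <vector>
#include <cuda_runtime.h>

__constant__ float kernel[15 * 15];   // same constant array as in the post

__global__ void convolveShared(const float *in, float *out, int n, int len) {
    extern __shared__ float my_kernel[];   // len floats, sized at launch time

    // One thread per block stages the taps, as described in the post.
    if (threadIdx.x == 0) {
        for (int i = 0; i < len; ++i) my_kernel[i] = kernel[i];
    }
    __syncthreads();   // every thread must reach this, so no early return above

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int r = len / 2;
        float sum = 0.0f;
        for (int k = 0; k < len; ++k) {
            int j = min(max(i + k - r, 0), n - 1);   // clamp to the valid range
            sum += in[j] * my_kernel[k];
        }
        out[i] = sum;
    }
}

// Host-side demo helper. Returns an empty vector if the input is empty or
// the taps do not fit into the 15*15 constant array.
std::vector<float> runShared(const std::vector<float> &in,
                             const std::vector<float> &taps) {
    int n = (int)in.size();
    int len = (int)taps.size();
    if (n == 0 || len <= 0 || len > 15 * 15) return {};

    cudaMemcpyToSymbol(kernel, taps.data(), len * sizeof(float));

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    size_t sharedBytes = len * sizeof(float);   // dynamic shared memory per block
    convolveShared<<<blocks, threads, sharedBytes>>>(d_in, d_out, n, len);

    std::vector<float> out(n, 0.0f);
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Whether the extra copy into shared memory actually beats reading the constant array directly is something to measure on your own hardware; the sketch only shows the mechanics.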

I would suggest using a texture for even faster convolution, but this technique here should be a good start.