Kernel requiring a large number of parameters

Hi, I’m looking to convert a program to CUDA. I have a function that does a number of calculations based on a series of quite large data arrays and variables (some of which are changed during the calculation). All but one of these are available globally in my normal program, so the function has only one parameter.

What I want to do is make this function a kernel. What I would like to know is the best way to pass all of the information needed to the device for use by the kernel function. In particular I was wondering how parameters passed to the kernel directly are stored on the device and what sort of limit there is on this?

Thanks for any help.

All data that the kernel uses must first be placed on the device. Basically, this is done with cudaMalloc/cudaMemcpy (global memory). Look at the BlackScholes sample in the SDK; it is very simple but informative. It shows how to process an input data set of any size on an arbitrary grid, how to get a coalesced memory access pattern, and how to transfer data between host and device.

For constant data you can also use constant memory (please refer to the programming guide).
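As a rough illustration of the constant-memory suggestion, here is a minimal sketch. The names d_coeffs, apply_coeffs and upload_coeffs are made up for this example and do not come from the original program; only __constant__ and cudaMemcpyToSymbol are real CUDA features.

```cuda
#include <cuda_runtime.h>

// Hypothetical read-only coefficients, visible to every thread.
__constant__ float d_coeffs[4];

// Reads the coefficients from constant memory; in and out are device pointers.
__global__ void apply_coeffs(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = d_coeffs[0] + d_coeffs[1] * in[i];
    }
}

// Host helper (hypothetical): copies four floats into the constant array.
cudaError_t upload_coeffs(const float (&h_coeffs)[4])
{
    return cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(d_coeffs));
}
```

The upload has to happen before the kernel launch; the kernel itself needs no extra parameter for the coefficients.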

Hi. You will need to do a couple of things.

  1. For each global array, you will need to allocate memory which resides on the actual GPU card. You will then need to copy the data from each array into the memory residing on the device.

  2. You mentioned that you also use a couple global variables (I’m assuming they aren’t arrays). If the function does NOT need to change the variable, you do not need to copy the variable to device memory. Just include the variable as a parameter. However, if the function actually needs to change the variable, you will need to allocate memory on the device, and copy the variable to the allocated memory locations.

Okay, so now you have a bunch of arrays and variables which reside on the GPU card. Great! Now simply pass in the pointers which reside on the GPU card as normal parameters. The kernel can read from and write to the arrays just like in any C program.

The limit on this is the amount of global memory your card has available. Unless you are working with verrrrry large arrays (or, more likely, a large matrix), you should be OK with the amount of global memory (not the quickest memory, but necessary for large matrices and very large arrays).
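The steps above could look roughly like the sketch below. The kernel scale_and_count and the helper run_scale_and_count are hypothetical names invented for illustration: the array goes to the device with cudaMalloc/cudaMemcpy, the read-only scalar scale is passed by value, and the counter that the kernel modifies gets its own device allocation.

```cuda
#include <cuda_runtime.h>
#include <vector>

// scale is read-only, so it is passed by value; counter is modified by the
// kernel, so it lives in device memory and is passed as a pointer.
__global__ void scale_and_count(float* data, int n, float scale, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= scale;
        atomicAdd(counter, 1);
    }
}

// Host helper (hypothetical): scales h_data in place on the GPU and returns
// the number of elements the kernel processed, or -1 on a CUDA error.
int run_scale_and_count(std::vector<float>& h_data, float scale)
{
    const int n = static_cast<int>(h_data.size());
    if (n == 0) return 0;
    const size_t bytes = n * sizeof(float);
    float* d_data = nullptr;
    int* d_counter = nullptr;
    int h_counter = 0;

    // Step 1: allocate device memory and copy the array over.
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_counter, sizeof(int)) != cudaSuccess) {
        cudaFree(d_data);
        return -1;
    }
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    // Step 2: the modified variable also needs a device copy.
    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);

    // Device pointers and by-value scalars are passed as normal parameters.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_and_count<<<blocks, threads>>>(d_data, n, scale, d_counter);

    // Copy the results back; cudaMemcpy waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) {
        err = cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_data);
    cudaFree(d_counter);
    return err == cudaSuccess ? h_counter : -1;
}
```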

I would be careful with this. From what I’ve seen, function parameters are stored in shared memory or registers. If you need the shared memory space or registers, it might be better to use constant memory.

Function arguments are stored in shared memory, and are limited to 256 bytes.
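One way to keep a long parameter list manageable is to bundle the device pointers and read-only scalars into a single struct passed by value. The sketch below is only an illustration: KernelParams, transform and launch_transform are hypothetical names. The 256-byte figure applies to the compute capability 1.x hardware discussed in this thread; later architectures pass arguments through constant memory and allow 4 KB.

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter bundle: device pointers plus read-only scalars.
struct KernelParams {
    const float* input;   // device pointer
    float*       output;  // device pointer
    int          n;
    float        gain;
    float        offset;
};

// Compile-time guard against exceeding the 256-byte argument limit.
static_assert(sizeof(KernelParams) <= 256, "kernel arguments exceed 256 bytes");

// The whole struct is copied into the kernel's argument space at launch.
__global__ void transform(KernelParams p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n) {
        p.output[i] = p.gain * p.input[i] + p.offset;
    }
}

// Host helper (hypothetical): d_in and d_out must already be device pointers
// obtained from cudaMalloc, and n must be greater than zero.
cudaError_t launch_transform(const float* d_in, float* d_out, int n,
                             float gain, float offset)
{
    KernelParams p{d_in, d_out, n, gain, offset};
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    transform<<<blocks, threads>>>(p);
    return cudaGetLastError();
}
```

Anything too large for the struct, or anything the kernel must modify, still belongs in global or constant memory, with only the pointer stored in the struct.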

Thanks for the replies everyone. They have been very helpful.

How can the arguments be stored in shared memory when the space is allocated in advance in global memory (with cudaMalloc)?

Maybe you guys have an answer to another related question of mine… take a look here if you want:

Only the pointer to global memory is stored in shared memory (since every thread needs access to that pointer); the data it points to stays in global memory.

I have a related question to this then. Do you get bank conflicts when reading function parameters?

No. Because all threads in a warp read the same argument at the same time, the shared memory broadcast mechanism is invoked.

I am also converting a program to run with CUDA and I have a question that is kind of related to this thread discussion.

I have a struct object I am passing to the kernel that is of decent size. I realized in the middle of design that many of the parameters within the struct will need to be updated on a per-thread basis.

Rather than make a copy of the struct for each thread, which would take up substantial shared memory, I decided to pull the parameters that need to be modified out of the struct and make them global. The catch is that they need to be global on a per-thread basis and accessible to all device functions. I do not want to declare these variables with __device__, because then they would be shared by all threads.

For example, can I do something like this:

__device__ void func()
{
    // update variable x FOR THIS THREAD ONLY
}

__global__ void kernel_func(struct Z z)
{
    __shared__ struct Z c_Z;

    // declare variable x here such that it is accessible from any device function FOR THIS THREAD ONLY
}

Or must I declare a shared memory array for each of those values? The size of which would be the number of threads per block - not an ideal solution for me because I am going to be using substantial shared memory as it is.

Thanks for any help.

One thing I also tried was:

__device__ void GPU_CIRCUIT_StatisticalDelay()
{
	GATE* gptr = NULL;

	NorRandVar01[threadIdx.x] = AngelaNormalDis(0, 1);
}

__global__ void DefectSim_kernel(GPU_C* g_Circuit)
{
	__shared__ double NorRandVar01[BLOCK_SIZE];

	NorRandVar01[threadIdx.x] = 0.0;
}

But I get the error: identifier “NorRandVar01” is undefined in the StatisticalDelay function. Can shared memory declared in the kernel not be referenced from a function it calls?


You realize that wouldn’t work in C either, right? If it’s really just two lines, I wouldn’t make it a function call in the first place. You could try passing the shmem array as an argument to StatisticalDelay, but it probably won’t work in the current version of CUDA (“warning: can’t determine address location, assuming global memory” or something to that effect).
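For reference, passing the shared array as an argument might look like the sketch below. It assumes a toolkit and device where the compiler can handle a shared-memory pointer passed to a device function, which, as noted above, may not hold for the CUDA version in use here. The names store_sample, sample_kernel and run_sample_kernel are hypothetical, and the stored value is a placeholder rather than a real random draw.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128

// The shared array arrives as a pointer, so the device function no longer
// depends on a name that is only in scope inside the kernel.
__device__ void store_sample(double* samples, double value)
{
    samples[threadIdx.x] = value;
}

// Must be launched with blockDim.x equal to BLOCK_SIZE.
__global__ void sample_kernel(double* out)
{
    __shared__ double samples[BLOCK_SIZE];

    samples[threadIdx.x] = 0.0;
    // Placeholder standing in for the thread's random draw.
    store_sample(samples, 2.0 * threadIdx.x);
    __syncthreads();

    // Each thread reads its neighbour's slot to show the array is shared.
    int neighbour = (threadIdx.x + 1) % blockDim.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = samples[neighbour];
}

// Host helper (hypothetical): runs one block and returns out[0],
// or -1.0 on a CUDA error.
double run_sample_kernel()
{
    double* d_out = nullptr;
    double first = -1.0;
    if (cudaMalloc(&d_out, BLOCK_SIZE * sizeof(double)) != cudaSuccess) {
        return -1.0;
    }
    sample_kernel<<<1, BLOCK_SIZE>>>(d_out);
    if (cudaMemcpy(&first, d_out, sizeof(double), cudaMemcpyDeviceToHost) != cudaSuccess) {
        first = -1.0;
    }
    cudaFree(d_out);
    return first;
}
```

The same pattern works for a value that should be private to one thread: declare it as an ordinary local variable in the kernel and pass it to the device function by pointer or reference.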

The function is more than just those lines, I just simplified the case for this post. I realize it wouldn’t work in C typically unless you pass it to the function, but I didn’t know if shared memory in CUDA worked differently.

Right now I am trying to declare the shared array outside of the kernel to see if that approach will work.