Kernel requiring large number of parameters

vf10a · July 1, 2008, 6:42pm

Hi, I’m looking to convert a program to CUDA. I have a function that does a number of calculations based on a series of quite large data arrays and variables (some of which are changed during the calculation). All but one of these are available globally in my normal program, so the function has only one parameter.

What I want to do is make this function a kernel. What I would like to know is the best way to pass all of the information needed to the device for use by the kernel function. In particular I was wondering how parameters passed to the kernel directly are stored on the device and what sort of limit there is on this?

Thanks for any help.

Romant · July 1, 2008, 7:09pm

All data that the kernel is going to use must be placed on the device. Basically, this can be done using cudaMalloc/cudaMemcpy (global memory). Look at the BlackScholes sample in the SDK; it is very simple but informative. You’ll see how to handle an input data set of any size on an arbitrary grid, a coalesced memory access pattern, and host<->device data transfer practice.

Also, for constant data you may use constant memory (please refer to the programming guide).
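
As a rough illustration of the constant-memory suggestion above, here is a minimal sketch. The names `c_coeffs`, `scale_kernel` and `scale_on_device` are made up for this example and are not from the SDK; error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>

// Hypothetical read-only coefficients, visible to every thread.
__constant__ float c_coeffs[4];

__global__ void scale_kernel(float* d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= c_coeffs[i % 4];
}

// Hypothetical host helper: scales h_data in place on the GPU.
void scale_on_device(float* h_data, int n, const float* h_coeffs)
{
    if (n <= 0) return;
    float* d_data = nullptr;
    size_t bytes = n * sizeof(float);

    // Constant memory is filled from the host by symbol, not through a pointer argument.
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, 4 * sizeof(float));

    cudaMalloc((void**)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```

The kernel never receives the coefficients as a parameter, so they take up no space in the kernel argument list.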

ColinS · July 1, 2008, 11:15pm

Hi. You will need to do a couple of things.

For each global array, you will need to allocate memory which resides on the actual GPU card. You will then need to copy the data from each array into the memory residing on the device.
You mentioned that you also use a couple global variables (I’m assuming they aren’t arrays). If the function does NOT need to change the variable, you do not need to copy the variable to device memory. Just include the variable as a parameter. However, if the function actually needs to change the variable, you will need to allocate memory on the device, and copy the variable to the allocated memory locations.

Okay, so now you have a bunch of arrays and variables which reside on the GPU card. Great! Now simply pass in the pointers which reside on the GPU card as normal parameters. The kernel can read from and write to the arrays just like in any C program.
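
The steps above could look roughly like the following sketch. All names here (`scale_and_count`, `d_values`, `d_count`, `threshold`) are invented for illustration, and error checking is omitted. The read-only scalars `scale` and `threshold` are passed by value, while the counter that the kernel modifies lives in device memory and is copied back afterwards.

```cuda
#include <cuda_runtime.h>

// 'scale' and 'threshold' are only read, so they are plain by-value parameters.
// 'd_count' is modified by the kernel, so it must point to device memory.
__global__ void scale_and_count_kernel(float* d_values, int n,
                                       float scale, float threshold,
                                       int* d_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d_values[i] *= scale;
        if (d_values[i] > threshold)
            atomicAdd(d_count, 1);  // several threads may update the counter
    }
}

// Hypothetical host helper: scales h_values in place and returns how many
// scaled values exceed 'threshold'.
int scale_and_count(float* h_values, int n, float scale, float threshold)
{
    if (n <= 0) return 0;
    float* d_values = nullptr;
    int*   d_count  = nullptr;
    int    h_count  = 0;
    size_t bytes    = n * sizeof(float);

    // Allocate device memory and copy the host data over.
    cudaMalloc((void**)&d_values, bytes);
    cudaMalloc((void**)&d_count, sizeof(int));
    cudaMemcpy(d_values, h_values, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);

    // Device pointers are passed like any other parameter.
    scale_and_count_kernel<<<(n + 255) / 256, 256>>>(d_values, n, scale,
                                                     threshold, d_count);

    // Copy the results back to the host.
    cudaMemcpy(h_values, d_values, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_count);
    return h_count;
}
```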

senorbum · July 2, 2008, 3:19pm

The limit on this would be the amount of global memory your card has available. Unless you are working with verrrrry large arrays (or, more likely, a large matrix) you should be OK with the amount of global memory (not the quickest, but necessary for large matrices/very large arrays).

StickGuy · July 2, 2008, 10:38pm

I would be careful with this. From what I’ve seen, function parameters are stored in shared memory or registers. If you need the shared memory space or registers, it might be better to use constant memory.

MisterAnderson42 · July 3, 2008, 12:09am

Function arguments are stored in shared memory, and are limited to 256 bytes.

vf10a · July 3, 2008, 3:26pm

Thanks for the replies everyone. They have been very helpful.

espe · July 3, 2008, 5:29pm

How can they be stored in shared memory, since you allocate the space in advance in global memory (with cudaMalloc)?

Maybe you guys have an answer to another related question of mine… take a look here if you want: http://forums.nvidia.com/index.php?showtopic=71499

E.D_Riedijk · July 3, 2008, 6:03pm

The pointer to global memory itself is stored in shared memory (since every thread needs to have access to that pointer).
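
Since only the pointer counts against the argument limit, one way to pass a large set of parameters is to pack them into a struct in global memory and hand the kernel a single pointer. The sketch below assumes a made-up `SimParams` struct that is bigger than the 256 bytes quoted above; none of these names come from the thread, and error checking is omitted. (That 256-byte figure applies to hardware of that period; as far as I know, later GPUs allow larger argument lists, but the approach is the same.)

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter block, larger than 256 bytes.
struct SimParams {
    int   n;             // number of output elements
    float gain;
    float offsets[128];  // 512 bytes on its own
};

// Only one pointer to the struct and one to the output are kernel arguments.
__global__ void apply_params(const SimParams* d_params, float* d_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < d_params->n)
        d_out[i] = d_params->gain * i + d_params->offsets[i % 128];
}

// Hypothetical host helper: fills h_out[0 .. h_params->n - 1].
void run_with_params(const SimParams* h_params, float* h_out)
{
    int n = h_params->n;
    if (n <= 0) return;
    SimParams* d_params = nullptr;
    float*     d_out    = nullptr;

    cudaMalloc((void**)&d_params, sizeof(SimParams));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_params, h_params, sizeof(SimParams), cudaMemcpyHostToDevice);

    apply_params<<<(n + 255) / 256, 256>>>(d_params, d_out);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_params);
    cudaFree(d_out);
}
```

If the kernel needs to change some of the fields, drop the `const` and copy the struct back to the host after the launch.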

StickGuy · July 10, 2008, 3:13pm

I have a related question to this then. Do you get bank conflicts when reading function parameters?

MisterAnderson42 · July 10, 2008, 3:30pm

No. Because all threads in a warp read the same argument at the same time, the shared memory broadcast mechanism is invoked.

Pittsburgh · September 5, 2008, 5:12pm

I am also converting a program to run with CUDA and I have a question that is kind of related to this thread discussion.

I have a struct object I am passing to the kernel that is of decent size. I realized in the middle of design that many of the parameters within the struct will need to be updated on a per-thread basis.

Rather than make a copy of the struct for each thread, which would take up substantial shared memory, I decided to pull the parameters that need to be modified out of the struct and try to make them global. The thing is, they need to be global on a per-thread basis and be accessible to all device functions. I do not want to use __device__ to declare these variables because then they would be accessible by all threads.

For example, can I do something like this:

__device__ void func()
{
    :: update variable x FOR THIS THREAD ONLY
}

__global__ void kernel_func(struct Z)
{
    __shared__ struct c_Z;
    :: declare variable x here such that it is accessible from any device function FOR THIS THREAD ONLY
    func();
}

Or must I declare a shared memory array for each of those values? The size of which would be the number of threads per block - not an ideal solution for me because I am going to be using substantial shared memory as it is.

Thanks for any help.

Pittsburgh · September 5, 2008, 5:38pm

One thing I also tried was:

__device__ void GPU_CIRCUIT_StatisticalDelay()
{
    GATE* gptr = NULL;
    NorRandVar01[threadIdx.x] = AngelaNormalDis(0, 1);
}

__global__ void DefectSim_kernel(GPU_C* g_Circuit)
{
    __shared__ double NorRandVar01[BLOCK_SIZE];
    NorRandVar01[threadIdx.x] = 0.0;
    GPU_CIRCUIT_StatisticalDelay();
}

But I get the error: identifier “NorRandVar01” is undefined in the StatisticalDelay function. Can shared memory not be accessed from a called function?

Thanks.

tmurray · September 5, 2008, 5:55pm

You realize that wouldn’t work in C either, right? If it’s really just two lines, I wouldn’t make it a function call in the first place. You could try passing the shmem array as an argument to StatisticalDelay, but it probably won’t work in the current version of CUDA (“warning: can’t determine address location, assuming global memory” or something to that effect).
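
For reference, passing the shared array as an argument would look something like the sketch below. The names `fill_slot`, `shared_arg_kernel` and `run_shared_arg` are invented here, and `BLOCK_SIZE` is given an arbitrary value. As far as I know, more recent toolkits and GPUs handle this without the warning mentioned above, but treat that as an assumption and check it against your own compiler.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 64  // arbitrary value for this sketch

// The device function gets the shared array through an ordinary pointer parameter.
__device__ void fill_slot(double* slots, double value)
{
    slots[threadIdx.x] = value;
}

__global__ void shared_arg_kernel(double* d_out)
{
    __shared__ double slots[BLOCK_SIZE];

    fill_slot(slots, 2.0 * threadIdx.x);
    __syncthreads();  // make every thread's write visible to the block

    // Read the neighbouring thread's slot to show the data really is shared.
    d_out[threadIdx.x] = slots[(threadIdx.x + 1) % BLOCK_SIZE];
}

// Hypothetical host helper: h_out must hold BLOCK_SIZE doubles.
void run_shared_arg(double* h_out)
{
    double* d_out = nullptr;
    size_t bytes = BLOCK_SIZE * sizeof(double);

    cudaMalloc((void**)&d_out, bytes);
    shared_arg_kernel<<<1, BLOCK_SIZE>>>(d_out);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```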

Pittsburgh · September 5, 2008, 6:32pm

The function is more than just those lines, I just simplified the case for this post. I realize it wouldn’t work in C typically unless you pass it to the function, but I didn’t know if shared memory in CUDA worked differently.

Right now I am trying to declare the shared array outside of the kernel to see if that approach will work.
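
For the per-thread variable asked about earlier, one option that needs no shared memory at all is to declare it as an ordinary local variable in the kernel and pass its address to the device functions. The sketch below assumes a made-up `ThreadState` struct and helper names; it is one possible approach, not necessarily what was used in the end.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-thread state pulled out of the larger struct.
struct ThreadState {
    double x;
};

// Updates the calling thread's own copy only.
__device__ void func(ThreadState* state, double delta)
{
    state->x += delta;
}

__global__ void per_thread_kernel(double* d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // A local variable: every thread gets a private copy
    // (held in registers or local memory, not shared memory).
    ThreadState state;
    state.x = 0.0;

    func(&state, 1.0);
    func(&state, (double)i);

    d_out[i] = state.x;  // 1 + i for thread i
}

// Hypothetical host helper: fills h_out[0 .. n-1].
void run_per_thread(double* h_out, int n)
{
    if (n <= 0) return;
    double* d_out = nullptr;
    size_t bytes = n * sizeof(double);

    cudaMalloc((void**)&d_out, bytes);
    per_thread_kernel<<<(n + 127) / 128, 128>>>(d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```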
