Problem passing memory from host to device

Hi all,

I am new to CUDA programming and have run into a problem passing memory to a kernel.

I need to pass a large number of variables into a kernel, but the total size of the arguments exceeds the limit on what can be passed in a single kernel launch. My code looks like this:

__global__ void kernel(float *a0, float *a1, …, float *a18, float *b0, float *b1, …, float *b18, float *k1, float *k2, float *k3, float *k4, float k5)
{
    // kernel execution…
}

There are 42 float pointers and 1 float that need to be passed to the kernel, but an error occurs when compiling. Is there another way to solve this problem? Could I use a struct?

Please help with this issue, and thanks in advance for your help.

Hi,

Yes, you can certainly address this by packing all or part of your arguments into a structure allocated on the device and passing only a pointer to that structure.

In essence, you would do it like this:

struct kernel_args {
    float *a0, *a1, ..., *a18;
    float *b0    ...;
};

...

kernel_args h_args, *d_args;

// allocate the struct itself and each array on the device
cudaMalloc(&d_args, sizeof(kernel_args));
cudaMalloc(&(h_args.a0), ...);
...

// transfer the data you need to device memory
cudaMemcpy(h_args.a0, ..., cudaMemcpyHostToDevice);
...

// copy the struct of device pointers to the device, then launch
cudaMemcpy(d_args, &h_args, sizeof(kernel_args), cudaMemcpyHostToDevice);
kernel<<<grid,threads>>>(d_args);

This works, but you might also want to reconsider whether you actually need to pass that many pointers in the first place.
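For anyone who wants to try the struct-of-pointers idea end to end, here is a minimal sketch that should compile as a standalone .cu file. It is only an illustration: the struct is cut down to two pointers standing in for the 42 in the question, and the array size, the kernel body, and the helper names `ok` and `run_struct_pointer_demo` are all made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical, shortened struct: two pointers stand in for the 42 in the question.
struct kernel_args {
    float *a0;
    float *b0;
};

// The kernel receives a single pointer to the struct, which lives in device memory.
__global__ void kernel(const kernel_args *args, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        args->b0[i] = 2.0f * args->a0[i];   // illustrative computation
}

// Hypothetical helper: report a CUDA error and return false on failure.
static bool ok(cudaError_t e)
{
    if (e != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));
        return false;
    }
    return true;
}

// Returns true if the kernel produced the expected values.
bool run_struct_pointer_demo()
{
    const int n = 256;                       // illustrative size
    const size_t bytes = n * sizeof(float);
    float h_a[n], h_b[n];
    for (int i = 0; i < n; ++i) h_a[i] = static_cast<float>(i);

    kernel_args h_args = {nullptr, nullptr}; // host copy holding device pointers
    kernel_args *d_args = nullptr;           // device copy of the struct

    bool good = ok(cudaMalloc(&h_args.a0, bytes))
             && ok(cudaMalloc(&h_args.b0, bytes))
             && ok(cudaMalloc(&d_args, sizeof(kernel_args)))
             && ok(cudaMemcpy(h_args.a0, h_a, bytes, cudaMemcpyHostToDevice))
             && ok(cudaMemcpy(d_args, &h_args, sizeof(kernel_args),
                              cudaMemcpyHostToDevice));
    if (good) {
        kernel<<<(n + 127) / 128, 128>>>(d_args, n);
        good = ok(cudaGetLastError())
            && ok(cudaMemcpy(h_b, h_args.b0, bytes, cudaMemcpyDeviceToHost));
    }
    for (int i = 0; good && i < n; ++i)
        good = (h_b[i] == 2.0f * h_a[i]);

    cudaFree(h_args.a0);
    cudaFree(h_args.b0);
    cudaFree(d_args);
    return good;
}

int main()
{
    bool good = run_struct_pointer_demo();
    printf("%s\n", good ? "struct-pointer demo passed" : "struct-pointer demo failed");
    return good ? 0 : 1;
}
```

Note that `h_args` lives on the host but holds device addresses, so its members must not be dereferenced in host code; they are only handed to `cudaMemcpy` and `cudaFree`.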

Are you targeting Fermi (or newer) devices? Compile with -arch sm_20 and you should be able to pass that many arguments. The kernel parameter limit is 4 KB for sm_20 and 256 bytes for sm_1x (see D.2.4.1 in the programming guide).
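To see why the argument list from the question runs into the smaller limit, a rough size estimate can be compared against the two figures quoted here. The sketch below is an assumption-laden illustration: it ignores any alignment padding, takes the quoted limits to be 256 bytes and 4 KB, and `param_limit_for` is a hypothetical helper that only encodes those two numbers (it says nothing about limits on later architectures).

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Argument list from the question: 42 float pointers plus one float.
// Rough estimate only; alignment padding is ignored.
constexpr std::size_t kNumPointers = 42;
constexpr std::size_t kArgBytes = kNumPointers * sizeof(float *) + sizeof(float);

// Limits quoted in the thread, taken here as 256 bytes and 4 KB.
constexpr std::size_t kLimitSm1x = 256;
constexpr std::size_t kLimitSm20 = 4096;

// Hypothetical helper: limit for a given major compute capability,
// using only the two figures quoted in the thread.
std::size_t param_limit_for(int major)
{
    return major >= 2 ? kLimitSm20 : kLimitSm1x;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query CUDA device 0\n");
        return 1;
    }
    std::size_t limit = param_limit_for(prop.major);
    printf("device 0: sm_%d%d, argument bytes: %zu, limit: %zu -> %s\n",
           prop.major, prop.minor, kArgBytes, limit,
           kArgBytes <= limit ? "fits" : "too large");
    return 0;
}
```

With 8-byte pointers (a 64-bit build) the estimate is 42 × 8 + 4 = 340 bytes, which is over 256 bytes but well under 4 KB.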

If you need to support older hardware, I would recommend creating a struct of pointers and storing it in constant memory.
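A minimal sketch of the constant-memory variant, under the same assumptions as any cut-down example: the struct has two pointers instead of 42, and the names `c_args`, `kernel_const`, `ok`, and `run_constant_demo`, the array size, and the kernel body are invented for illustration. The struct is copied into a `__constant__` variable with `cudaMemcpyToSymbol`, so the kernel needs no pointer arguments at all.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical, shortened struct: two pointers stand in for the 42 in the question.
struct kernel_args {
    float *a0;
    float *b0;
};

// The struct of device pointers lives in constant memory.
__constant__ kernel_args c_args;

// The kernel reads its pointers from constant memory instead of its argument list.
__global__ void kernel_const(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c_args.b0[i] = c_args.a0[i] + 1.0f;  // illustrative computation
}

// Hypothetical helper: report a CUDA error and return false on failure.
static bool ok(cudaError_t e)
{
    if (e != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));
        return false;
    }
    return true;
}

// Returns true if the kernel produced the expected values.
bool run_constant_demo()
{
    const int n = 256;                       // illustrative size
    const size_t bytes = n * sizeof(float);
    float h_a[n], h_b[n];
    for (int i = 0; i < n; ++i) h_a[i] = static_cast<float>(i);

    kernel_args h_args = {nullptr, nullptr}; // host copy holding device pointers

    bool good = ok(cudaMalloc(&h_args.a0, bytes))
             && ok(cudaMalloc(&h_args.b0, bytes))
             && ok(cudaMemcpy(h_args.a0, h_a, bytes, cudaMemcpyHostToDevice))
             // copy the struct of pointers into the __constant__ symbol
             && ok(cudaMemcpyToSymbol(c_args, &h_args, sizeof(kernel_args)));
    if (good) {
        kernel_const<<<(n + 127) / 128, 128>>>(n);
        good = ok(cudaGetLastError())
            && ok(cudaMemcpy(h_b, h_args.b0, bytes, cudaMemcpyDeviceToHost));
    }
    for (int i = 0; good && i < n; ++i)
        good = (h_b[i] == h_a[i] + 1.0f);

    cudaFree(h_args.a0);
    cudaFree(h_args.b0);
    return good;
}

int main()
{
    bool good = run_constant_demo();
    printf("%s\n", good ? "constant-memory demo passed" : "constant-memory demo failed");
    return good ? 0 : 1;
}
```

Only the pointers themselves sit in constant memory; the arrays they point to are ordinary global memory, so the kernel can still write through them.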

Thanks for all your help; those methods really work. I will also reconsider whether there is any other way to reduce the number of variables I need.