Passing more than 256 bytes of variables into a kernel


I’m trying to work out how to pass a large number of variables into a kernel.
By large I mean four 2D arrays of 10x100 each, plus around 32 other double variables.

I have seen the suggestion of using a struct, and I have tried to use one to pass the arrays in.
However, each time I try this the kernel just crashes.

Another thought: once the majority of the variables are in the kernel they will not be modified, but they cannot be declared as constants.
Is there a solution using global memory that can be filled from .csv input files?

Basically: read in the files, set the data, and store it in global memory so that all threads can access it without it having to be passed to the kernel as parameters. I could really use some help with this. Any suggestions, thoughts, or tips are greatly appreciated.

Thank you.

Simply pass one pointer as a parameter to the kernel.

The pointer points to CUDA device memory, which you allocate yourself. Use the copy functions to copy your data into that device memory.
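A minimal sketch of that suggestion, assuming the four 10x100 tables from the question are packed back to back into one device buffer. The names CHECK, sumTables and runSumTables, and the fill values, are made up for illustration and do not come from the thread.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-check macro (not the poster's CUDA_CALL).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

constexpr int ROWS = 10, COLS = 100, NUM_TABLES = 4;
constexpr int TABLE_ELEMS = ROWS * COLS;

// Each thread reads element idx of every table and writes the sum.
// The kernel receives a single pointer to all four tables.
__global__ void sumTables(const double *tables, double *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= TABLE_ELEMS) return;
    double s = 0.0;
    for (int t = 0; t < NUM_TABLES; ++t)
        s += tables[t * TABLE_ELEMS + idx];
    out[idx] = s;
}

// Packs the tables into one buffer, runs the kernel, returns out[0].
double runSumTables()
{
    std::vector<double> h_tables(NUM_TABLES * TABLE_ELEMS);
    for (int t = 0; t < NUM_TABLES; ++t)
        for (int i = 0; i < TABLE_ELEMS; ++i)
            h_tables[t * TABLE_ELEMS + i] = t + 1;  // table t holds t+1 everywhere

    double *d_tables = nullptr, *d_out = nullptr;
    size_t tableBytes = h_tables.size() * sizeof(double);
    CHECK(cudaMalloc(&d_tables, tableBytes));
    CHECK(cudaMalloc(&d_out, TABLE_ELEMS * sizeof(double)));
    CHECK(cudaMemcpy(d_tables, h_tables.data(), tableBytes,
                     cudaMemcpyHostToDevice));

    int threads = 128;
    int blocks = (TABLE_ELEMS + threads - 1) / threads;
    sumTables<<<blocks, threads>>>(d_tables, d_out);
    CHECK(cudaGetLastError());

    double first = 0.0;
    CHECK(cudaMemcpy(&first, d_out, sizeof(double), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_tables));
    CHECK(cudaFree(d_out));
    return first;
}

int main()
{
    printf("out[0] = %f\n", runSumTables());
    return 0;
}
```

In a real program the host vector would be filled from the .csv files instead of the placeholder values. The kernel's parameter list stays at two pointers no matter how much data the buffer holds.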

Thanks for the reply.

I have been passing pointers.

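double dXn = 0.25;		// delta X (change in X), host copy
double *dev_dXn;		// device copy

// dXn is a single double, so allocate and copy exactly sizeof(double);
// copying more than that from &dXn would read past the host variable.
CUDA_CALL(cudaMalloc((void**)&dev_dXn, sizeof(double)));

CUDA_CALL(cudaMemcpy(dev_dXn, &dXn, sizeof(double), cudaMemcpyHostToDevice));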

Then I pass dev_dXn into the kernel. However, I have many more variables to pass in, well over 32 doubles.

I’m looking for help with creating a struct to pass the variables, or for alternative ways to pass a large number of variables into the kernel.

You certainly could wrap things in a struct, copy it to device memory, and then pass the address of the struct to your kernel.
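A sketch of the struct approach, under assumptions: the layout of SimParams (one named scalar, 31 further scalars, four 10x100 tables) is a guess at the poster's data, and CHECK, applyParams and runApplyParams are hypothetical names. The struct holds plain arrays, not pointers, so a single cudaMemcpy moves everything.

```cuda
#include <cstdio>
#include <cstdlib>
#include <memory>
#include <cuda_runtime.h>

// Hypothetical error-check macro (not the poster's CUDA_CALL).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

constexpr int ROWS = 10, COLS = 100;

// Assumed layout: everything the kernel needs, stored by value.
struct SimParams {
    double dXn;                    // delta X, as in the thread
    double scalars[31];            // remaining scalar inputs (assumed count)
    double tables[4][ROWS][COLS];  // the four 10x100 arrays
};

// The kernel takes the address of the struct in device memory.
__global__ void applyParams(const SimParams *p, double *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= ROWS * COLS) return;
    int r = idx / COLS, c = idx % COLS;
    out[idx] = p->dXn * p->tables[0][r][c] + p->scalars[0];
}

// Fills the struct on the host, copies it once, returns out[0].
double runApplyParams()
{
    // About 32 KB, so keep it on the heap; new SimParams() zero-initializes.
    std::unique_ptr<SimParams> h(new SimParams());
    h->dXn = 0.25;
    h->scalars[0] = 1.0;
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            h->tables[0][r][c] = 2.0;

    SimParams *d_p = nullptr;
    double *d_out = nullptr;
    CHECK(cudaMalloc(&d_p, sizeof(SimParams)));
    CHECK(cudaMalloc(&d_out, ROWS * COLS * sizeof(double)));
    CHECK(cudaMemcpy(d_p, h.get(), sizeof(SimParams), cudaMemcpyHostToDevice));

    int threads = 128;
    int blocks = (ROWS * COLS + threads - 1) / threads;
    applyParams<<<blocks, threads>>>(d_p, d_out);
    CHECK(cudaGetLastError());

    double first = 0.0;
    CHECK(cudaMemcpy(&first, d_out, sizeof(double), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_p));
    CHECK(cudaFree(d_out));
    return first;
}

int main()
{
    printf("out[0] = %f\n", runApplyParams());
    return 0;
}
```

The thread does not say what caused the original crash. One pitfall worth checking in general: if the struct holds pointers instead of arrays, each pointer must itself refer to device memory, because copying the struct copies only the pointer values.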

Another way to do it is to copy the data to a __constant__ symbol, which the kernel can then read directly. Constant memory is limited to 64 KB in total, and it has one restriction: threads of a warp that read different addresses in the constant object at the same time are serialized. But there’s also an advantage: it doesn’t take up space in L1. It is cached separately, and a cached read is just as fast as L1. The constant cache of each multiprocessor is 8 KB.
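A sketch of the constant-symbol route for the scalar inputs, assuming they are stored as an array of 32 doubles. The names c_scalars, CHECK, scaleAndShift and runScaleAndShift are invented for this example. Every thread reads the same two elements, which is the access pattern constant memory handles well.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-check macro (not the poster's CUDA_CALL).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Read-only scalars, visible to every kernel in this file.
__constant__ double c_scalars[32];

// No parameter is needed for the scalars; all threads read
// c_scalars[0] and c_scalars[1], so the reads are not serialized.
__global__ void scaleAndShift(const double *in, double *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = c_scalars[0] * in[idx] + c_scalars[1];
}

// Sets the constants at run time, runs the kernel, returns out[0].
double runScaleAndShift()
{
    const int n = 1000;
    double h_scalars[32] = {0.25, 1.0};   // remaining entries are zero
    std::vector<double> h_in(n, 4.0);

    // Values are copied at run time, so they could come from a file.
    CHECK(cudaMemcpyToSymbol(c_scalars, h_scalars, sizeof(h_scalars)));

    double *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaMalloc(&d_in, n * sizeof(double)));
    CHECK(cudaMalloc(&d_out, n * sizeof(double)));
    CHECK(cudaMemcpy(d_in, h_in.data(), n * sizeof(double),
                     cudaMemcpyHostToDevice));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scaleAndShift<<<blocks, threads>>>(d_in, d_out, n);
    CHECK(cudaGetLastError());

    double first = 0.0;
    CHECK(cudaMemcpy(&first, d_out, sizeof(double), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    return first;
}

int main()
{
    printf("out[0] = %f\n", runScaleAndShift());
    return 0;
}
```

The four 10x100 tables (32,000 bytes) would also fit within the 64 KB limit, but if each thread indexes a different element, the serialization described above applies and ordinary device memory may be the better home for them.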

Actually, the compiler itself uses constant memory to pass parameters. The size limit of 256 bytes is artificial.

That is true on Fermi. On older hardware, however, kernel arguments are passed in shared memory and the size limit is very real.

Thank you hyqneuron.

I have just got my struct working (I was having issues with it, which is why I was looking for other options).
Thanks for the information, though. I’m sure it will be very useful to me further down the road.