Parameters passed to a CUDA kernel exceed 256 bytes.

Hey,

I am doing a molecular dynamics simulation with CUDA. For this task, I need to pass parameters to the kernel that exceed 256 bytes, the size CUDA allows for the kernel parameter list. Can anyone please suggest what I can do to overcome this problem?

Thanks

pack your arguments in a struct and pass a pointer to the struct

In other words, store your larger parameters in constant, global, or texture memory.

I’ve run into this limitation on almost all of my kernels to date… I tend to store matrices and the like in constant memory now, instead of smem (the formal parameter list).
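For example, a rough sketch of that pattern (the matrix size and the names c_transform/hostMatrix are just placeholders):

__constant__ float c_transform[ 16 ];   // e.g. a 4x4 matrix kept in constant memory

// Host side, before the kernel launch:
float hostMatrix[ 16 ];                 // fill with your data...
cudaMemcpyToSymbol( c_transform, hostMatrix, sizeof( hostMatrix ) );

// Device side: kernels simply read c_transform[ i ] - nothing goes through the parameter list.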

Yes, I was thinking along the same lines, but I could not figure out how to implement it. I have a structure on the host. Now, how can I define a structure on the device and pass it? I mean, should I copy all the elements of the device structure using cudaMemcpy, or copy the structure as a whole?

Could you please elaborate a bit? This would solve my problem.

Thanks.

Help Guys. Please.

The following should be ok:

struct MyStruct
{
    float *pData;
    // ... any number of arrays that you'd like....
};

MyStruct hostStruct;
MyStruct *deviceStruct;

int iSize = 100 * sizeof( float );
float *pInputData = new float[ 100 ];  // and fill it with data...

// Allocate the array on the device and copy the input data into it.
cudaMalloc( ( void ** )&( hostStruct.pData ), iSize );
cudaMemcpy( hostStruct.pData, pInputData, iSize, cudaMemcpyHostToDevice );

// Now copy the host structure into the device structure...
cudaMalloc( ( void ** )&( deviceStruct ), sizeof( MyStruct ) );
cudaMemcpy( deviceStruct, hostStruct, sizeof( MyStruct ), cudaMemcpyHostToDevice );

myKernel<<< ... >>>( deviceStruct, ... );
...

In your kernel you use it like:

deviceStruct->pData
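For example, a minimal kernel sketch (the kernel name, indexing, and the element count of 100 are placeholders matching the allocation above):

__global__ void myKernel( MyStruct *deviceStruct )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < 100 )
        deviceStruct->pData[ i ] *= 2.0f;   // dereference the device pointer stored in the struct
}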

eyal

It would probably be better to have this struct in constant memory (copy it to device using a copy to symbol function).
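A rough sketch of that variant, reusing MyStruct/hostStruct from above (the symbol name c_params is just a placeholder):

__constant__ MyStruct c_params;   // the struct itself lives in constant memory

// hostStruct.pData already holds a device pointer (from cudaMalloc),
// so copying the struct to the symbol is all that is needed - no cudaMalloc for the struct.
cudaMemcpyToSymbol( c_params, &hostStruct, sizeof( MyStruct ) );

// Kernels then read c_params.pData directly and the struct no longer has to be passed as a parameter.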

By the way, would passing a host struct as a parameter work? That is if I had
myKernel<<< … >>>( hostStruct, … );
would the hostStruct be copied to the smem buffer for parameters?

I think I once tried it and the kernel crashed…

eyal

cudaMalloc( ( void ** )&( hostStruct.pData ), iSize);

cudaMemcpy( hostStruct.pData, pInputData, iSize, cudaMemcpyHostToDevice );

Should this not be cudaMalloc( ( void ** )&( deviceStruct->pData ), iSize );

and cudaMemcpy( deviceStruct->pData, pInputData, iSize, cudaMemcpyHostToDevice ); ?

I do not understand why we need two structures. Even for hostStruct we use cudaMalloc to allocate memory for pData, whereas we are not initializing/allocating pData for deviceStruct.

This might be a very naive question but it would help me understand it much better.

Thanks.

I understand it now. What I was thinking is that we could just pass the address of the structure hostStruct, but now I understand that it would be in the CPU address space. Thanks a lot. I really appreciate it.


cudaMemcpy( deviceStruct, hostStruct, sizeof( MyStruct ), cudaMemcpyHostToDevice );

Should this be: cudaMemcpy( deviceStruct, &hostStruct, sizeof( MyStruct ), cudaMemcpyHostToDevice ); or something else?

When I am passing a pointer to the structure, I am getting an error with the kernel launch.
How can I solve it?
Please help.

Try the following (works fine for my kernels):

// Allocate on the CPU RAM.
GGPUGenericSearchParams *pHostGenericSearchParams = new GGPUGenericSearchParams();

// Allocate the structure's pointers on the device !!
CUDA_SAFE_CALL( cudaMalloc( ( void ** )&( pHostGenericSearchParams->m_p1 ), iSize ) );
CUDA_SAFE_CALL( cudaMalloc( ( void ** )&( pHostGenericSearchParams->m_p2 ), iSize ) );

// Allocate the structure on the device memory and copy the host's contents into the device structure.
// Since those are valid device pointers the copy should be valid.
GGPUGenericSearchParams *pDeviceGenericParams;
CUDA_SAFE_CALL( cudaMalloc( ( void ** )&( pDeviceGenericParams ), sizeof( GGPUGenericSearchParams ) ) );
CUDA_SAFE_CALL( cudaMemcpy( pDeviceGenericParams, pHostGenericSearchParams, sizeof( GGPUGenericSearchParams ), cudaMemcpyHostToDevice ) );

// Call the kernel with the pDeviceGenericParams pointer....
myKernel<<< .... >>>( pDeviceGenericParams );

// In the kernel you can use it like this: pDeviceGenericParams->m_p1 ....

Hope it helps

eyal

And do I declare GGPUGenericSearchParams in the same way as MyStruct was declared previously? Do I declare m_p1 and m_p2 there?