Passing structs to a kernel

I have a struct defined on the host, and I have written a couple of functions to malloc and copy across all the arrays I would like to crunch on the GPU.

typedef struct {
    double* c1x;
    /* ...more device array pointers... */
} DEVICE;

void allocate_device_arrays(DEVICE* device, PRE_FACS* prefacs){
    CUDA_SAFE_CALL(cudaMalloc((void **)&device->c1x, size_c1s));
    /* ...one cudaMalloc per array... */
}

void memcopy_device_arrays(DEVICE* device, PRE_FACS* prefacs){
    CUDA_SAFE_CALL(cudaMemcpy(device->c1x, prefacs->c1x, size_c1s, cudaMemcpyHostToDevice));
    /* ...one cudaMemcpy per array... */
}


In my main I initialize this struct,

DEVICE device;

call the above two functions to do all my cudaMallocs and cudaMemcpys, and then call my kernel.

This works in device emulation (deviceemu) mode because I am allowed to pass pointers to host memory there. As expected, it bombs when I try to run it in device mode. My problem is that I am unsure what modifications I need to run this in device mode. Could I just define an additional struct:

DEVICE* devicerun;
cudaMalloc((void **)&devicerun, sizeof(DEVICE));
cudaMemcpy(devicerun, &device, sizeof(DEVICE), cudaMemcpyHostToDevice);

and then call the kernel using
integrals_2e_kernel<<<>>>(devicerun); ? or is that way off the mark?
Thanks very much.

You can simply call:

integrals_2e_kernel<<<>>>(device);
provided, of course, that integrals_2e_kernel is declared as

__global__ void integrals_2e_kernel(DEVICE dev)

i.e. your kernel takes the struct as an argument instead of a pointer to that structure.

The structure will be copied to the shared memory of every block. I guess you are going to use those pointers in the struct often, so shared memory seems a good place for that struct anyway.
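For concreteness, a minimal sketch of this pass-by-value approach (the doubling operation is an illustrative assumption, only the c1x field is shown, and bounds checking is omitted):

```cuda
// The struct itself is copied to the device at launch time; the pointers
// inside it must already point to device memory (set earlier by cudaMalloc).
__global__ void integrals_2e_kernel(DEVICE dev)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // assumes the grid exactly covers the arrays (no bounds check)
    dev.c1x[i] *= 2.0;  // dereferencing a device pointer held in the struct
}

// Host side: note the struct is passed by value, not by address.
//   DEVICE device;
//   allocate_device_arrays(&device, &prefacs);
//   memcopy_device_arrays(&device, &prefacs);
//   integrals_2e_kernel<<<numBlocks, threadsPerBlock>>>(device);
```

The key point is that `device` itself may live in host memory: it is copied by value into the kernel's argument area, so only the pointers it contains need to be device pointers.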

If you don’t like using shared memory, you can also try creating DEVICE in constant memory.

The first approach doesn’t work because device is in host memory; you’re passing its address to the kernel, so it really should crash. The second approach works because devicerun is in device memory. I do it this way, and I think it’s the right way to go. You can hide the workaround by creating a local object inside a function and then copying it to the device, something like:

struct Thing
{
	float* stuff;
};

void InitializeThing(Thing** thing, float* stuff, int size)
{
	Thing t; //local thing

	//allocate mem on the device, store the pointer in t.stuff, and copy data to it
	cutilSafeCall(cudaMalloc((void**)&t.stuff, size*sizeof(float)));
	cutilSafeCall(cudaMemcpy(t.stuff, stuff, size*sizeof(float), cudaMemcpyHostToDevice));

	//now alloc mem on the device and copy t to *thing
	cutilSafeCall(cudaMalloc((void**)thing, sizeof(Thing)));
	cutilSafeCall(cudaMemcpy(*thing, &t, sizeof(Thing), cudaMemcpyHostToDevice));

	//*thing is now in device mem, and its stuff pointer too
}

Great! Thanks to both of you. I’ve tried it both ways. Could you please explain your comment a little more:

“If you don’t like using shared memory, you can also try creating DEVICE in constant memory”

I actually would prefer to have these arrays loaded into global memory and then load the individual elements of the arrays into shared memory in the kernel, rather than transfer these arrays onto each block straight from the host (I think shared memory space is too valuable).

If I use the second method and declare

const DEVICE devicerun

then pass it to the kernel

integrals_2e_kernel<<<>>> (devicerun)

will the compiler automatically know to place all those arrays into global memory, as opposed to local, texture, or constant?
Thanks very much again.

Well, unless your struct is very big, I wouldn’t care that much about losing a few bytes of your shared memory. After all, shared memory is there to be used, not to be avoided at any cost :)

Anyway, if you want to try constant memory, the way to do it is (I believe):

__constant__ DEVICE constDev;

int main() {
    [...]
    cudaMemcpyToSymbol(constDev, &device, sizeof(DEVICE));
    [...]
}

Note, you pass the variable constDev to cudaMemcpyToSymbol, not a pointer to it!

constDev is a device variable visible from all kernels, and it exists until your application ends. So, for example, if you launch several kernels and your constDev values do not change, you do not have to set them again and again from the host.

The first access to constDev from your kernel will be as slow as accessing global memory, but then that data will be cached on the chip, so subsequent accesses will be much faster.
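Putting this together, a sketch of the constant-memory variant (the kernel body and launch configuration are illustrative assumptions; only the c1x field is shown):

```cuda
__constant__ DEVICE constDev;

// Kernels read constDev directly; no struct argument is needed.
// The pointers inside it still refer to arrays in global memory --
// only the struct of pointers lives in constant memory.
__global__ void integrals_2e_kernel(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        constDev.c1x[i] *= 2.0;  // pointer is cached after the first access
}

// Host side: set the symbol once, then launch as many kernels as you
// like without re-copying the struct.
//   cudaMemcpyToSymbol(constDev, &device, sizeof(DEVICE));
//   integrals_2e_kernel<<<numBlocks, threadsPerBlock>>>(n);
//   integrals_2e_kernel<<<numBlocks, threadsPerBlock>>>(n);  // no re-copy
```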