I doubt that works “fine as expected” — you are using the incorrect size in the cudaMalloc call. As for allocating the array inside the structure, I think you will have to allocate a separate integer array, and then call a device-side function to assign its address to the pointer inside the structure.
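A minimal sketch of that second approach (the struct and kernel names here are hypothetical, not from any library): allocate the struct and the array separately on the device, then run a tiny kernel to store the array's address into the device-resident struct.

```cuda
#include <cuda_runtime.h>

// Hypothetical struct holding a pointer that must end up pointing
// at a separately allocated device array.
struct my_struct
{
    int *data;
    int size;
};

// Single-thread kernel that assigns the device array's address to
// the pointer member of the struct living in device memory.
__global__ void wire_pointer(my_struct *s, int *arr, int n)
{
    s->data = arr;
    s->size = n;
}

// Host side (paraphrased):
// my_struct *d_s;  cudaMalloc((void**)&d_s, sizeof(my_struct));
// int *d_arr;      cudaMalloc((void**)&d_arr, n * sizeof(int));
// wire_pointer<<<1,1>>>(d_s, d_arr, n);
```

The alternative, shown further down in this thread, is to do the wiring on the host and cudaMemcpy the whole struct over; both end up with the same device-side layout.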
You probably don’t need to malloc the soa per se, but you do need to malloc the individual arrays inside. I would take this approach:
struct soa
{
    // each member of the struct of arrays is an array:
    int *member0;
    int *member1;
    int *member2;
    // etc...
    int size;
};
__global__ void foo(soa arrays, int *result)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= arrays.size) return;
    // these loads will coalesce
    result[i] = arrays.member0[i] * arrays.member1[i] + arrays.member2[i];
}
// this code is paraphrased, but you get the idea
int main(void)
{
    soa struct_of_arrays;
    struct_of_arrays.size = 100000;
    // malloc each array inside the soa individually
    cudaMalloc((void**)&struct_of_arrays.member0, struct_of_arrays.size * sizeof(int));
    cudaMalloc((void**)&struct_of_arrays.member1, struct_of_arrays.size * sizeof(int));
    cudaMalloc((void**)&struct_of_arrays.member2, struct_of_arrays.size * sizeof(int));
    // etc...
    // malloc space for the result
    int *result;
    cudaMalloc((void**)&result, struct_of_arrays.size * sizeof(int));
    // launch the kernel, passing struct_of_arrays by value:
    foo<<<gridDim,blockDim>>>(struct_of_arrays, result);
}
If what you want is dynamic typing with C/C++, this would be at best very ugly, and at worst impossible. Certainly no good solution would be possible in device code. You might consider some of the CUDA Python bindings for your problem, as Python is a language with dynamic typing.
If what you want is dynamically sized arrays, then you basically need to make your SOA a dynamically-sized array of pointers.
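A sketch of what that might look like (all names here are hypothetical): the member count is decided at run time, so the struct holds a device array of device pointers, built on the host and copied over.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical run-time-sized SOA: a device array of device pointers.
struct dyn_soa
{
    int **members;    // device array holding num_members device pointers
    int num_members;  // decided at run time
    int size;         // length of each member array
};

// Host-side setup (paraphrased):
// int **h_ptrs = (int**)malloc(num_members * sizeof(int*));
// for (int m = 0; m < num_members; ++m)
//     cudaMalloc((void**)&h_ptrs[m], size * sizeof(int));
// dyn_soa s;
// cudaMalloc((void**)&s.members, num_members * sizeof(int*));
// cudaMemcpy(s.members, h_ptrs, num_members * sizeof(int*),
//            cudaMemcpyHostToDevice);

// In a kernel, member m element i is then s.members[m][i]; note the
// extra pointer chase compared to a struct with fixed members.
```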
If you know what types you’ll need at compile time, you could create a tuple of arrays.
In fact, Boost’s zip_iterator nicely encapsulates the SOA idea. Unfortunately, it’s not usable in CUDA yet.
You just have to marshal things? I’m not sure why this is hard or any different from structs in the first place? Something like
array a_host, *a_device; //a_host is only used for marshalling the cudaMemcpy, not for actual storage
soa s_host, *s_device; //same for s_host
cudaMalloc((void**)&a_host.elements, sizeof(int) * someNumberOfElements);
cudaMalloc((void**)&a_device, sizeof(array));
cudaMemcpy(a_device, &a_host, sizeof(array), cudaMemcpyHostToDevice);
cudaMalloc((void**)&s_device, sizeof(soa));
s_host.array = a_device;
cudaMemcpy(s_device, &s_host, sizeof(soa), cudaMemcpyHostToDevice);
It’s annoying, but that’s the price you pay when you want to use pointer-based structures in remote address spaces.
Thanks. I’ve tried this and I can make it work when these structs represent an output of my kernel, which is what I want :)
But I can’t make it work when I want to initialize a struct like this on the host, pass it to the device, and, for example, copy all the elements from the input to the output (although what I want to do later on is a bit more complicated).
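For the input direction the same pattern works in reverse; a minimal sketch (the names io_soa, copy_kernel, and N are hypothetical), assuming the input data is filled on the host and uploaded into the device array the struct points at before the kernel runs:

```cuda
#include <cuda_runtime.h>

// Hypothetical SOA with one input and one output member array.
struct io_soa
{
    int *in;
    int *out;
    int size;
};

// Each thread copies one input element to the output.
__global__ void copy_kernel(io_soa s)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < s.size)
        s.out[i] = s.in[i];
}

int main(void)
{
    const int N = 1024;
    int h_in[N];
    for (int i = 0; i < N; ++i) h_in[i] = i;   // host-side initialization

    io_soa s;
    s.size = N;
    cudaMalloc((void**)&s.in,  N * sizeof(int));
    cudaMalloc((void**)&s.out, N * sizeof(int));
    // upload the host data into the device array the struct points at
    cudaMemcpy(s.in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

    // struct is passed by value; it only carries device pointers
    copy_kernel<<<(N + 255) / 256, 256>>>(s);

    int h_out[N];
    cudaMemcpy(h_out, s.out, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(s.in);
    cudaFree(s.out);
    return 0;
}
```

The key point is that the struct itself never needs to live in device memory for this direction either: it travels by value in the kernel argument, and only the arrays it points to are device allocations.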