How would I correctly allocate memory to `arrayOfThings` at runtime, given two arguments:
int arrLen - size of `arrayOfThings`
int maxStrLen - maximum length of the `str` member of struct `Thing`
I assume I would use cudaMalloc for `arrayOfThings`:
Allocating pointer-based data structures on the GPU is an exercise in not-very-fun coding. The outline is something like:
allocate an array of Things on the host for marshalling to the GPU
fill in the other two values based on whatever structures you have on the CPU
allocate each string on the device
set the pointer str in each Thing to the respective string you allocated on the device
allocate an array of Things on the device and copy the host marshalling array into it
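The steps above can be sketched roughly like this (a minimal sketch, not your actual code: the `Thing` members other than `str`, the helper name, and the source-string array are all assumptions):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

struct Thing {
    char *str;
    int   a, b;   // stand-ins for the "other two values"
};

Thing *makeDeviceThings(char **hostStrings, int arrLen)
{
    // 1. Allocate a host-side staging array for marshalling
    Thing *staging = (Thing *)malloc(arrLen * sizeof(Thing));

    for (int i = 0; i < arrLen; ++i) {
        // 2. Fill in the non-pointer members from your CPU structures
        staging[i].a = 0;
        staging[i].b = 0;

        // 3. Allocate each string on the device and copy it over
        size_t len = strlen(hostStrings[i]) + 1;
        char *dStr;
        cudaMalloc(&dStr, len);
        cudaMemcpy(dStr, hostStrings[i], len, cudaMemcpyHostToDevice);

        // 4. Point str at the device copy (a device pointer stored on the host)
        staging[i].str = dStr;
    }

    // 5. Allocate the array of Things on the device, then copy the staging
    //    array -- complete with its device pointers -- into it
    Thing *dThings;
    cudaMalloc(&dThings, arrLen * sizeof(Thing));
    cudaMemcpy(dThings, staging, arrLen * sizeof(Thing),
               cudaMemcpyHostToDevice);

    free(staging);
    return dThings;
}
```

Note that the staging array holds device pointers the host must never dereference; it exists purely so one cudaMemcpy can ship the whole array of structs across.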
If you have a maximum length for your string (and it isn't horribly inefficient -- 256 bytes when 99% of strings are 8 bytes or shorter, for example), just using a fixed-length array is a much easier alternative. It's one cudaMalloc instead of arrLen + 1 of them.
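With a fixed maximum length, the whole thing collapses to a single allocation and copy. A sketch, assuming a bound of 64 bytes standing in for your maxStrLen (the struct members and function name are illustrative):

```cuda
#include <cuda_runtime.h>

#define MAX_STR_LEN 64   // assumed bound; substitute your maxStrLen

struct FixedThing {
    char str[MAX_STR_LEN];  // inline storage instead of a char* member
    int  a, b;              // stand-ins for the other members
};

// One cudaMalloc + one cudaMemcpy replaces the arrLen + 1 allocations
// needed in the pointer-based version
FixedThing *copyThingsToDevice(const FixedThing *hostThings, int arrLen)
{
    FixedThing *dThings;
    cudaMalloc(&dThings, arrLen * sizeof(FixedThing));
    cudaMemcpy(dThings, hostThings, arrLen * sizeof(FixedThing),
               cudaMemcpyHostToDevice);
    return dThings;
}
```

Because the struct no longer contains pointers, the host array is bit-for-bit valid on the device, so no per-element marshalling is needed at all.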
Doing a two-dimensional array is pretty much the same thing: lots of host-side marshalling. A one-dimensional array that you index into via your grid and block dimensions is much nicer (both for perf and for readability).
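The flat-array approach looks something like this (a sketch; the kernel name and element type are made up for illustration):

```cuda
#include <cuda_runtime.h>

// A flat 1-D allocation standing in for a width x height 2-D array.
// Each thread computes its (row, col) from the grid/block dimensions
// and indexes with row * width + col -- no pointer-to-pointer chasing.
__global__ void doubleAll(float *data, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        data[row * width + col] *= 2.0f;
}
```

A launch would then use a 2-D grid over a single cudaMalloc'd buffer, e.g. `doubleAll<<<dim3((width+15)/16, (height+15)/16), dim3(16, 16)>>>(dData, width, height);`.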
However, there’s one thing I’m not sure of, now that I think about it. What would happen if you stored the array of pointers to arrays of Things in constant memory but stored the Things themselves in global memory? Has anyone tried something like this? I’m curious as to what the perf would be like.