I have a number of complicated device global structures in an array that I need to initialize. The only ways I have come up with to initialize a device structure instance from host code are:
1. cudaMalloc a device structure, malloc a host structure of the same type, initialize the host structure, and cudaMemcpy the host structure to the device structure.
2. The same as 1, except the device structure instance is declared globally; cudaGetSymbolAddress() retrieves its device address and cudaMemcpy() copies host to device.
3. cudaMalloc a device structure, then call a kernel<<<1,1>>> and pass the structure pointer plus the member values as arguments, doing normal structure->member = value assignments inside the kernel. I do this purely because it lets device pointer references work properly, whereas they cannot from host code.
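To make this concrete, here is roughly what I mean by method 1, with a toy Node struct standing in for my real (much bigger) structures; all names here are made up:

```cuda
// Toy struct used in these sketches; my real structures are far larger.
struct Node {
    float weight;
    int   id;
};

void init_method1() {
    // Method 1: allocate on the device, stage on the host, copy across.
    Node *d_node;
    cudaMalloc(&d_node, sizeof(Node));

    Node h_node;          // host staging copy
    h_node.weight = 0.5f;
    h_node.id     = 42;

    cudaMemcpy(d_node, &h_node, sizeof(Node), cudaMemcpyHostToDevice);
}
```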
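Method 2 looks like this with the same toy struct (again, names are placeholders):

```cuda
struct Node {
    float weight;
    int   id;
};

// Method 2: globally declared device instance.
__device__ Node g_node;

void init_method2() {
    Node h_node = {0.5f, 42};   // host staging copy

    // Retrieve the device address of the global symbol, then copy.
    // As noted below, this only seems to work on the top-level symbol.
    void *d_addr;
    cudaGetSymbolAddress(&d_addr, g_node);
    cudaMemcpy(d_addr, &h_node, sizeof(Node), cudaMemcpyHostToDevice);
}
```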
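And method 3, which is the one I actually rely on, looks roughly like this (toy struct, made-up names):

```cuda
struct Node {
    float weight;
    int   id;
    Node *next;   // device pointer member
};

// Method 3: a one-thread kernel does the member assignments, so
// device pointer values can be stored directly.
__global__ void init_node(Node *n, float weight, int id, Node *next) {
    n->weight = weight;
    n->id     = id;
    n->next   = next;   // device pointer assignment works here
}

void init_method3() {
    Node *d_a, *d_b;
    cudaMalloc(&d_a, sizeof(Node));
    cudaMalloc(&d_b, sizeof(Node));

    // The serial <<<1,1>>> launch I would like to avoid.
    init_node<<<1,1>>>(d_a, 0.5f, 1, d_b);
}
```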
Method 3 is the most general and has worked well in every case I’ve had, but I don’t like launching the kernel serially with a <<<1,1>>> grid and block size. It works fine for initializing my complicated neural net, but it causes other problems, particularly with CudaProf trying to profile a million of these kernel calls made in serial.
It seems that cudaGetSymbolAddress() only works on the top-level structure object, i.e., it does not appear to work on a member of a device structure, such as structure->member.
All of this gets more complicated when the device structure has pointer members. I know pointers are not ideal in CUDA, but sometimes they are needed. So part of the problem is that only device code can perform pointer-member assignments within a structure.
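To show what I mean about pointer members, here is a minimal sketch with the same toy struct (names made up):

```cuda
struct Node {
    float weight;
    Node *next;
};

// This works: the assignment runs on the device, where d_a is valid.
__global__ void link(Node *a, Node *b) {
    a->next = b;
}

void link_nodes() {
    Node *d_a, *d_b;
    cudaMalloc(&d_a, sizeof(Node));
    cudaMalloc(&d_b, sizeof(Node));

    // NOT possible from host code: d_a is a device address, so
    // dereferencing it on the host is invalid.
    // d_a->next = d_b;

    // So I end up doing it via a serial kernel launch instead.
    link<<<1,1>>>(d_a, d_b);
}
```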
Is there a better way? Is there a way to obtain the symbol address of an arbitrary member at any level of a structure tree? Or do I just have this all messed up, which is possible? I know some of you will tell me I shouldn’t be using pointers in CUDA parallel code because of kernel divergence and non-coalesced memory loads, but for the time being I need those pointers. As I said, with method 3 the pointers work correctly, if not optimally.