Device/Host Address Tricks

I have a number of complicated device global structures in an array that I need to initialize. The only ways I have come up with to initialize a device structure instance from host code are:

  1. cudaMalloc a device structure, malloc a host structure of the same type, initialize the host structure, and cudaMemcpy the host structure to the device structure.
  2. The same as 1, except the device structure instance is declared globally; cudaGetSymbolAddress() provides its address and cudaMemcpy() copies host to device.
  3. cudaMalloc a device structure, then call a kernel<<<1,1>>> and pass the structure pointer plus the member values as arguments, and use normal structure->member = value assignments. This is done purely because it lets device pointer references work properly, which they cannot from host code (rough sketches of 2 and 3 below).
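
To make 2 and 3 concrete, here is roughly what I mean (MyStruct, initKernel and the member names are just placeholders, not my actual code):

    struct MyStruct { int a; float b; float *data; };   // placeholder structure

    __device__ MyStruct d_instance;                      // file-scope device instance (method 2)

    // Method 2: fill in a host copy, then copy it over the device symbol.
    void initViaSymbol()
    {
        MyStruct h;
        h.a = 1; h.b = 2.0f; h.data = 0;                 // pointer members can't be set meaningfully from the host
        void *addr;
        cudaGetSymbolAddress(&addr, d_instance);
        cudaMemcpy(addr, &h, sizeof(MyStruct), cudaMemcpyHostToDevice);
    }

    // Method 3: a trivial kernel does the member assignments on the device,
    // so device pointer members can be assigned directly.
    __global__ void initKernel(MyStruct *s, int a, float b, float *data)
    {
        s->a = a; s->b = b; s->data = data;
    }
    // called as initKernel<<<1,1>>>(devPtr, 1, 2.0f, devData);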

Method 3 is the most general approach and works well in all the cases I've had, but I don't like calling the kernel serially with a <<<1,1>>> grid and block size. It works fine for initializing my complicated neural net, but it causes other problems, particularly with CudaProf trying to profile these million-odd kernel calls run serially.

It seems that cudaGetSymbolAddress() only works for the top-level structure object, i.e., it doesn't seem to work to call cudaGetSymbolAddress on a member of a device structure, such as structure->member.

All of this is more complicated when the device structure has pointer members. I know pointers are not good in CUDA, but sometimes they are needed. So part of the problem is that only device code can assign values to pointer members through a structure pointer.

Is there a better way? Is there a way to obtain the symbol address of an arbitrary member at any level of a structure tree? Or do I just have this all messed up, which is possible. I know some of you will tell me that I shouldn't be using pointers in CUDA parallel code due to kernel divergence and non-coalesced memory loads, but for the time being I need those pointers. As I have said, in my implemented method 3 the pointers work correctly, if not optimally.

Thanks,
Ken Chaffin

Can you allocate your nodes in a flat array, with indices rather than pointers? At that point, you could call a single kernel to convert the indices to pointers for the entire array if you wanted.
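
Something along these lines, roughly (Neuron, Synapse and the member names are just placeholders for whatever your structures actually look like):

    struct Neuron  { float activation; };
    struct Synapse { int targetIndex; Neuron *target; };   // index assigned on the host, pointer filled in on the device

    // One thread per synapse: convert the host-assigned index into a device pointer.
    __global__ void fixupPointers(Synapse *synapses, Neuron *neurons, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            synapses[i].target = &neurons[synapses[i].targetIndex];
    }
    // launched once for the whole array, e.g. fixupPointers<<<(n + 255) / 256, 256>>>(d_syn, d_neurons, n);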

That's a good suggestion. I'll take a look at this. These pointers are for synapses that connect neurons to neurons. Of course, the question remains of the best way to initialize the members of large device arrays of structures.

Thanks,

Ken

I did convert to using a single 1D array of synapses rather than my own device memory allocation method. It is a bit cleaner than what I had before, but I did run into the ~1.7GB maximum allocation on the Tesla board when I increased my synaptic connectivity to about 400/neuron in my 320x240, 3-layer neural net.

But I would still like to hear from others about the best way to initialize members of complicated global device data arrays. Using a trivial kernel<<<1,1>>> works, but I don't like doing it and it messes up CudaProf.

Ken

Perhaps someone can enlighten me on this. Basically what I want to do is declare a device variable at file scope level:

__device__ int devicevar;

Then I want to dynamically allocate a device variable in host code:

int *var;

cudaMalloc((void**)&var, sizeof(int));

Now, at this point, my understanding is that the memory var points to is allocated in device space, even though var is not qualified as __device__, right?

So say I want to assign var a value, I should be able to do:

int value=5;

cudaMemcpy((void*)var, (void*)&value, sizeof(int), cudaMemcpyHostToDevice);

right?

And then I want to copy var to devicevar, so this is a device to device copy but I need the devicevar address so I get it via:

int *symboladdress;

cudaGetSymbolAddress((void**)&symboladdress, devicevar);

right?

And lastly I copy data from the dynamically allocated device memory to the file scope device memory via:

cudaMemcpy((void*)symboladdress, (void*)var, sizeof(int), cudaMemcpyDeviceToDevice);

right?
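
Putting those steps together, this is the whole sequence I have in mind (simplified from my real code):

    __device__ int devicevar;                     // file-scope device variable

    void example()
    {
        int *var;
        cudaMalloc((void**)&var, sizeof(int));    // var now holds a device address

        int value = 5;
        cudaMemcpy(var, &value, sizeof(int), cudaMemcpyHostToDevice);

        int *symboladdress;
        cudaGetSymbolAddress((void**)&symboladdress, devicevar);

        // device-to-device copy from the dynamic allocation to the file-scope variable
        cudaMemcpy(symboladdress, var, sizeof(int), cudaMemcpyDeviceToDevice);
    }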

I know this seems a bit convoluted, but it is a way to associate dynamically allocated device synapse structures with file-scope device neuron structures, even though that is not quite what my example does. It gets a bit more complicated because I want to copy the address of the dynamically allocated device variable into a structure member of the file-scope device variable.

So, my last question is whether I can copy the address of var to a file-scope device variable via cudaMemcpy(…, cudaMemcpyDeviceToDevice), using the address returned by cudaMalloc() for the dynamic variable and the address returned by cudaGetSymbolAddress() for the file-scope device variable.

When I do this last step, I get an “invalid argument” return, which leads me to think that my pointer nomenclature is wrong. I'm wondering whether any of the other variables I'm working with needs to be declared __device__ besides my file-scope __device__ variable, or what else might be wrong.
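
Written out in isolation, that last step would look something like this (deviceptr stands in for the pointer member of my real structure); I may simply have the copy kind wrong, since var itself is a host variable:

    __device__ int *deviceptr;                    // file-scope device pointer (stand-in for my structure member)

    void setDevicePointer()
    {
        int *var;
        cudaMalloc((void**)&var, sizeof(int));    // the allocation is on the device, but var itself is a host variable

        void *symboladdress;
        cudaGetSymbolAddress(&symboladdress, deviceptr);

        // the pointer value being copied (var) lives in host memory, so this copy is host-to-device
        cudaMemcpy(symboladdress, &var, sizeof(int*), cudaMemcpyHostToDevice);
    }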

Thanks to anyone willing to wade through this. And I hope I haven't mistyped anything here, as I simplified it from my complicated arrays of structures of arrays.

Ken Chaffin

I think the problem I am having is that I'm trying to do a no-no. If I have a device structure pointer, returned either by cudaGetSymbolAddress() or cudaMalloc(), I probably should never do something like structurepointer->member in host code. If that is the case, then I'm beginning to think there is no way to initialize or assign to pointer-based structures (that themselves have pointer members) other than via trivial kernel<<<1,1>>>(structureptr) calls. If anyone knows a way to do this otherwise, I sure would like to know.

It sure would be nice to be able to call a non-kernel device function from host code in serial mode.

Ken Chaffin

That is the core problem. You can't dereference device pointers in host code. No matter what clever ideas you come up with, you cannot circumvent that basic fact. For structures of arrays, it is possible to get away without an initialization kernel on the device side, but for arrays of structures, trees, linked lists, etc., you can't, because the basic problem remains - once you have a second level of pointer indirection, you have to dereference a device pointer, and that cannot be done in the host address space.

As unpalatable as it might seem, flat memory structures with indexing and/or pointer arithmetic are preferable. You lose a degree of elegance - your CUDA C code might wind up looking more like Fortran 77 - but it is ultimately more manageable. It is often better to think of the CUDA paradigm as distributed memory, because the device and host memories have their own address spaces and are basically incoherent at all times. As with MPI, passing and sharing structures that rely on pointers is basically verboten.
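
By way of illustration only (FlatSynapse and its members are made up, not a prescription for your net):

    // Flat, index-based layout: everything can be filled in on the host and moved
    // to the device with a single cudaMemcpy, because nothing embeds an address.
    struct FlatSynapse {
        int   next;       // index of the next synapse in the chain, or -1 to terminate
        int   target;     // index of the target neuron
        float weight;
    };

    void uploadSynapses(const FlatSynapse *h_syn, int n, FlatSynapse **d_syn)
    {
        cudaMalloc((void**)d_syn, n * sizeof(FlatSynapse));
        cudaMemcpy(*d_syn, h_syn, n * sizeof(FlatSynapse), cudaMemcpyHostToDevice);
    }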

Thanks for the reply. Although I understand what you are saying, pointers are not verboten inside device code. If I allocate a device array in host code, I can take the address of an element, &array[index], pass it to a device kernel, and do whatever I want with that pointer in device code. It works great, if perhaps not optimally compared to a flat indexed array representation. I form linked pointer lists of synapse structures at the neuron level and am able to traverse the lists successfully, etc.

My problems arise when I try to avoid using the trivial kernel<<<1,1>>>(devicepointer) to initialize the members of the structure that the pointer points to. The trivial kernel initialization works great, but it is slow and messes up CudaProf as it tries to profile several million trivial kernel calls.

I will probably have to give up on finding an alternative to the trivial kernel method. But I might yet come up with a way to do the initialization with a non-trivial kernel, in parallel.
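
If I do manage that, I imagine it would look roughly like this, with one thread per synapse instead of one <<<1,1>>> launch per synapse (names are placeholders again):

    struct Neuron  { float activation; };
    struct Synapse { float weight; Neuron *target; };

    // Parallel initialization: each thread initializes one synapse, including its pointer member.
    __global__ void initSynapses(Synapse *synapses, Neuron *neurons, const int *targetIndex,
                                 float initialWeight, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            synapses[i].weight = initialWeight;
            synapses[i].target = &neurons[targetIndex[i]];
        }
    }
    // one launch for all synapses: initSynapses<<<(n + 255) / 256, 256>>>(d_syn, d_neurons, d_idx, 0.0f, n);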

Ken

I wasn't suggesting that they are. I am only making the observation that, as in other forms of distributed memory programming, it is not recommended, and usually not possible, to pass or manipulate data and code whose functionality relies on absolute addresses or pointer indirection from one memory space to another.