Now we can dynamically (de)allocate memory in the device code using malloc/free. However, I cannot find how to copy that heap BACK to the host. Any hints on which API call does the trick?
cudaMemcpy? Just copy the pointer back, then use it with cudaMemcpy. I’ve never tried it, but it should work.
I understand, but what I’m really asking is how to copy the WHOLE heap back to the host. To use cudaMemcpy, I need a start address and the size of the memory block. I already know the size of the heap (it is configurable), but I’m missing the start address.
Doing a cudaMemcpy for every pointer returned by malloc would work, but it would be far too inefficient.
You’re assuming that the device heap occupies one contiguous range of addresses. This is almost certainly true, but it’s still one of those dangerous assumptions: a hack built on it can suddenly stop working when the devices or drivers change.
If you really need to transfer dynamically malloced device memory over to the host, a safe way is to do the mallocs yourself: allocate one big block up front, then use atomic increments to manually suballocate from that block as needed.
This doesn’t give you the ability to free memory (easily), but it’s quite versatile, portable, and fast. And then the single-block memcopy assumption WILL be true.
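Here is a rough sketch of that manual suballocation, in case it helps. Everything named here (Pool, pool_alloc, fill_records, run_pool_demo, the 1 MiB capacity, the 8-byte rounding) is made up for illustration, and I’m assuming every thread wants one small fixed-size record. Treat it as a starting point, not a tested allocator.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical pool: one cudaMalloc'd block plus a byte counter in device memory.
struct Pool {
    char*               base;      // start of the block
    unsigned long long  capacity;  // block size in bytes
    unsigned long long* offset;    // next free byte, bumped atomically
};

// Bump allocation: reserve `bytes` (rounded up to 8) by advancing the counter.
// Returns nullptr once the block is exhausted. There is no matching free.
__device__ void* pool_alloc(Pool p, unsigned long long bytes) {
    bytes = (bytes + 7ull) & ~7ull;
    unsigned long long old = atomicAdd(p.offset, bytes);
    if (old + bytes > p.capacity) return nullptr;
    return p.base + old;
}

// Each thread suballocates one two-int record and fills it in.
__global__ void fill_records(Pool p, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int* rec = static_cast<int*>(pool_alloc(p, 2 * sizeof(int)));
    if (rec != nullptr) { rec[0] = tid; rec[1] = tid * tid; }
}

// Runs the kernel, copies the used part of the pool back with ONE cudaMemcpy,
// and returns how many records satisfy rec[1] == rec[0] * rec[0]
// (or -1 if a CUDA call fails).
int run_pool_demo(int n) {
    Pool p{nullptr, 1ull << 20, nullptr};
    if (cudaMalloc((void**)&p.base, p.capacity) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&p.offset, sizeof(unsigned long long)) != cudaSuccess) {
        cudaFree(p.base);
        return -1;
    }
    cudaMemset(p.offset, 0, sizeof(unsigned long long));

    fill_records<<<(n + 127) / 128, 128>>>(p, n);

    // The counter tells the host how many bytes are actually in use.
    unsigned long long used = 0;
    cudaError_t err = cudaMemcpy(&used, p.offset, sizeof(used), cudaMemcpyDeviceToHost);
    if (used > p.capacity) used = p.capacity;  // failed requests still bump the counter

    std::vector<int> host(used / sizeof(int));
    if (err == cudaSuccess && used > 0)
        err = cudaMemcpy(host.data(), p.base, used, cudaMemcpyDeviceToHost);

    cudaFree(p.base);
    cudaFree(p.offset);
    if (err != cudaSuccess) return -1;

    int good = 0;
    for (size_t i = 0; i + 1 < host.size(); i += 2)
        if (host[i + 1] == host[i] * host[i]) ++good;
    return good;
}
```

Calling run_pool_demo(256) from main should report 256 valid records if everything went through. Records land in whatever order the threads win the atomicAdd, so if you need to find a particular thread’s data you have to store an index or the thread ID yourself, as the sketch does in rec[0].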
In fact you can even use this kind of manual allocation with zero-copy memory, avoiding the whole step of manually copying back to the host.
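A sketch of the zero-copy variant, under the same caveats: zc_alloc, zc_fill, and run_zero_copy_demo are hypothetical names, and the sizes are arbitrary. The block is pinned host memory mapped into the device address space with cudaHostAlloc and cudaHostGetDevicePointer. I kept the bump counter in ordinary device memory, so only those 8 bytes get copied back. I believe the runtime enables host mapping by default, but if you create the context some other way you may need to set that up yourself.

```cuda
#include <cuda_runtime.h>

// Same bump allocation as before, but `base` points at mapped host memory.
__device__ void* zc_alloc(char* base, unsigned long long capacity,
                          unsigned long long* offset, unsigned long long bytes) {
    bytes = (bytes + 7ull) & ~7ull;                     // keep 8-byte alignment
    unsigned long long old = atomicAdd(offset, bytes);  // counter lives in device memory
    if (old + bytes > capacity) return nullptr;
    return base + old;
}

// Each thread writes one two-int record straight into the mapped host block.
__global__ void zc_fill(char* base, unsigned long long capacity,
                        unsigned long long* offset, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int* rec = static_cast<int*>(zc_alloc(base, capacity, offset, 2 * sizeof(int)));
    if (rec != nullptr) { rec[0] = tid; rec[1] = tid + 1; }
}

// Returns how many records satisfy rec[1] == rec[0] + 1, or -1 on a CUDA error.
int run_zero_copy_demo(int n) {
    const unsigned long long capacity = 1ull << 20;
    int* host_block = nullptr;   // host view of the block
    char* dev_block = nullptr;   // device view of the same memory
    unsigned long long* d_offset = nullptr;

    if (cudaHostAlloc((void**)&host_block, capacity, cudaHostAllocMapped) != cudaSuccess)
        return -1;
    if (cudaHostGetDevicePointer((void**)&dev_block, host_block, 0) != cudaSuccess ||
        cudaMalloc((void**)&d_offset, sizeof(unsigned long long)) != cudaSuccess) {
        cudaFreeHost(host_block);
        return -1;
    }
    cudaMemset(d_offset, 0, sizeof(unsigned long long));

    zc_fill<<<(n + 127) / 128, 128>>>(dev_block, capacity, d_offset, n);

    // The host must wait for the kernel before reading the mapped block.
    cudaError_t err = cudaDeviceSynchronize();

    unsigned long long used = 0;
    if (err == cudaSuccess)
        err = cudaMemcpy(&used, d_offset, sizeof(used), cudaMemcpyDeviceToHost);
    if (used > capacity) used = capacity;

    int good = 0;
    if (err == cudaSuccess)
        for (unsigned long long i = 0; i + 1 < used / sizeof(int); i += 2)
            if (host_block[i + 1] == host_block[i] + 1) ++good;  // no bulk copy needed

    cudaFree(d_offset);
    cudaFreeHost(host_block);
    return err == cudaSuccess ? good : -1;
}
```

The trade-off is that every device write now crosses the bus, so this tends to suit data that is written once and read on the host rather than data the kernel keeps revisiting.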