''cudaMemcpy'' fails to copy from device memory dynamically allocated using ''malloc''

According to the Programming Guide (CUDA Toolkit Documentation), we may use malloc and free in CUDA kernels or device functions to dynamically allocate and deallocate global memory. I am wondering whether cudaMemcpy can be used to copy data from a pointer returned by in-kernel malloc to host storage. For clarity, suppose we have

__global__ void SOME_KERNEL(SOME_STRUCT* obj, ...)

It is my understanding that we can first copy the SOME_STRUCT object to the host using, say,

cudaMemcpy((SOME_STRUCT*) obj_host, (SOME_STRUCT*) obj_device, sizeof(SOME_STRUCT), cudaMemcpyDeviceToHost);

I am using the cast (SOME_STRUCT*) only to emphasize the type of the data. I understand that cudaMemcpy itself takes (void *) arguments.

Then obj_host->SOME_MEMBER_PTR is a pointer stored on the host whose value is a device address, which I assume is resolved to device memory through some look-up table. Intuitively, we should be able to use the following to copy the data stored in the dynamically allocated memory:

cudaMemcpy((TYPE_OF_MEMBER*)mem_host, (TYPE_OF_MEMBER*)obj_host->SOME_MEMBER_PTR, sizeof(TYPE_OF_MEMBER), cudaMemcpyDeviceToHost);

However, it seems that this approach does not work for memory allocated with malloc inside a kernel. By contrast, there is no problem copying data from memory allocated by cudaMalloc. Could anyone confirm my finding and explain why cudaMemcpy cannot be used to copy from memory allocated dynamically in device code?

I may be missing something, but looking here, I understand the labels on the left hand side to indicate the domains in which the functions can be called.

cudaMemcpy appears to be only usable on the host.

cudaMemcpy is indeed called on the host. The question is whether it can be used to copy data from device memory allocated dynamically inside a kernel to the host. There is no doubt that copying from device memory allocated with cudaMalloc works.

My misunderstanding. I thought you wanted to call cudaMemcpy on the device.

In the linked section of the programming guide, it clearly states that it is not possible to copy kernel malloc data directly to the host.

In addition, memory allocated by a call to malloc() or __nv_aligned_device_malloc() in device code cannot be used in any runtime or driver API calls (i.e. cudaMemcpy, cudaMemset, etc).

If you want to copy the data to the host, you first need to copy it, in a kernel, into memory allocated with cudaMalloc*, and then transfer that memory to the host.
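A minimal sketch of that staging approach, in case it helps. The struct Holder and the kernels allocate_and_fill and stage_and_free are hypothetical names made up for this illustration (they stand in for SOME_STRUCT and SOME_KERNEL from the question), and the single-thread launches are only there to keep the example short:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a runtime API call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for SOME_STRUCT from the question.
struct Holder {
    int* data;   // points into the device heap (in-kernel malloc)
    int  count;  // number of valid elements in data
};

// Allocate from the device heap inside a kernel and fill the buffer.
__global__ void allocate_and_fill(Holder* obj, int count) {
    obj->data  = static_cast<int*>(malloc(count * sizeof(int)));
    obj->count = (obj->data != nullptr) ? count : 0;
    for (int i = 0; i < obj->count; ++i) obj->data[i] = i * i;
}

// Copy the device-heap data into a cudaMalloc'd staging buffer,
// then release the device-heap allocation.
__global__ void stage_and_free(Holder* obj, int* staging) {
    for (int i = 0; i < obj->count; ++i) staging[i] = obj->data[i];
    free(obj->data);
    obj->data = nullptr;
}

int main() {
    const int count = 8;
    Holder* obj_device = nullptr;
    int*    staging    = nullptr;

    // Both of these come from cudaMalloc, so the host API may touch them.
    CHECK(cudaMalloc(&obj_device, sizeof(Holder)));
    CHECK(cudaMalloc(&staging, count * sizeof(int)));
    CHECK(cudaMemset(staging, 0, count * sizeof(int)));

    allocate_and_fill<<<1, 1>>>(obj_device, count);
    CHECK(cudaGetLastError());
    stage_and_free<<<1, 1>>>(obj_device, staging);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // The source pointer is the staging buffer, never obj->data.
    int host[count];
    CHECK(cudaMemcpy(host, staging, sizeof(host), cudaMemcpyDeviceToHost));
    for (int i = 0; i < count; ++i) std::printf("%d ", host[i]);
    std::printf("\n");

    CHECK(cudaFree(staging));
    CHECK(cudaFree(obj_device));
    return 0;
}
```

The key point is that the only pointers handed to cudaMemcpy are ones returned by cudaMalloc. The pointer returned by in-kernel malloc is only ever dereferenced in device code. In a real application you would likely size the staging buffer from a count copied back to the host first, and parallelize the staging kernel.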

I see. Thanks!