Is there an efficient memory copy function callable from a __device__ function?

Environment

I'm using CUDA 10.1 with VS2017 and CMake 3.17.0.

Function call

I call cudaMemcpyAsync() or cudaMemcpy() inside a __device__ function.

Error thrown

The compiler always reports this error:

error: calling a __host__ function("cudaMemcpyAsync") from a __device__ function("Init") is not allowed

Help required

I need to copy a buffer of 0–4 KB inside a device function, and the buffer is generated by a previous step, so the copy must be synchronized with that result. I want to copy the memory as fast as possible while avoiding hazards

such as

  1. read after write
  2. write after write.

Here is a related link: Is there an equivalent to memcpy() that works inside a CUDA kernel? - Stack Overflow

Hi 15618561709

Since __device__ functions execute on the GPU, you cannot call either of the CUDA memory-transfer APIs (cudaMemcpyAsync() or cudaMemcpy()) from them in older CUDA implementations. These functions can only be invoked from host code.

As for the question itself, the GPU is fairly primitive in this respect: it offers only minimal instructions for bulk copies. There are two possible approaches:

  1. Copy element-wise inside the kernel, distributing the work across all threads (synchronization can be guaranteed within a single block). A flag acting as a simple mutex can also be useful.
  2. Split your kernel into two smaller kernels and perform the copy between launches with the host's help, or avoid the copy entirely by swapping pointers.
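The first approach can be sketched as follows. This is a minimal illustration, not code from your project: the names `block_copy` and `producer_then_copy` are made up, and it assumes the producer and consumer of the buffer live in the same block so that `__syncthreads()` is enough to order the writes before the reads.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Every thread of the block copies a strided subset of the buffer.
__device__ void block_copy(char *dst, const char *src, size_t n)
{
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        dst[i] = src[i];
    __syncthreads();  // copy complete and visible to the whole block
}

__global__ void producer_then_copy(char *dst, char *src, size_t n)
{
    // ... previous step writes src here ...
    __syncthreads();          // avoid read-after-write: src fully written
    block_copy(dst, src, n);  // now safe to copy cooperatively
    // dst may now be read by any thread of this block
}
```

For a 0–4 KB buffer this kind of cooperative, coalesced per-thread copy is usually close to the fastest option available inside a kernel; across blocks you would instead need separate kernel launches, since `__syncthreads()` only orders threads within one block.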

In newer versions of CUDA, you can use dynamic parallelism, whose device-side runtime does allow those calls (e.g. cudaMemcpyAsync() from device code). You might be interested in: Programming Guide :: CUDA Toolkit Documentation
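A hedged sketch of the dynamic-parallelism route, assuming a device of compute capability 3.5 or higher and compilation with relocatable device code (`-rdc=true`, linked against `cudadevrt`); the kernel name `parent` is illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

__global__ void parent(char *dst, const char *src, size_t n)
{
    // Let a single thread issue the device-side copy. From device code
    // the device runtime only supports device-to-device transfers
    // (cudaMemcpyDeviceToDevice or cudaMemcpyDefault).
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        cudaMemcpyAsync(dst, src, n, cudaMemcpyDeviceToDevice);
        cudaDeviceSynchronize();  // device-side wait for the copy to finish
    }
}
```

Note that the device-side copy still runs asynchronously with respect to the launching kernel, so some form of synchronization (as above) is needed before other threads read `dst`; also be aware that device-side `cudaDeviceSynchronize()` was deprecated in much later CUDA releases, though it is available in the CUDA 10.1 you are using.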

Regards,
Leon.