How to memcpy from one static global device array to another static global device array

Assume there are two static global device arrays, as below.
__device__ int gnArray1[10000];
__device__ int gnArray2[10000];
How can I copy the contents of one to the other, on the host and on the device?

  • On the host, must I get the address of at least one of them?
    Is there any function or trick that solves this on the host without getting an address?

  • On the device, is it possible to use memcpy(gnArray1, gnArray2, sizeof(int) * 10000);?
    In that case, are gnArray1 and gnArray2 treated as addresses rather than symbols?

Please help me clarify these two points.
Thanks in advance.

On the host, I would suggest taking the address of both symbols using cudaGetSymbolAddress, and then calling ordinary cudaMemcpy with cudaMemcpyDeviceToDevice.
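A minimal sketch of that host-side approach, assuming the two arrays from the question. The CHECK macro, the host buffers h_src and h_dst, and the test data are illustrative additions, not part of the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 10000

__device__ int gnArray1[N];
__device__ int gnArray2[N];

// Hypothetical helper: report any CUDA runtime error and leave main().
#define CHECK(call)                                                        \
  do {                                                                     \
    cudaError_t err_ = (call);                                             \
    if (err_ != cudaSuccess) {                                             \
      fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(err_)); \
      return 1;                                                            \
    }                                                                      \
  } while (0)

int main() {
  static int h_src[N], h_dst[N];
  for (int i = 0; i < N; ++i) h_src[i] = i;

  // Fill gnArray2 from the host; the symbol itself is passed here.
  CHECK(cudaMemcpyToSymbol(gnArray2, h_src, sizeof(h_src)));

  // Resolve both symbols to ordinary device pointers.
  int *d_dst = nullptr, *d_src = nullptr;
  CHECK(cudaGetSymbolAddress((void **)&d_dst, gnArray1));
  CHECK(cudaGetSymbolAddress((void **)&d_src, gnArray2));

  // Device-to-device copy: gnArray2 -> gnArray1.
  CHECK(cudaMemcpy(d_dst, d_src, N * sizeof(int), cudaMemcpyDeviceToDevice));

  // Read gnArray1 back through its symbol and verify the copy.
  CHECK(cudaMemcpyFromSymbol(h_dst, gnArray1, sizeof(h_dst)));
  for (int i = 0; i < N; ++i) {
    if (h_dst[i] != h_src[i]) {
      printf("mismatch at %d\n", i);
      return 1;
    }
  }
  printf("copy OK\n");
  return 0;
}
```

On the question of avoiding addresses: the runtime API documentation lists cudaMemcpyDeviceToDevice as a valid kind for cudaMemcpyToSymbol, so it looks possible to resolve only the source address and pass the destination as a symbol. The source still has to be a device pointer, so at least one address seems to be needed.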

On the device, it should be possible to use memcpy (gnArray1 and gnArray2 behave as pointers when referenced in device code), but it will be slow. Instead, you should use a fast copy kernel.
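For illustration, a device-side memcpy on the two arrays could look like the sketch below. The kernel name copy_with_device_memcpy, the single-thread launch, and the host-side test data are assumptions made for this example.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define N 10000

__device__ int gnArray1[N];
__device__ int gnArray2[N];

// Hypothetical kernel: the calling thread copies the whole array serially.
// In device code the array names are used directly, like ordinary arrays.
__global__ void copy_with_device_memcpy() {
  memcpy(gnArray1, gnArray2, sizeof(int) * N);
}

int main() {
  static int h_src[N], h_dst[N];
  for (int i = 0; i < N; ++i) h_src[i] = 2 * i;

  // Fill the source array from the host.
  if (cudaMemcpyToSymbol(gnArray2, h_src, sizeof(h_src)) != cudaSuccess) return 1;

  // Launch exactly one thread so the copy is performed once.
  copy_with_device_memcpy<<<1, 1>>>();
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Read the destination back and verify.
  if (cudaMemcpyFromSymbol(h_dst, gnArray1, sizeof(h_dst)) != cudaSuccess) return 1;
  for (int i = 0; i < N; ++i) {
    if (h_dst[i] != h_src[i]) {
      printf("mismatch at %d\n", i);
      return 1;
    }
  }
  printf("device memcpy OK\n");
  return 0;
}
```

If the kernel were launched with more than one thread, every thread would repeat the same full copy, which is redundant work rather than a speedup.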

Thanks txbob.
I have two questions about your answer.
First, why is memcpy called on the device slower than cudaMemcpy called from the host, even though both are device-to-device copies?
Second, what is a fast copy kernel?

memcpy called from device code runs in a single device thread, doing a serial copy. This is not fast. A device-to-device memcpy called from the host uses either a fast copy kernel or the hardware copy engines on the GPU, both of which are fast.

A fast copy kernel looks something like this:

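template <typename T>
__global__ void mycopy(T *dst, T *src, size_t dsize){

  // grid-stride loop: each thread copies every (gridDim.x*blockDim.x)-th element
  size_t idx = threadIdx.x + blockDim.x*blockIdx.x;
  while (idx < dsize){
    dst[idx] = src[idx];
    idx += gridDim.x*blockDim.x;}
}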
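To apply a copy kernel of this kind to the two global arrays from the question, one option is to resolve the symbol addresses on the host and pass them as kernel arguments. The sketch below is self-contained and restates the kernel with the element assignment filled in. The launch configuration (256 threads per block), the const qualifier on src, and the test data are assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 10000

__device__ int gnArray1[N];
__device__ int gnArray2[N];

// Grid-stride copy: thread idx copies elements idx, idx + stride, ...
template <typename T>
__global__ void mycopy(T *dst, const T *src, size_t dsize) {
  size_t idx = threadIdx.x + (size_t)blockDim.x * blockIdx.x;
  size_t stride = (size_t)gridDim.x * blockDim.x;
  for (; idx < dsize; idx += stride) dst[idx] = src[idx];
}

int main() {
  static int h_src[N], h_dst[N];
  for (int i = 0; i < N; ++i) h_src[i] = 3 * i;
  if (cudaMemcpyToSymbol(gnArray2, h_src, sizeof(h_src)) != cudaSuccess) return 1;

  // Host code cannot pass the symbols directly as kernel arguments,
  // so resolve them to device pointers first.
  int *d_dst = nullptr, *d_src = nullptr;
  if (cudaGetSymbolAddress((void **)&d_dst, gnArray1) != cudaSuccess) return 1;
  if (cudaGetSymbolAddress((void **)&d_src, gnArray2) != cudaSuccess) return 1;

  // Assumed launch configuration: enough 256-thread blocks to cover N.
  const int threads = 256;
  const int blocks = (N + threads - 1) / threads;
  mycopy<<<blocks, threads>>>(d_dst, d_src, (size_t)N);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Read the destination back and verify.
  if (cudaMemcpyFromSymbol(h_dst, gnArray1, sizeof(h_dst)) != cudaSuccess) return 1;
  for (int i = 0; i < N; ++i) {
    if (h_dst[i] != h_src[i]) {
      printf("mismatch at %d\n", i);
      return 1;
    }
  }
  printf("kernel copy OK\n");
  return 0;
}
```

Because of the grid-stride loop, the kernel stays correct even if fewer blocks are launched than needed to cover the array. Adjacent threads access adjacent elements, so the loads and stores can coalesce.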

I got it, thank you so much.