copy global memory by kernel threads

Hi,

I need to copy one array in global memory to another array in global memory by CUDA threads (not from the host).

My code is as follows:

__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int start, end;
  start = some_func(idx);
  end = another_func(idx);
  unsigned int i;
  for (i = start; i < end; i++) {
      g_data2[i] = g_data1[idx];
  }
}

It is very inefficient because for some idx the [start, end) range is very large, so that one thread has to issue far too many copies. Is there a way to implement this efficiently?

Thank you,

Zheng

It is very inefficient because your reads and writes are uncoalesced.

What you should do:

  • Make sure that you have enough threads to fill the card (at least 2000 for a decent card, and quite a few people would say that even that isn't enough)

  • Make sure that threads in the same warp copy the same number of elements if possible

  • Make sure that threads in the same half-warp copy consecutive memory, i.e. the loop should look something like:

__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int start, end;
  start = some_func(idx);
  end = another_func(idx);
  unsigned int i;
  for (i = start; i < end; i += blockDim.x) {
      g_data2[i] = g_data1[idx];
  }
}

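You'll need to adjust start and end appropriately, though, so that each thread begins at its own offset and the threads of a half-warp touch consecutive addresses.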

  • Also, in your code each thread writes the same value to every location in its range (it duplicates data), so you should read that value into a register once. That gives the following code (the writes are still not coalesced, but if you really intend to do what this code does, I'm not sure there is another way):
__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int start, end;
  start = some_func(idx);
  end = another_func(idx);
  int data = g_data1[idx];
  unsigned int i;
  for (i = start; i < end; i += blockDim.x) {
      g_data2[i] = data;
  }
}
  • You could also get somewhat better performance in this case by writing larger blocks of data. The smallest write size is 32 bytes, so writing smaller non-consecutive chunks wastes bandwidth. You could reach half of that (16 bytes per write) by using int4, and you can get closer by creating a custom int8 type.
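
To make the advice above a bit more concrete, here is a minimal sketch of one way to arrange it, not necessarily what the original code needs. It assumes one block per source element, and it assumes the ranges are available as two arrays, starts and ends, which are hypothetical stand-ins for some_func and another_func. The kernel name fill_ranges and the host helper run_fill are likewise made up for illustration. Each thread of a block starts at its own offset and strides by blockDim.x, so neighbouring threads write neighbouring addresses and no single thread owns a huge range.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One block per source element (an assumption of this sketch).
// The block's threads cooperate on the range [starts[b], ends[b]).
// starts/ends are hypothetical replacements for some_func/another_func.
__global__ void fill_ranges(const int *src, int *dst,
                            const int *starts, const int *ends)
{
    int b = blockIdx.x;        // which source element this block handles
    int value = src[b];        // read once, keep in a register
    int end = ends[b];
    // Thread t writes start+t, start+t+blockDim.x, ... so that
    // consecutive threads touch consecutive addresses.
    for (int i = starts[b] + threadIdx.x; i < end; i += blockDim.x) {
        dst[i] = value;
    }
}

// Illustrative host helper: copies the inputs to the device, launches the
// kernel, and returns the filled destination array (zero where untouched).
// Error checking is omitted to keep the sketch short.
std::vector<int> run_fill(const std::vector<int> &src,
                          const std::vector<int> &starts,
                          const std::vector<int> &ends,
                          int dst_len, int threads_per_block = 128)
{
    std::vector<int> out(dst_len, 0);
    int n = static_cast<int>(src.size());
    if (n == 0 || dst_len == 0) return out;   // nothing to launch

    int *d_src, *d_dst, *d_starts, *d_ends;
    cudaMalloc(&d_src, n * sizeof(int));
    cudaMalloc(&d_starts, n * sizeof(int));
    cudaMalloc(&d_ends, n * sizeof(int));
    cudaMalloc(&d_dst, dst_len * sizeof(int));

    cudaMemcpy(d_src, src.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_starts, starts.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ends, ends.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_dst, 0, dst_len * sizeof(int));

    fill_ranges<<<n, threads_per_block>>>(d_src, d_dst, d_starts, d_ends);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_dst, dst_len * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_src);
    cudaFree(d_starts);
    cudaFree(d_ends);
    cudaFree(d_dst);
    return out;
}
```

For example, run_fill({7, 9}, {0, 3}, {3, 10}, 10) would fill elements 0..2 with 7 and elements 3..9 with 9. A block per element only pays off when the ranges are long. If many ranges are shorter than a warp, assigning one warp (or fewer threads) per range is probably a better fit.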