copy global memory by CUDA threads


I need to copy one array in global memory to another array in global memory by CUDA threads (not from the host).

My code is as follows:

__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  int start, end;
  start = some_func(idx);
  end = another_func(idx);

  // Each thread writes its own value to every slot in [start, end).
  int i;
  for (i = start; i < end; i++) {
      g_data2[i] = g_data1[idx];
  }
}



It is very inefficient because, for some idx, the [start, end) range is very large, so that one thread has to issue far too many copy operations. Is there any way to implement this efficiently?

Thank you,


Your approach is going to result in non-coalesced memory access, which will slow things down a lot (in any case where a thread copies more than one element).

CUDA supports more threads in a single launch than there are bytes in the largest GPU memory you can have at the moment, so a simple approach is to have each thread copy only one value and launch more threads.

E.g., if g_data1 and g_data2 are both byte arrays with 1,000,000,000 elements and your block size is, say, 256, then you will need about 4 million blocks (1,000,000,000 / 256 = 3,906,250, which is far less than the maximum grid size of 64k x 64k supported by CUDA).

New code will be something like this

__global__ void copy_kernel(int *g_data1, int *g_data2, int SizeOfArrays)
{
  int idx = ...... ;  // Let you work this line out

  if ( idx < SizeOfArrays) {
    g_data2[idx] = g_data1[idx];
  }
}
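
For anyone who wants something to compile, here is a rough sketch of the one-element-per-thread idea. It is only one possible way to build the index, not necessarily what was intended above. The 2D grid and the 65535-per-dimension limit are assumptions (that limit applies to older devices), and the names copy_one_per_thread and run_copy_check are made up for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One element per thread. The linear index is flattened from a 2D grid so the
// total block count can exceed the per-dimension limit assumed below.
__global__ void copy_one_per_thread(const int *src, int *dst, size_t n)
{
    size_t block = (size_t)blockIdx.y * gridDim.x + blockIdx.x;
    size_t idx = block * blockDim.x + threadIdx.x;
    if (idx < n) {               // threads past the end of the array do nothing
        dst[idx] = src[idx];
    }
}

// Hypothetical host helper: copies n ints on the device and verifies the result.
bool run_copy_check(size_t n)
{
    if (n == 0) return true;

    const unsigned blockSize = 256;
    const size_t maxGridX = 65535;   // assumption: per-dimension grid limit
    size_t blocks = (n + blockSize - 1) / blockSize;      // round up
    size_t gx = blocks < maxGridX ? blocks : maxGridX;
    size_t gy = (blocks + gx - 1) / gx;                   // round up
    dim3 grid((unsigned)gx, (unsigned)gy);

    std::vector<int> h_src(n), h_dst(n, -1);
    for (size_t i = 0; i < n; ++i) h_src[i] = (int)(i % 1000);

    int *d_src = nullptr, *d_dst = nullptr;
    if (cudaMalloc(&d_src, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_dst, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_src);
        return false;
    }
    cudaMemcpy(d_src, h_src.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    copy_one_per_thread<<<grid, blockSize>>>(d_src, d_dst, n);
    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = ok && (cudaMemcpy(h_dst.data(), d_dst, n * sizeof(int),
                           cudaMemcpyDeviceToHost) == cudaSuccess);

    cudaFree(d_src);
    cudaFree(d_dst);
    return ok && (h_src == h_dst);
}

int main()
{
    printf("copy %s\n", run_copy_check(1000000) ? "ok" : "FAILED");
    return 0;
}
```

Neighbouring threads read and write neighbouring addresses here, which is what lets the hardware coalesce the accesses. Note that this is a plain element-for-element copy; it does not reproduce the [start, end) ranges from the original question.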



You need to redesign your program, maybe by splitting the kernel or rearranging the work.

To elaborate on what Lev said, you want to redesign your code so that full blocks (or at least full warps) execute a transfer, not single threads on their own. That way memory accesses can be coalesced and become much more efficient.
If that is difficult to achieve, you can still redistribute the work during kernel execution: have the single threads just store start and end of the copy, __syncthreads(), do the copying using full warps at a time, __syncthreads() again, and continue your kernel.
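
A minimal sketch of that redistribution, under some assumptions: the ranges are staged in shared memory, the block size is fixed at 256 (a multiple of the warp size), and range_start / range_end are hypothetical stand-ins for some_func / another_func that give source element idx the destination range [idx * len, (idx + 1) * len). The names warp_copy_kernel and run_warp_copy_check are invented for this example too.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // assumption: kernel is launched with exactly this block size
#define WARP_SIZE 32

// Hypothetical stand-ins for some_func / another_func from the question.
__device__ int range_start(int idx, int len) { return idx * len; }
__device__ int range_end(int idx, int len)   { return (idx + 1) * len; }

__global__ void warp_copy_kernel(const int *g_data1, int *g_data2, int n, int len)
{
    __shared__ int s_start[BLOCK_SIZE];
    __shared__ int s_end[BLOCK_SIZE];
    __shared__ int s_val[BLOCK_SIZE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: each thread only records its own job (range and value).
    bool active = idx < n;
    s_start[threadIdx.x] = active ? range_start(idx, len) : 0;
    s_end[threadIdx.x]   = active ? range_end(idx, len)   : 0;  // empty range if inactive
    s_val[threadIdx.x]   = active ? g_data1[idx]          : 0;
    __syncthreads();

    // Phase 2: each warp takes whole jobs; its 32 lanes write consecutive
    // addresses, so the stores within a job can be coalesced.
    int lane = threadIdx.x % WARP_SIZE;
    int warp = threadIdx.x / WARP_SIZE;
    int numWarps = blockDim.x / WARP_SIZE;
    for (int job = warp; job < blockDim.x; job += numWarps) {
        int v = s_val[job];
        for (int i = s_start[job] + lane; i < s_end[job]; i += WARP_SIZE) {
            g_data2[i] = v;
        }
    }
    __syncthreads();   // only needed if the kernel continues and reuses the shared arrays
}

// Hypothetical host helper: runs the kernel and checks every output element.
bool run_warp_copy_check(int n, int len)
{
    if (n <= 0 || len <= 0) return true;
    size_t outCount = (size_t)n * len;

    std::vector<int> h_src(n), h_dst(outCount, -1);
    for (int i = 0; i < n; ++i) h_src[i] = i * 7 + 1;

    int *d_src = nullptr, *d_dst = nullptr;
    if (cudaMalloc(&d_src, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_dst, outCount * sizeof(int)) != cudaSuccess) {
        cudaFree(d_src);
        return false;
    }
    cudaMemcpy(d_src, h_src.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    warp_copy_kernel<<<blocks, BLOCK_SIZE>>>(d_src, d_dst, n, len);
    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = ok && (cudaMemcpy(h_dst.data(), d_dst, outCount * sizeof(int),
                           cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_src);
    cudaFree(d_dst);

    for (size_t i = 0; ok && i < outCount; ++i) {
        ok = (h_dst[i] == h_src[i / len]);
    }
    return ok;
}

int main()
{
    printf("warp copy %s\n", run_warp_copy_check(1000, 37) ? "ok" : "FAILED");
    return 0;
}
```

The jobs are handed to warps round-robin, which is the simplest choice but does not balance the load when a few ranges are much longer than the rest. In that case it may be worth letting the whole block cooperate on each range instead.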