copy global memory by kernel threads

smilee · January 16, 2011, 10:14pm

Hi,

I need to copy one array in global memory to another array in global memory by CUDA threads (not from the host).

My code is as follows:

__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int start, end;
  start = some_func(idx);
  end = another_func(idx);
  unsigned int i;
  for (i = start; i < end; i++) {
      g_data2[i] = g_data1[idx];
  }
}

It is very inefficient because for some idx, the [start, end] region is very large, which makes that thread issue too many copy commands. Is there any way to implement it efficiently?

Thank you,

Zheng

laughingrice · January 23, 2011, 7:58am

It is very inefficient because you are doing uncoalesced reads and writes.

What you should do:

Make sure that you have enough threads to fill the card (at least 2000 for a decent card, and quite a few people would say that even that isn’t enough).
Make sure that threads in the same warp copy the same number of elements, if possible.
Make sure that threads in the same half-warp copy consecutive memory, i.e. the loop should look something like

__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int start, end;
  start = some_func(idx);
  end = another_func(idx);
  unsigned int i;
  for (i = start; i < end; i += blockDim.x) {
      g_data2[i] = g_data1[idx];
  }
}

Although you’ll need to adjust start and end appropriately.
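One way to read the advice above (consecutive addresses within a warp, similar work per thread) is to give each range to a whole warp instead of a single thread. The following is only a sketch under my own assumptions: the arrays starts and ends are hypothetical stand-ins for some_func and another_func, fill_ranges_warp and fill_ranges_host are made-up names, blockDim.x is assumed to be a multiple of 32, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Assumption: range k is [starts[k], ends[k]) and receives the value src[k].
// starts/ends replace some_func/another_func from the original post.
__global__ void fill_ranges_warp(const int *src, int *dst,
                                 const int *starts, const int *ends,
                                 int num_ranges)
{
    const int lane = threadIdx.x % warpSize;
    const int warps_per_block = blockDim.x / warpSize;
    const int warp = blockIdx.x * warps_per_block + threadIdx.x / warpSize;
    const int num_warps = gridDim.x * warps_per_block;

    // Each warp takes whole ranges, so a long range is shared by 32 lanes.
    for (int k = warp; k < num_ranges; k += num_warps) {
        const int value = src[k];  // read once, keep in a register
        // Neighbouring lanes write neighbouring addresses.
        for (int i = starts[k] + lane; i < ends[k]; i += warpSize) {
            dst[i] = value;
        }
    }
}

// Host helper for trying the kernel out; no error checking for brevity.
std::vector<int> fill_ranges_host(const std::vector<int> &src,
                                  const std::vector<int> &starts,
                                  const std::vector<int> &ends,
                                  int dst_len)
{
    const int n = static_cast<int>(src.size());
    int *d_src, *d_starts, *d_ends, *d_dst;
    cudaMalloc(&d_src, n * sizeof(int));
    cudaMalloc(&d_starts, n * sizeof(int));
    cudaMalloc(&d_ends, n * sizeof(int));
    cudaMalloc(&d_dst, dst_len * sizeof(int));
    cudaMemcpy(d_src, src.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_starts, starts.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ends, ends.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_dst, 0, dst_len * sizeof(int));  // untouched cells stay 0

    const int threads = 128;                      // 4 warps per block
    const int blocks = std::max(1, (n + 3) / 4);  // about one warp per range
    fill_ranges_warp<<<blocks, threads>>>(d_src, d_dst, d_starts, d_ends, n);

    std::vector<int> dst(dst_len);
    cudaMemcpy(dst.data(), d_dst, dst_len * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_starts);
    cudaFree(d_ends);
    cudaFree(d_dst);
    return dst;
}
```

With this layout a very long range costs each lane about 1/32 of the stores, and the lanes of a warp always hit consecutive addresses. Whether it beats the one-thread-per-range version depends on how long the ranges typically are, so it is worth timing both.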

Also, with your code each thread writes the same value to every memory location (it duplicates data), so you should read that value into a register once. That results in the following code (still no write coalescing, but if you are doing what you think you are doing then I’m not sure there is another way).

__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int start, end;
  start = some_func(idx);
  end = another_func(idx);
  int data = g_data1[idx];
  unsigned int i;
  for (i = start; i < end; i += blockDim.x) {
      g_data2[i] = data;
  }
}

You could also get somewhat better performance in this case by writing larger blocks of data. The smallest write size is 32 bytes, so writing smaller non-consecutive chunks wastes bandwidth. You could fill half of that with each write by using int4 (16 bytes), and get closer still by creating a custom 8-int type.
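As a rough illustration of the int4 idea, here is a sketch that keeps one thread per range but issues 16-byte stores for the aligned middle of each range. The names fill_range_int4, fill_ranges_vec4 and fill_ranges_vec4_host are made up, the starts and ends arrays are hypothetical stand-ins for some_func and another_func, start is assumed to be non-negative, and dst is assumed to come from cudaMalloc so that index multiples of 4 land on 16-byte boundaries.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Fill dst[start, end) with value. Assumes start >= 0 and that dst itself
// is 16-byte aligned (true for pointers returned by cudaMalloc).
__device__ void fill_range_int4(int *dst, int start, int end, int value)
{
    int i = start;
    // Scalar stores until the index reaches a multiple of 4 ints (16 bytes).
    while (i < end && (i & 3) != 0) {
        dst[i++] = value;
    }
    // One int4 store covers four consecutive ints.
    const int4 v = make_int4(value, value, value, value);
    for (; i + 4 <= end; i += 4) {
        *reinterpret_cast<int4 *>(dst + i) = v;
    }
    // Scalar stores for the remaining 0 to 3 elements.
    for (; i < end; ++i) {
        dst[i] = value;
    }
}

// One thread per range; starts/ends replace some_func/another_func.
__global__ void fill_ranges_vec4(const int *src, int *dst,
                                 const int *starts, const int *ends,
                                 int num_ranges)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_ranges) {
        fill_range_int4(dst, starts[idx], ends[idx], src[idx]);
    }
}

// Host helper for trying the kernel out; no error checking for brevity.
std::vector<int> fill_ranges_vec4_host(const std::vector<int> &src,
                                       const std::vector<int> &starts,
                                       const std::vector<int> &ends,
                                       int dst_len)
{
    const int n = static_cast<int>(src.size());
    int *d_src, *d_starts, *d_ends, *d_dst;
    cudaMalloc(&d_src, n * sizeof(int));
    cudaMalloc(&d_starts, n * sizeof(int));
    cudaMalloc(&d_ends, n * sizeof(int));
    cudaMalloc(&d_dst, dst_len * sizeof(int));
    cudaMemcpy(d_src, src.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_starts, starts.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ends, ends.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_dst, 0, dst_len * sizeof(int));  // untouched cells stay 0

    const int threads = 128;
    const int blocks = std::max(1, (n + threads - 1) / threads);
    fill_ranges_vec4<<<blocks, threads>>>(d_src, d_dst, d_starts, d_ends, n);

    std::vector<int> dst(dst_len);
    cudaMemcpy(dst.data(), d_dst, dst_len * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_starts);
    cudaFree(d_ends);
    cudaFree(d_dst);
    return dst;
}
```

The head and tail loops are there only because a range may start or end off a 16-byte boundary; if your ranges are always multiples of four ints the scalar parts can be dropped.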
