memcpy() in a device function? how is it implemented, how does it perform?

Hi everybody.

Just this week I embarrassed myself by pointing out to a student intern that CUDA probably won’t support the memcpy() function on the device. But to my surprise, his code compiled.

So how is memcpy() implemented? Is it an nvcc compiler intrinsic, is it a device function? Is it optimized in any way to execute fast on the device?

If you look at the PTX, it expands to a simple loop. It does nothing smart with multiple threads (such as interleaving accesses so that reads coalesce). Except for copies within local memory, it doesn't seem very useful.


__global__ void memcpytest(float *a, float *b)
{
	memcpy(a, b, sizeof(float) * 10);
}

		ld.param.u64	%rd1, [__cudaparm___globfunc__Z10memcpytestPfS__b];	// id:12 __cudaparm___globfunc__Z10memcpytestPfS__b+0x0
		ld.param.u64	%rd2, [__cudaparm___globfunc__Z10memcpytestPfS__a];	// id:11 __cudaparm___globfunc__Z10memcpytestPfS__a+0x0
		mov.s32		%r1, 0;
		mov.s16		%rh1, 10;
$L_0_1:						//<loop> Loop body line 3, nesting depth: 1, iterations: 10
		cvt.u64.s32	%rd3, %r1;
		add.u64		%rd4, %rd3, %rd1;
		ld.global.u32	%r2, [%rd4+0];	// id:14
		cvt.u64.s32	%rd5, %r1;
		add.u64		%rd6, %rd5, %rd2;
		st.global.u32	[%rd6+0], %r2;	// id:13
		add.s32		%r1, %r1, 4;
		sub.s16		%rh1, %rh1, 1;
		mov.u16		%rh2, 0;
		setp.ne.s16	%p1, %rh1, %rh2;
		@%p1 bra	$L_0_1;
		.loc	14	4	0
		exit;

Ugh, in your example the compiler definitely knew the size of the copy operation. Why didn’t it unroll the loop, at least partially?
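When the size is a compile-time constant, you can at least write the copy as a plain per-thread loop and ask nvcc to unroll it. This is just a sketch of the workaround, not what the built-in memcpy() generates; the function name copy10 is made up for illustration:

```cuda
// Hand-written alternative to memcpy() for a known, fixed size.
// #pragma unroll asks nvcc to fully unroll the 10 iterations,
// removing the loop counter and branch seen in the PTX above.
__device__ void copy10(float *dst, const float *src)
{
    #pragma unroll
    for (int i = 0; i < 10; ++i)
        dst[i] = src[i];
}
```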

EDIT: I wasn’t really expecting this memcpy() to be optimized for multiple threads (a smart compiler could conceivably figure out whether the arguments passed to memcpy are identical for all threads or thread-specific, and optimize the internal workings of the memcpy accordingly). But the fact that it is implemented as a simple loop is a little disappointing.
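For the case where all threads in a block do pass identical arguments, the kind of cooperative copy the compiler doesn't generate can be written by hand. A sketch under that assumption (blockMemcpy is a made-up helper name; consecutive threads touch consecutive words, so the global loads and stores coalesce):

```cuda
// Block-cooperative copy: every thread in the block must call this
// with the same dst, src, and n. Thread t copies elements
// t, t + blockDim.x, t + 2*blockDim.x, ... so that within a warp the
// accesses hit consecutive addresses and coalesce.
__device__ void blockMemcpy(float *dst, const float *src, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        dst[i] = src[i];
}

__global__ void memcpytest2(float *a, float *b)
{
    blockMemcpy(a, b, 10);   // all threads cooperate on one 10-float copy
}
```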