memcpy() in a device function? how is it implemented, how does it perform?

Hi everybody.

Just this week I embarrassed myself by pointing out to a student intern that CUDA probably won’t support the memcpy() function on the device. But to my surprise, his code compiled.

So how is memcpy() implemented? Is it an nvcc compiler intrinsic, is it a device function? Is it optimized in any way to execute fast on the device?

If you look at the PTX, it expands to a simple loop. It does nothing smart with multiple threads (such as interleaving accesses so that reads coalesce). Except for copies within local memory, it doesn't seem very useful.


__global__ void memcpytest(float *a, float *b)
{
	memcpy(a, b, sizeof(float) * 10);
}

		ld.param.u64	%rd1, [__cudaparm___globfunc__Z10memcpytestPfS__b];	// id:12 __cudaparm___globfunc__Z10memcpytestPfS__b+0x0
		ld.param.u64	%rd2, [__cudaparm___globfunc__Z10memcpytestPfS__a];	// id:11 __cudaparm___globfunc__Z10memcpytestPfS__a+0x0
		mov.s32		%r1, 0;
		mov.s16		%rh1, 10;
$L_0_1:						//<loop> Loop body line 3, nesting depth: 1, iterations: 10
		cvt.u64.s32	%rd3, %r1;
		add.u64		%rd4, %rd3, %rd1;
		ld.global.u32	%r2, [%rd4+0];	// id:14
		cvt.u64.s32	%rd5, %r1;
		add.u64		%rd6, %rd5, %rd2;
		st.global.u32	[%rd6+0], %r2;	// id:13
		add.s32		%r1, %r1, 4;
		sub.s16		%rh1, %rh1, 1;
		mov.u16		%rh2, 0;
		setp.ne.s16	%p1, %rh1, %rh2;
		@%p1 bra	$L_0_1;
		.loc	14	4	0
		exit;

Ugh, in your example the compiler definitely knew the size of the copy operation. Why didn’t it unroll the loop, at least partially?
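When the size is a compile-time constant, you can at least write the copy as a plain per-thread loop and ask nvcc to unroll it. This is just a sketch of the workaround, not what the built-in memcpy() generates; the function name copy10 is made up for illustration:

```cuda
// Hand-written alternative to memcpy() for a known, fixed size.
// #pragma unroll asks nvcc to fully unroll the 10 iterations,
// removing the loop counter and branch seen in the PTX above.
__device__ void copy10(float *dst, const float *src)
{
    #pragma unroll
    for (int i = 0; i < 10; ++i)
        dst[i] = src[i];
}
```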

EDIT: I wasn’t really expecting this memcpy() to be optimized for multiple threads (a smart compiler could conceivably figure out whether the arguments passed to memcpy are identical for all threads or thread-specific, and optimize the internal workings of the memcpy accordingly). But the fact that it is implemented as a simple loop is a little disappointing.
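For the case where all threads in a block do pass identical arguments, the kind of cooperative copy the compiler doesn't generate can be written by hand. A sketch under that assumption (blockMemcpy is a made-up helper name; consecutive threads touch consecutive words, so the global loads and stores coalesce):

```cuda
// Block-cooperative copy: every thread in the block must call this
// with the same dst, src, and n. Thread t copies elements
// t, t + blockDim.x, t + 2*blockDim.x, ... so that within a warp the
// accesses hit consecutive addresses and coalesce.
__device__ void blockMemcpy(float *dst, const float *src, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        dst[i] = src[i];
}

__global__ void memcpytest2(float *a, float *b)
{
    blockMemcpy(a, b, 10);   // all threads cooperate on one 10-float copy
}
```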