Time for copying array to device

shyboy_6104 · April 19, 2012, 1:30am

I am writing a program in which I need to copy a sub-array from the host to the device, but the performance is poor. So I tested the memory copy performance with the following program.

	module test	
	   use cudafor
	contains
	   attributes(global) subroutine cal_I(a,b)
	       real :: a(90,90,90,3),b(90,90,90,3)
		
	   end subroutine cal_I
	end module test
	
	program main
	use cudafor
	use test
			
	type ( cudaEvent ) :: startEvent , stopEvent, StepStartEvt, StepStopEvt
	real(4) :: time
	integer :: istat
	
	real :: a(100,100,100,3),b(100,100,100,3)
	real :: a1(90,90,90,3),b1(90,90,90,3)
	real, device :: dev_a1(90,90,90,3), dev_b1(90,90,90,3)
	integer :: i
	a=1.0
	b=2.0
	! a1 and b1 are copied in cases 3 and 4, so give them defined values too
	a1=1.0
	b1=2.0
	   
	istat = cudaEventCreate ( startEvent )
	istat = cudaEventCreate ( stopEvent )
	
	istat = cudaEventRecord ( startEvent , 0)
	
	dev_a1(:,:,:,:) = a(1:90,1:90,1:90,:)
	dev_b1(:,:,:,:) = b(1:90,1:90,1:90,:)
	
	istat = cudaEventRecord ( stopEvent , 0)
	istat = cudaEventSynchronize ( stopEvent )
	istat = cudaEventElapsedTime (time , startEvent , stopEvent )
	write (* ,*) ' Time for copy1: ', time
	
	istat = cudaEventRecord ( startEvent , 0)
	do i=1,3
	   dev_a1(:,:,:,i) = a(1:90,1:90,1:90,i)
	   dev_b1(:,:,:,i) = b(1:90,1:90,1:90,i)
	enddo
	istat = cudaEventRecord ( stopEvent , 0)
	istat = cudaEventSynchronize ( stopEvent )
	istat = cudaEventElapsedTime (time , startEvent , stopEvent )
	write (* ,*) ' Time for copy2: ', time
	
	istat = cudaEventRecord ( startEvent , 0)
	dev_a1 = a1
	dev_b1 = b1
	istat = cudaEventRecord ( stopEvent , 0)
	istat = cudaEventSynchronize ( stopEvent )
	istat = cudaEventElapsedTime (time , startEvent , stopEvent )
	write (* ,*) ' Time for copy3: ', time
	
	istat = cudaEventRecord ( startEvent , 0)
	do i=1,3
	   dev_a1(:,:,:,i) = a1(:,:,:,i)
	   dev_b1(:,:,:,i) = b1(:,:,:,i)
	enddo
	istat = cudaEventRecord ( stopEvent , 0)
	istat = cudaEventSynchronize ( stopEvent )
	istat = cudaEventElapsedTime (time , startEvent , stopEvent )
	write (* ,*) ' Time for copy4: ', time
		
	! The chevron syntax takes both a grid and a block size
	call cal_I<<<1,1>>>(dev_a1,dev_b1)

	end program main

In the first case, I copy the sub-arrays without a loop, while in the second case I copy them in a loop over the last dimension.
In the third case, I copy the entire arrays without a loop, while in the fourth case I copy them in a loop. I ran the program on an M2050 and got the following results (times in milliseconds, as returned by cudaEventElapsedTime).

	Time for copy1:     28.94928
	Time for copy2:     44.00723
	Time for copy3:     4.118048
	Time for copy4:     43.05971

According to the results, copying the entire array takes the least time. So should I first copy the sub-array into an array on the host, and then copy that entire array to the device?

MatColgrove · April 19, 2012, 4:34pm

Hi shyboy_6104,

There is a significant amount of overhead in copying data to and from the GPU. Hence, reducing the frequency of copies is important. However, since DMA transfers must be on contiguous data, copying sub-arrays often requires the compiler to create implicit DO loops that only copy small contiguous blocks.

In examples 1, 2, and 4, the compiler is most likely generating a triply nested DO loop. However, example 1 is copying three elements at a time, while examples 2 and 4 are copying one element at a time.
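As an aside that is not part of the advice in this thread: each column a(1:90,j,k,i) is a contiguous run of 90 elements inside a column of 100, so one plane of the sub-array has the shape of a pitched 2D copy. The sketch below assumes the CUDA Fortran cudaMemcpy2D interface, with pitch, width, and height given in elements, and uses a hypothetical program name. It is untimed, so whether it beats the implicit loops on a given system would need to be measured.

```fortran
program memcpy2d_section
   use cudafor
   implicit none
   integer, parameter :: n = 100, m = 90   ! full extent and sub-array extent
   real, allocatable :: a(:,:,:,:)
   real, device, allocatable :: dev_a1(:,:,:,:)
   integer :: i, k, istat

   allocate(a(n,n,n,3), dev_a1(m,m,m,3))
   a = 1.0

   do i = 1, 3
      do k = 1, m
         ! Copy the plane a(1:m,1:m,k,i): m columns of m contiguous elements.
         ! Consecutive columns are n elements apart on the host (source pitch)
         ! and m elements apart on the device (destination pitch).
         istat = cudaMemcpy2D(dev_a1(1,1,k,i), m, a(1,1,k,i), n, m, m)
         if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)
      end do
   end do

   deallocate(a, dev_a1)
end program memcpy2d_section
```

This issues 3*90 = 270 pitched copies per array rather than one copy per contiguous run.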

My recommendation when copying sub-arrays is to do something similar to what you do in example #3. Create a temp host array having the same size as the sub-array (i.e. a1 and b1), gather the sub-array into the temp array, and then copy the temp array to the device in one contiguous block. Use a similar scatter method when copying back to the host.
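The gather and scatter steps described above might look like the following minimal sketch. The program name and the tmp buffer are hypothetical, and the arrays are made allocatable only to keep the large host arrays off the stack.

```fortran
program gather_copy
   use cudafor
   implicit none
   integer, parameter :: n = 100, m = 90   ! full extent and sub-array extent
   real, allocatable :: a(:,:,:,:), tmp(:,:,:,:)
   real, device, allocatable :: dev_a1(:,:,:,:)

   allocate(a(n,n,n,3), tmp(m,m,m,3), dev_a1(m,m,m,3))
   a = 1.0

   ! Gather: host-to-host copy of the strided section into a contiguous buffer
   tmp = a(1:m,1:m,1:m,:)
   ! One contiguous host-to-device transfer
   dev_a1 = tmp

   ! ... kernels would operate on dev_a1 here ...

   ! One contiguous device-to-host transfer, then scatter back on the host
   tmp = dev_a1
   a(1:m,1:m,1:m,:) = tmp

   print *, 'max abs difference after round trip: ', maxval(abs(a - 1.0))
   deallocate(a, tmp, dev_a1)
end program gather_copy
```

The host-side gather costs an extra pass over the data, but it turns many small transfers into a single large one, which is the case that was fastest in the timings above.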

Though, if you have the device memory available, it may be just as fast to copy the entire original arrays and not worry about the gather and scatter operations.

Hope this helps,
Mat

shyboy_6104 · April 24, 2012, 2:37am

Hi Mat,
The recommendation is helpful, thanks.
In my program I need to copy irregular arrays to device memory, so I need a loop to do this.
I found another method, which uses pointers:

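	! Assumed declarations (not shown in the post): use iso_c_binding,
	! type(c_devptr) :: cdvx, type(c_ptr) :: chvx, and the target
	! attribute on a1 and b1 so that c_loc can be applied to them.
	do i=1,3
	   ! With c_devptr/c_ptr arguments the count is in bytes (4 bytes per real)
	   cdvx = c_devloc(dev_a1(1,1,1,i))
	   chvx = c_loc(a1(1,1,1,i))
	   istat = cudaMemcpy(cdvx, chvx, 90*90*90*4, cudaMemcpyHostToDevice)
	   cdvx = c_devloc(dev_b1(1,1,1,i))
	   chvx = c_loc(b1(1,1,1,i))
	   istat = cudaMemcpy(cdvx, chvx, 90*90*90*4, cudaMemcpyHostToDevice)
	enddo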

Compared with example #3, the new method has only a little overhead.
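A variation on the same idea, sketched here on the assumption that the CUDA Fortran cudaMemcpy interface is also overloaded for typed host and device arrays: in that form the count is given in elements rather than bytes, and no c_ptr or c_devptr variables are needed. The program name is hypothetical and the sketch is untimed.

```fortran
program memcpy_elements
   use cudafor
   implicit none
   integer, parameter :: m = 90
   real, allocatable :: a1(:,:,:,:)
   real, device, allocatable :: dev_a1(:,:,:,:)
   integer :: i, istat

   allocate(a1(m,m,m,3), dev_a1(m,m,m,3))
   a1 = 1.0

   do i = 1, 3
      ! Each slab a1(:,:,:,i) is contiguous, so a single call moves
      ! m*m*m elements starting at the given first element.
      istat = cudaMemcpy(dev_a1(1,1,1,i), a1(1,1,1,i), m*m*m, cudaMemcpyHostToDevice)
      if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)
   end do

   deallocate(a1, dev_a1)
end program memcpy_elements
```

Passing the first element of each slab works only because the slab is contiguous in memory. A strided section such as a(1:90,1:90,1:90,i) of a larger array cannot be copied this way.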
