DMA transfer mechanism

skb · March 21, 2008, 3:14pm

I coded a quick test to measure the time taken by a host-to-device async memcopy and a device-to-host async memcopy, each with its own source and destination addresses and both using pinned host memory. Each async memcopy was issued on a separate stream, and I used cudaEvents on stream 0 to measure the time taken for both to complete.

It seemed that the copies were not occurring simultaneously. The total time taken for both async calls to complete was the same as the sum of the times each call takes on its own.

I would assume that transfers to and from the board are done via separate DMA mechanisms, so I expected the total time for both to complete to be just the time of the longer of the two.

Am I missing something here, or is there a better way to perform this test? Here’s a code snippet. (Additionally, I divide the data into chunks to measure any rate changes. But you can assume that num_chunks = 1.)

// time the host-to-device and device-to-host memcopies together
cudaEventRecord(start_event, 0);
for (int chunk_id = 0; chunk_id < num_chunks; chunk_id++)
{
    // host -> device on streams[0]
    cudaMemcpyAsync((float *)(devArray_a + (chunk_id*chunk_size*2)),
        (float *)(hostArray_a + (chunk_id*chunk_size*2)),
        (chunk_size*2*sizeof(float)),
        cudaMemcpyHostToDevice, streams[0]);
    // device -> host on streams[1]
    cudaMemcpyAsync((float *)(hostArray_b + (chunk_id*chunk_size*2)),
        (float *)(devArray_b + (chunk_id*chunk_size*2)),
        (chunk_size*2*sizeof(float)),
        cudaMemcpyDeviceToHost, streams[1]);
}
cudaEventRecord(stop_event, 0);
cudaEventSynchronize(stop_event);

My board is an 8600 GTS and runs at 2.6 GB/s host to device and 1.8 GB/s device to host (with pinned memory).

Thanks for any insight on this.

skb
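For anyone wanting to reproduce the test described above, here is a minimal self-contained sketch. It is not the original poster's code: the buffer size, the helper names `time_copies` and `looks_serialized`, and the use of a host wall-clock timer around `cudaDeviceSynchronize` (instead of events on stream 0) are all assumptions made for illustration. It times each direction alone, then both together, and reports whether the combined time is closer to the sum or to the maximum.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when the combined time is closer to the sum of
// the single-direction times (serialized) than to their maximum (overlapped).
bool looks_serialized(double h2d_ms, double d2h_ms, double both_ms) {
    double sum = h2d_ms + d2h_ms;
    double longest = h2d_ms > d2h_ms ? h2d_ms : d2h_ms;
    return both_ms > 0.5 * (sum + longest);
}

// Hypothetical helper: wall-clock milliseconds for the requested copies.
// The device is idle on entry and exit because of the two synchronizations.
double time_copies(bool do_h2d, bool do_d2h,
                   float *host_a, float *dev_a, float *host_b, float *dev_b,
                   size_t bytes, cudaStream_t s0, cudaStream_t s1) {
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    if (do_h2d) cudaMemcpyAsync(dev_a, host_a, bytes, cudaMemcpyHostToDevice, s0);
    if (do_d2h) cudaMemcpyAsync(host_b, dev_b, bytes, cudaMemcpyDeviceToHost, s1);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per direction (arbitrary choice)
    float *host_a = nullptr, *host_b = nullptr, *dev_a = nullptr, *dev_b = nullptr;
    cudaStream_t s0, s1;

    // Pinned host buffers are required for the copies to be truly asynchronous.
    if (cudaMallocHost((void **)&host_a, bytes) != cudaSuccess ||
        cudaMallocHost((void **)&host_b, bytes) != cudaSuccess ||
        cudaMalloc((void **)&dev_a, bytes) != cudaSuccess ||
        cudaMalloc((void **)&dev_b, bytes) != cudaSuccess ||
        cudaStreamCreate(&s0) != cudaSuccess ||
        cudaStreamCreate(&s1) != cudaSuccess) {
        std::fprintf(stderr, "CUDA setup failed\n");
        return 1;
    }

    // Warm-up pass so one-time initialization cost is not measured.
    time_copies(true, true, host_a, dev_a, host_b, dev_b, bytes, s0, s1);

    double h2d  = time_copies(true,  false, host_a, dev_a, host_b, dev_b, bytes, s0, s1);
    double d2h  = time_copies(false, true,  host_a, dev_a, host_b, dev_b, bytes, s0, s1);
    double both = time_copies(true,  true,  host_a, dev_a, host_b, dev_b, bytes, s0, s1);

    std::printf("HtoD %.2f ms, DtoH %.2f ms, both %.2f ms: %s\n", h2d, d2h, both,
                looks_serialized(h2d, d2h, both) ? "serialized" : "overlapped");

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFreeHost(host_a);
    cudaFreeHost(host_b);
    return 0;
}
```

The host timer is used here only because it does not depend on how events recorded on stream 0 interact with the other streams; single runs are noisy, so averaging several repetitions would give a steadier answer.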

mfatica · March 21, 2008, 3:19pm

On this hardware generation (compute 1.1), you can do either HtoD or DtoH transfers and overlap them with kernel execution, but it is not possible to do HtoD and DtoH transfers at the same time.
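The limitation described above can be checked at run time. The sketch below assumes a CUDA runtime newer than the one current when this thread was written, since it relies on the `asyncEngineCount` field of `cudaDeviceProp`, which was added in a later toolkit. The helper name `describe_copy_overlap` and its wording are hypothetical, made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turn the asyncEngineCount property into a description.
// 0: no copy/kernel overlap, 1: one copy at a time can overlap kernels,
// 2 or more: copies in both directions can run concurrently.
const char *describe_copy_overlap(int async_engine_count) {
    if (async_engine_count >= 2)
        return "HtoD and DtoH copies can run at the same time";
    if (async_engine_count == 1)
        return "one copy direction at a time, overlapped with kernels";
    return "no overlap of copies with kernels";
}

int main() {
    const int device = 0;  // assumption: query the first device
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s (compute %d.%d): asyncEngineCount = %d, %s\n",
                prop.name, prop.major, prop.minor, prop.asyncEngineCount,
                describe_copy_overlap(prop.asyncEngineCount));
    return 0;
}
```

On a board like the one discussed in this thread, such a query would be expected to report at most one engine, which matches the serialized timings.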

skb · March 21, 2008, 3:27pm

Thank you for the clarification. :)

Is this feature planned (or on a wish list) for future revs?

skb
