DMA transfer mechanism

I coded a quick test to measure the time taken for a Host to Device async memcopy and a Device to Host async memcopy - both having different source and destination addresses and both using Host pinned memory. Each async memcopy was put on a separate stream, and I used cudaEvents on stream 0 to measure the time taken for both to complete.

It seemed that the copies were not occurring simultaneously. The total time taken for both async calls to complete was the same as the sum of the times for each call to complete.

I would assume that transfers to and from the board are handled by separate DMA mechanisms, so I would expect the total time for both to complete to be just the time of the longer of the two.

Am I missing something here, or is there a better way to perform this test? Here’s a code snippet. (Additionally, I divide the data into chunks to measure any rate changes. But you can assume that num_chunks = 1.)

	// Time one host-to-device and one device-to-host copy per chunk,
	// each issued on its own stream.
	cudaEventRecord(start_event, 0);

	for (int chunk_id = 0; chunk_id < num_chunks; chunk_id++)
	{
		// Braces are required so that both copies run for every chunk.
		cudaMemcpyAsync((float *)(devArray_a + (chunk_id*chunk_size*2)),
			(float *)(hostArray_a + (chunk_id*chunk_size*2)),
			(chunk_size*2*sizeof(float)),
			cudaMemcpyHostToDevice, streams[0]);

		cudaMemcpyAsync((float *)(hostArray_b + (chunk_id*chunk_size*2)),
			(float *)(devArray_b + (chunk_id*chunk_size*2)),
			(chunk_size*2*sizeof(float)),
			cudaMemcpyDeviceToHost, streams[1]);
	}

	cudaEventRecord(stop_event, 0);


My board is a GTS8600. With pinned memory it measures 2.6 GB/s host to device and 1.8 GB/s device to host.

Thanks for any insight on this.


On this hardware generation (compute capability 1.1), you can overlap either an HtoD or a DtoH transfer with kernel execution, but it is not possible to run HtoD and DtoH transfers at the same time.
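
If you want to check this on a particular board, here is a minimal sketch rather than a definitive test. It assumes a CUDA toolkit recent enough to provide cudaDeviceGetAttribute with cudaDevAttrAsyncEngineCount (newer than the toolkits that shipped for compute 1.1 parts), the legacy default-stream behaviour so that events recorded on stream 0 wait for the other streams, and an arbitrary 64 MiB buffer size. The names host_a, dev_a, run and so on are made up for the example. An engine count of 2 means the device can copy in both directions at once; with 0 or 1, the combined time should come out close to the sum of the two individual times.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = (16u << 20) * sizeof(float);  // 64 MiB per buffer (arbitrary)

    // 0: no copy/kernel overlap, 1: one direction at a time, 2: both directions at once.
    int engines = 0;
    CHECK(cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, 0));
    printf("async copy engines: %d\n", engines);

    // Pinned host buffers are required for truly asynchronous copies.
    float *host_a = nullptr, *host_b = nullptr, *dev_a = nullptr, *dev_b = nullptr;
    CHECK(cudaMallocHost((void **)&host_a, bytes));
    CHECK(cudaMallocHost((void **)&host_b, bytes));
    CHECK(cudaMalloc((void **)&dev_a, bytes));
    CHECK(cudaMalloc((void **)&dev_b, bytes));

    cudaStream_t s[2];
    CHECK(cudaStreamCreate(&s[0]));
    CHECK(cudaStreamCreate(&s[1]));
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Issue the requested copies between two events on stream 0 and
    // return the elapsed time in milliseconds.
    auto run = [&](bool h2d, bool d2h) -> float {
        cudaEventRecord(start, 0);
        if (h2d) cudaMemcpyAsync(dev_a, host_a, bytes, cudaMemcpyHostToDevice, s[0]);
        if (d2h) cudaMemcpyAsync(host_b, dev_b, bytes, cudaMemcpyDeviceToHost, s[1]);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);  // wait until the stop event has actually happened
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        return ms;
    };

    run(true, true);  // warm-up pass, result discarded
    float t_h2d  = run(true, false);
    float t_d2h  = run(false, true);
    float t_both = run(true, true);
    CHECK(cudaGetLastError());  // surface any error from the copies above

    printf("HtoD alone: %.3f ms\n", t_h2d);
    printf("DtoH alone: %.3f ms\n", t_d2h);
    printf("both:       %.3f ms (sum of the two: %.3f ms)\n", t_both, t_h2d + t_d2h);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFreeHost(host_a);
    cudaFreeHost(host_b);
    return 0;
}
```

Build it with nvcc and compare the "both" time against the two single-direction times. Note the cudaEventSynchronize call before reading the elapsed time: without it the stop event may not have been reached yet.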

Thank you for the clarification. :)

Is this feature planned (or on a wish list) for future revs?