cudaMemcpyAsync and pinned memory

himajyothi802 · August 31, 2021, 12:32pm

Hello,

I expected the transfer to take less time with pinned memory, but I'm getting almost the same time for both pageable and pinned memory.
Why is it not making any difference?

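#include <stdlib.h>
#include <cuda_runtime.h>

int *host_p;
int *dev_p;

int main(void) {
    int data_size = 4 * sizeof(int);

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);

    //host_p = (int *) malloc(data_size);   // pageable memory
    cudaMallocHost(&host_p, data_size);      // pinned memory
    cudaMalloc(&dev_p, data_size);

    /* Transfer data host_p --> dev_p (issued into the default stream, 0) */
    cudaMemcpyAsync(dev_p, host_p, data_size, cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);   // wait for the copy before releasing the buffers

    //free(host_p);             // use this instead of cudaFreeHost with the malloc variant
    cudaFreeHost(host_p);
    cudaFree(dev_p);
    cudaStreamDestroy(stream1);

    return 0;
}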

If I issue a cudaMemcpyAsync followed by a kernel that uses the same data, both in the default stream, will these two operations be serialized, or will the kernel execute with the previous data?

If I use different streams for these two operations, what behaviour can we expect?

Robert_Crovella · August 31, 2021, 2:05pm

It's not clear what you are timing. If you are timing the whole program and you use the pinned buffer only once, then the cost of pinning the memory will offset the time saved by the faster cudaMemcpyAsync operation. If you reuse the pinned buffer many times, you will see the speed benefit at the application level. If you carefully time just the copy operation, it should in general be faster from a pinned buffer.
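As a rough illustration of "timing just the copy", here is a minimal sketch that brackets repeated host-to-device copies with CUDA events and compares a pageable buffer against a pinned one. The helper name time_h2d_copy, the 64 MiB buffer size, and the repetition count are assumptions chosen for illustration; they are not from the original post, which copies only 16 bytes. Allocation and pinning happen outside the timed region, so only the transfers are measured.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: issues `reps` host-to-device copies of `bytes` bytes
// and returns the average time per copy in milliseconds (negative on error).
static float time_h2d_copy(void *dev_buf, const void *host_buf,
                           size_t bytes, int reps) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyAsync(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until all copies have finished

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? ms / reps : -1.0f;
}

int main(void) {
    const size_t bytes = (size_t)64 << 20;  // 64 MiB, an assumed test size
    const int reps = 10;                    // assumed repetition count

    void *dev_buf = NULL;
    void *pinned = NULL;
    void *pageable = malloc(bytes);
    if (pageable == NULL ||
        cudaMalloc(&dev_buf, bytes) != cudaSuccess ||
        cudaMallocHost(&pinned, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    // Touch both buffers so their pages are resident before timing.
    memset(pageable, 0, bytes);
    memset(pinned, 0, bytes);

    // One untimed warm-up copy so one-time setup costs are not measured.
    cudaMemcpy(dev_buf, pinned, bytes, cudaMemcpyHostToDevice);

    printf("pageable: %.3f ms per copy\n",
           time_h2d_copy(dev_buf, pageable, bytes, reps));
    printf("pinned:   %.3f ms per copy\n",
           time_h2d_copy(dev_buf, pinned, bytes, reps));

    cudaFreeHost(pinned);
    cudaFree(dev_buf);
    free(pageable);
    return 0;
}
```

The actual numbers depend on the system, and with very small transfers any difference is likely to be hard to see.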
The “legacy” default stream semantics require that all operations (regardless of their stream) issued to the device prior to the operation issued into the default stream must complete before the operation issued into the default stream can begin. Likewise, operations issued to that device (regardless of stream) after the operation issued into the default stream, will not begin until the operation issued into the default stream has completed. So briefly, those operations you mention will serialize.
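A minimal sketch of that serialization, assuming a hypothetical kernel add_one and helper copy_then_kernel_default_stream (neither is from the original post): the copy, the kernel, and the copy back are all issued into the default stream, so each starts only after the previous one has completed, and the kernel operates on the newly copied data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Issues copy -> kernel -> copy back, all into the default stream.
// Returns the number of elements with an unexpected value, or -1 on error.
static int copy_then_kernel_default_stream(int n) {
    int *host = NULL;
    int *dev = NULL;
    const size_t bytes = (size_t)n * sizeof(int);

    if (cudaMallocHost(&host, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        cudaFreeHost(host);
        return -1;
    }
    for (int i = 0; i < n; ++i) host[i] = i;

    // All three operations go into stream 0 and run in issue order.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, 0);
    add_one<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, 0);
    cudaError_t err = cudaStreamSynchronize(0);  // wait before reading host

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != i + 1) ++mismatches;

    cudaFree(dev);
    cudaFreeHost(host);
    return err == cudaSuccess ? mismatches : -1;
}

int main(void) {
    printf("mismatches: %d\n", copy_then_kernel_default_stream(1 << 20));
    return 0;
}
```

If the kernel had run on stale data, the mismatch count would be nonzero; with in-order execution in a single stream it should be zero.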
If you issue a cudaMemcpyAsync HostToDevice into a non-default stream, from a pinned allocation, and then you issue a kernel call into a different non-default stream, the kernel call and the cudaMemcpyAsync operation may overlap. The precise arrangement is not specified by CUDA but instead depends on your exact application and what work is issued. But in any event, you run the risk of the kernel consuming data that has not yet been updated by that particular cudaMemcpyAsync operation.
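One common way to remove that risk while still using two streams is to record an event after the copy and have the kernel's stream wait on it. The sketch below assumes this is the ordering you want; the kernel add_one_k, the helper copy_and_kernel_two_streams, and the stream and event names are hypothetical and not from the original post. Without the cudaStreamWaitEvent call, nothing would order the kernel after the copy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element.
__global__ void add_one_k(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Copy in one stream, kernel in another, ordered by an event.
// Returns the number of elements with an unexpected value, or -1 on error.
static int copy_and_kernel_two_streams(int n) {
    int *host = NULL;
    int *dev = NULL;
    const size_t bytes = (size_t)n * sizeof(int);
    cudaStream_t copy_stream, compute_stream;
    cudaEvent_t copy_done;

    if (cudaMallocHost(&host, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        cudaFreeHost(host);
        return -1;
    }
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);
    cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);
    for (int i = 0; i < n; ++i) host[i] = i;

    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(copy_done, copy_stream);            // marks end of the copy
    cudaStreamWaitEvent(compute_stream, copy_done, 0);  // kernel waits for it
    add_one_k<<<(n + 255) / 256, 256, 0, compute_stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, compute_stream);
    cudaError_t err = cudaStreamSynchronize(compute_stream);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != i + 1) ++mismatches;

    cudaEventDestroy(copy_done);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return err == cudaSuccess ? mismatches : -1;
}

int main(void) {
    printf("mismatches: %d\n", copy_and_kernel_two_streams(1 << 20));
    return 0;
}
```

The event only orders this one kernel after this one copy; unrelated work in the two streams remains free to overlap.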
