cudaMemcpyAsync and pinned memory

Hello,

  1. I expected the transfer to take less time with pinned memory, but I'm getting almost the same time for both pageable and pinned memory.
    Why is it not making any difference?
#include <cuda_runtime.h>

int *host_p;
int *dev_p;

int main(void) {
    int data_size = 4 * sizeof(int);

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);   // created, but the copy below does not use it

    //host_p = (int *) malloc(data_size);
    cudaMallocHost(&host_p, data_size);   // pinned memory
    cudaMalloc(&dev_p, data_size);

    /* Transfer data host_p --> dev_p in the default stream (0) */
    cudaMemcpyAsync(dev_p, host_p, data_size, cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);   // wait for the copy before freeing its buffers

    //free(host_p);
    cudaFreeHost(host_p);
    cudaFree(dev_p);
    cudaStreamDestroy(stream1);

    return 0;
}

  2. If I issue a cudaMemcpyAsync followed by a kernel that uses the same data, both in the default stream,
    will these two operations be serialized, or will the kernel execute with the previous (stale) data?

  3. If I use different streams for the above two operations, what kind of behaviour can we expect?

  1. It's not clear what you are timing. If you are timing the whole code, and you are only using the pinned memory once, then the cost to pin the memory will offset the time savings associated with the improved speed of the cudaMemcpyAsync operation. If you reuse the pinned buffer many times, you will see the speed benefit at the application level. If you are carefully timing just the copy operation, in general it should be faster from a pinned buffer.

  2. The “legacy” default stream semantics require that all operations (regardless of their stream) issued to the device prior to the operation issued into the default stream must complete before the operation issued into the default stream can begin. Likewise, operations issued to that device (regardless of stream) after the operation issued into the default stream will not begin until the operation issued into the default stream has completed. (Streams created with the cudaStreamNonBlocking flag are exempt from this implicit synchronization.) So briefly, those operations you mention will serialize.

  3. If you issue a cudaMemcpyAsync HostToDevice into a non-default stream, from a pinned allocation, and then you issue a kernel call into a different non-default stream, the kernel call and the cudaMemcpyAsync operation may overlap. The precise arrangement is not specified by CUDA but instead depends on your exact application and what work is issued. But in any event, you run the risk of the kernel consuming data that has not yet been written by that particular cudaMemcpyAsync operation.
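
To illustrate point 1, here is a minimal sketch that times only the copy, using CUDA events, for a pageable and a pinned source buffer. The helper name time_h2d_ms, the 64 MiB buffer size and the repetition count are assumptions chosen for illustration and are not from the original post. The 16-byte copy in the posted code would likely be dominated by fixed per-call overhead, so a larger buffer is assumed here. Error checking is mostly omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: average time in milliseconds of `reps`
// host-to-device copies issued into the default stream.
static float time_h2d_ms(void *dst, const void *src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until all copies have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main(void) {
    const size_t bytes = 64u << 20;   // 64 MiB (assumed size)
    const int reps = 20;              // assumed repetition count

    void *pageable = malloc(bytes);
    void *pinned = NULL;
    void *dev = NULL;
    if (!pageable ||
        cudaMallocHost(&pinned, bytes) != cudaSuccess ||
        cudaMalloc(&dev, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(pageable, 0, bytes);
    memset(pinned, 0, bytes);

    // Warm-up copy so one-time initialization is not included in the timing.
    time_h2d_ms(dev, pinned, bytes, 1);

    printf("pageable: %.3f ms per copy\n", time_h2d_ms(dev, pageable, bytes, reps));
    printf("pinned:   %.3f ms per copy\n", time_h2d_ms(dev, pinned, bytes, reps));

    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```

The pinned allocation is made once and reused for every copy, which is the usage pattern where the benefit is expected to show up; the actual numbers depend on the GPU and the host system.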
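
For points 2 and 3, the following sketch runs a copy followed by a kernel on the same data, first with both operations in the default stream and then in two separate streams. In the two-stream case the dependency is made explicit with cudaEventRecord and cudaStreamWaitEvent, which is one common way to avoid the race described in point 3; it is not the only option. The kernel add_one and the helper copy_then_kernel are hypothetical names introduced for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: increments every element by one.
__global__ void add_one(int *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

// Hypothetical helper: copies data to the device, runs add_one on it,
// copies the result back, and returns the number of wrong elements.
static int copy_then_kernel(int n, bool use_two_streams) {
    size_t bytes = n * sizeof(int);
    int *host = NULL, *dev = NULL;
    cudaMallocHost(&host, bytes);   // pinned, so the copies can be asynchronous
    cudaMalloc(&dev, bytes);
    for (int i = 0; i < n; ++i) host[i] = i;

    cudaStream_t copy_s = 0, kern_s = 0;   // 0 is the default stream
    cudaEvent_t copied;
    cudaEventCreateWithFlags(&copied, cudaEventDisableTiming);
    if (use_two_streams) {
        cudaStreamCreate(&copy_s);
        cudaStreamCreate(&kern_s);
    }

    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, copy_s);
    if (use_two_streams) {
        // Without these two calls the kernel could start before the copy ends.
        cudaEventRecord(copied, copy_s);
        cudaStreamWaitEvent(kern_s, copied, 0);
    }
    // Within a single stream, operations run in issue order.
    add_one<<<(n + 255) / 256, 256, 0, kern_s>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, kern_s);
    cudaStreamSynchronize(kern_s);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != i + 1) ++bad;

    if (use_two_streams) {
        cudaStreamDestroy(copy_s);
        cudaStreamDestroy(kern_s);
    }
    cudaEventDestroy(copied);
    cudaFreeHost(host);
    cudaFree(dev);
    return bad;
}

int main(void) {
    printf("default stream mismatches: %d\n", copy_then_kernel(1 << 20, false));
    printf("two streams mismatches:    %d\n", copy_then_kernel(1 << 20, true));
    return 0;
}
```

Both cases should report no mismatches on a working device. Removing the event record and wait in the two-stream case would reintroduce the race, although the wrong result may not show up on every run.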