Cannot overlap an H2D copy with a cuBLAS GEMM call

Hi!

I am trying to overlap a cuBLAS GEMM call with D2H and H2D copies. However, the GEMM kernel does not start executing until the H2D copy finishes. I believe I have avoided the common pitfalls (not using the async copy functions, not allocating pinned host memory).

Here is the code:

#include <cuda_runtime.h>
#include <cublas_v2.h>

const int n = 1000;

// pinned host buffer and device buffer for the H2D transfer
float* data_in_host;
float* data_in_device;

cudaHostAlloc( (void**)&data_in_host, n*n*sizeof(float), cudaHostAllocPortable );
cudaMalloc( (void**)&data_in_device, n*n*sizeof(float) );

// pinned host buffer and device buffer for the D2H transfer
float* data_out_host;
float* data_out_device;

cudaHostAlloc( (void**)&data_out_host, n*n*sizeof(float), cudaHostAllocPortable );
cudaMalloc( (void**)&data_out_device, n*n*sizeof(float) );

// device-only buffers used by the GEMM
float* data_proc_A_device;
float* data_proc_B_device;
float* data_proc_C_device;

cudaMalloc( (void**)&data_proc_A_device, n*n*sizeof(float) );
cudaMalloc( (void**)&data_proc_B_device, n*n*sizeof(float) );
cudaMalloc( (void**)&data_proc_C_device, n*n*sizeof(float) );

// separate streams for the H2D copy, the D2H copy, and the compute
cudaStream_t stream_in;
cudaStream_t stream_out;
cudaStream_t stream_proc;

cudaStreamCreate( &stream_in );
cudaStreamCreate( &stream_out );
cudaStreamCreate( &stream_proc );

cublasHandle_t handle;

cublasCreate( &handle );
cublasSetStream( handle, stream_proc );

float al = 1;
float be = 1;

// ====================================================

cudaMemcpyAsync( data_in_device, data_in_host, n*n*sizeof(float), cudaMemcpyHostToDevice, stream_in );
// cudaMemcpyAsync( data_out_host, data_out_device, n*n*sizeof(float), cudaMemcpyDeviceToHost, stream_out );
cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &al, data_proc_A_device, n, data_proc_B_device, n, &be, data_proc_C_device, n );

// ====================================================

cublasDestroy( handle );

cudaStreamDestroy( stream_in );
cudaStreamDestroy( stream_out );
cudaStreamDestroy( stream_proc );

cudaFreeHost( data_in_host );
cudaFree( data_in_device );

cudaFreeHost( data_out_host );
cudaFree( data_out_device );

cudaFree( data_proc_A_device );
cudaFree( data_proc_B_device );
cudaFree( data_proc_C_device );

Interestingly, the GEMM execution only waits for the H2D transfer; it can successfully run in parallel with a D2H copy.
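For reference, the variant that does overlap on my machine is the one with the commented-out D2H line above enabled (same buffers and streams as in the listing), roughly:

cudaMemcpyAsync( data_out_host, data_out_device, n*n*sizeof(float), cudaMemcpyDeviceToHost, stream_out );
cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &al, data_proc_A_device, n, data_proc_B_device, n, &be, data_proc_C_device, n );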

Here is my deviceQuery output:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          9.1 / 9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1721 MHz (1.72 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 8 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          9.1 / 9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11176 MBytes (11719409664 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1721 MHz (1.72 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 65 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : Yes
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : Yes


Thank you for the quick response!

Yes, I am aware of WDDM batching on Windows; the results above are from Ubuntu 18.04.
If I run the same script under nvprof, I get the following:

$ sudo CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/bin/nvprof --print-gpu-trace /usr/lib/R/bin/Rscript ./Samples/wip_overlap_bug.R
==76217== NVPROF is profiling process 76217, command: /usr/lib/R/bin/exec/R --slave --no-restore --file=./Samples/wip_overlap_bug.R
==76217== Profiling application: /usr/lib/R/bin/exec/R --slave --no-restore --file=./Samples/wip_overlap_bug.R
==76217== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
487.08ms  1.1200us                    -               -         -         -         -      112B  95.367MB/s    Pageable      Device  GeForce GTX 108         1         7  [CUDA memcpy HtoD]
487.37ms  307.05us                    -               -         -         -         -  3.8147MB  12.133GB/s      Pinned      Device  GeForce GTX 108         1        14  [CUDA memcpy HtoD]
487.69ms  1.0880us                    -               -         -         -         -      512B  448.79MB/s      Device           -  GeForce GTX 108         1        16  [CUDA memset]
487.69ms  245.99us             (8 16 2)       (128 1 1)       120  12.250KB        0B         -           -           -           -  GeForce GTX 108         1        16  maxwell_sgemm_128x64_nn [272]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

Sorry about the command; you will have to trust me that the only thing the script does is call the code above. As you can see, the kernel starts only after the H2D transfer finishes, so the two do not overlap. I have two remarks:

  • If I call the kernel first, and then the async copy, they properly overlap.
  • Why do I have an extra memset before the kernel launch?

Here is the nvprof output when calling the kernel first:

$ sudo CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/bin/nvprof --print-gpu-trace /usr/lib/R/bin/Rscript ./Samples/wip_overlap_bug.R
==76625== NVPROF is profiling process 76625, command: /usr/lib/R/bin/exec/R --slave --no-restore --
==76625== Profiling application: /usr/lib/R/bin/exec/R --slave --no-restore --file=./Samples/wip_overlap_bug.R
==76625== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
483.52ms  1.1200us                    -               -         -         -         -      112B  95.367MB/s    Pageable      Device  GeForce GTX 108         1         7  [CUDA memcpy HtoD]
483.82ms     992ns                    -               -         -         -         -      512B  492.22MB/s      Device           -  GeForce GTX 108         1        16  [CUDA memset]
483.86ms  246.66us             (8 16 2)       (128 1 1)       120  12.250KB        0B         -           -           -           -  GeForce GTX 108         1        16  maxwell_sgemm_128x64_nn [271]
483.87ms  306.02us                    -               -         -         -         -  3.8147MB  12.173GB/s      Pinned      Device  GeForce GTX 108         1        14  [CUDA memcpy HtoD]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

I can confirm the observation; I am able to reproduce it as well.

I believe it comes about because, although the cudaMemsetAsync operation is issued into the same stream as the cublas kernels (and therefore a different stream than your cudaMemcpyAsync call), it is in the same PCIe "direction" as the previously issued cudaMemcpyAsync call. Therefore it cannot begin until that cudaMemcpyAsync call completes: it is not possible to have two outstanding transfers in the same direction on PCIe with only a single copy engine for that direction.

Since the subsequent cublas kernel calls are issued into the same stream as the cudaMemsetAsync call, they cannot begin until that call completes. Therefore everything ends up serialized.

You can work around this in this test case by reversing the order of the cublas and cudaMemcpyAsync calls:

cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &al, data_proc_A_device, n, data_proc_B_device, n, &be, data_proc_C_device, n );
cudaMemcpyAsync( data_in_device, data_in_host, n*n*sizeof(float), cudaMemcpyHostToDevice, stream_in );

For the typical "pipelined" usage of cublas gemm activity that I am familiar with, I think this should be feasible, and I don't think this issue order should disrupt the pipelined activity.
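As a rough sketch of what I mean (hypothetical buffers, not from your code): assume num_chunks tiles of n*n floats, device arrays chunk_A/chunk_B/chunk_C/chunk_in[i] allocated with cudaMalloc, pinned host arrays host_in[i] allocated with cudaHostAlloc, and the cublas handle bound to stream_proc as before. Issuing the gemm for the resident tile before the copy that stages the next tile keeps the kernel ahead of the copy in issue order:

for (int i = 0; i < num_chunks; ++i) {
    // compute on the tile that is already resident on the device
    cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &al,
                 chunk_A[i], n, chunk_B[i], n, &be, chunk_C[i], n );

    // only afterwards issue the H2D copy that stages the next tile,
    // so no copy ever precedes a kernel in issue order
    if (i + 1 < num_chunks)
        cudaMemcpyAsync( chunk_in[i+1], host_in[i+1], n*n*sizeof(float),
                         cudaMemcpyHostToDevice, stream_in );
}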

Otherwise, if this is objectionable, I suggest filing a bug at developer.nvidia.com.

Thanks for the explanation! The problem with the workaround is that in my real scenario many transfers and GEMM launches are issued, so some kernel launches end up synchronizing with some H2D transfers more or less at random.

I find this behavior unintuitive: I don't see why a memset operation needs to wait for the PCIe bus, or why larger gemm launches issue memsets at all. I will file a bug report when I get back.
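If it helps, a minimal repro along these lines (untested sketch, reusing the buffers and streams from the listing above, with the cublas call replaced by a plain 512-byte cudaMemsetAsync mimicking the one cuBLAS issues) might make the report easier to act on; if the explanation above is right, the memset should still wait for the H2D copy even though cuBLAS is not involved at all:

cudaMemcpyAsync( data_in_device, data_in_host, n*n*sizeof(float), cudaMemcpyHostToDevice, stream_in );
cudaMemsetAsync( data_proc_C_device, 0, 512, stream_proc );  // different stream, same "direction"?
cudaDeviceSynchronize();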