Copy-Compute Overlap Performance

I’ve been testing some code on a GPU and found a surprising performance result when processing data in two CUDA streams on a V100. In every example I’ve seen where two streams are used to overlap compute and copy operations, each stream processes independent data, and my code was originally structured that way. As an experiment, I rearranged the code so there is a dedicated compute stream and a dedicated copy stream. When I did this, my code ran twice as fast.

I have no idea why this would happen. My problem has a typical copy host to device / compute / copy device to host pattern, but the host-to-device copy takes a negligible amount of time and the device-to-host copy takes about the same time as the compute operation.

So my first arrangement, in pseudocode, was:

for i = 1 to num_iterations
    stream = streams[i % num_streams] // num_streams = 2
    copy input data from host to device in stream
    compute in stream
    copy output data from device to host in stream

Each iteration would alternate streams. I did not issue like operations together (all copies first, then all kernels, then all copies back), as some blogs suggest, because I’ve read that this isn’t necessary on newer GPUs.
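To make arrangement 1 concrete, here is a minimal sketch of how I issue the work. The kernel, buffer names, and sizes are placeholders rather than my actual code:

#include <cuda_runtime.h>

__global__ void my_kernel(const float *in, float *out, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;   // stand-in for the real compute
}

void arrangement_1(int num_iterations){
  const int n = 1 << 20;
  const int num_streams = 2;
  cudaStream_t streams[num_streams];
  for (int s = 0; s < num_streams; s++) cudaStreamCreate(&streams[s]);

  // real code uses separate buffers per stream; reused here just to show the issue order
  float *h_in, *h_out, *d_in, *d_out;
  cudaHostAlloc(&h_in,  n * sizeof(float), cudaHostAllocDefault);  // pinned, required for async overlap
  cudaHostAlloc(&h_out, n * sizeof(float), cudaHostAllocDefault);
  cudaMalloc(&d_in,  n * sizeof(float));
  cudaMalloc(&d_out, n * sizeof(float));

  for (int i = 0; i < num_iterations; i++){
    cudaStream_t st = streams[i % num_streams];  // alternate streams each iteration
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, st);    // negligible
    my_kernel<<<(n + 255) / 256, 256, 0, st>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, st);  // ~ kernel duration
  }
  cudaDeviceSynchronize();
}

The "like operations together" order I skipped would instead issue all the host-to-device copies, then all the kernels, then all the device-to-host copies, still alternating streams within each loop, e.g. (same placeholder buffers as above):

  for (int i = 0; i < num_iterations; i++)
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, streams[i % num_streams]);
  for (int i = 0; i < num_iterations; i++)
    my_kernel<<<(n + 255) / 256, 256, 0, streams[i % num_streams]>>>(d_in, d_out, n);
  for (int i = 0; i < num_iterations; i++)
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, streams[i % num_streams]);
  cudaDeviceSynchronize();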

My second arrangement was:

for i = 1 to num_iterations
    copy input data from host to device in compute stream // takes almost no time
    compute in compute stream
    synchronize to prevent the copy stream from starting too soon
    copy output data from device to host in copy stream

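A matching sketch of arrangement 2, reusing the placeholder kernel and buffers from the sketch above (again, not my actual code):

void arrangement_2(int num_iterations, float *h_in, float *h_out, float *d_in, float *d_out, int n){
  cudaStream_t compute_stream, copy_stream;
  cudaStreamCreate(&compute_stream);
  cudaStreamCreate(&copy_stream);

  for (int i = 0; i < num_iterations; i++){
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, compute_stream); // takes almost no time
    my_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_in, d_out, n);
    cudaStreamSynchronize(compute_stream);  // host blocks until the kernel is done
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, copy_stream);  // can overlap the next iteration's kernel
  }
  cudaDeviceSynchronize();
}

(In real code the output would have to be double-buffered so the in-flight device-to-host copy doesn't race the next iteration's kernel writing d_out; I've left that out to keep the sketch short.)
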
Does anyone have good intuition for why the second arrangement would be twice as fast? Is this indicative of a bug or error I haven’t discovered yet?

Basically, I am suspicious because it worked so well and I’ve never seen any examples of streams being organized this way.

Which previous copy, exactly? Define this step precisely. What API call(s) are you using?

I think I am wrong there. There is only synchronization to prevent the copy stream from starting too soon. I am using cudaStreamSynchronize(compute_stream). I updated my previous post.
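
For what it's worth, I assume the same ordering could also be enforced on the GPU side with an event instead of blocking the host, something like this (same placeholder names as in my sketches above; this is not what I'm currently doing):

  cudaEvent_t compute_done;
  cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming);

  for (int i = 0; i < num_iterations; i++){
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, compute_stream);
    my_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_in, d_out, n);
    cudaEventRecord(compute_done, compute_stream);      // marks the end of this iteration's kernel
    cudaStreamWaitEvent(copy_stream, compute_done, 0);  // copy stream waits on the GPU, not the host
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, copy_stream);
  }
  cudaDeviceSynchronize();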

I don’t see anything like a 2x speedup:

$ cat t1423.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const long long delay_t = 42588075LL;
// busy-wait kernel: spins for ~delay_t clock cycles as a stand-in for real compute work
__global__ void k(){

  long long start = clock64();
  while (clock64() < start + delay_t);
}

const int default_iter = 10;
const int ds = 1000000;   // H2D transfer: 1M ints (~4 MB), negligible
const int ratio  = 100;   // D2H transfer is 100x larger (~400 MB), roughly comparable to the kernel time
const int num_streams = 2;

int main(int argc, char *argv[]){

  int iter = default_iter;
  if (argc > 1) iter = atoi(argv[1]);
  cudaStream_t s[num_streams];
  for (int i = 0; i < num_streams; i++)
    cudaStreamCreate(&s[i]);
  int *d1, *d2, *h1, *h2;
  cudaMalloc(&d1, ds*sizeof(d1[0]));
  cudaMalloc(&d2, ratio*ds*sizeof(d2[0]));
  // pinned host buffers so cudaMemcpyAsync can overlap with kernel execution
  cudaHostAlloc(&h1, ds*sizeof(h1[0]), cudaHostAllocDefault);
  cudaHostAlloc(&h2, ratio*ds*sizeof(h2[0]), cudaHostAllocDefault);

  unsigned long long t1 = dtime_usec(0);
  // method 1: issue the whole H2D / kernel / D2H chain round-robin across the two streams
  for (int i = 0; i < iter; i++){
    cudaMemcpyAsync(d1, h1, ds*sizeof(d1[0]), cudaMemcpyHostToDevice, s[i%num_streams]);
    k<<<1,1,0,s[i%num_streams]>>>();
    cudaMemcpyAsync(h2, d2, ratio*ds*sizeof(d2[0]), cudaMemcpyDeviceToHost, s[i%num_streams]);
    }
  cudaDeviceSynchronize();
  t1 = dtime_usec(t1);
  printf("method 1 total time: %lu microseconds\n", t1);

  t1 = dtime_usec(0);
  // method 2: H2D copy + kernel in s[0], host-side sync, then D2H copy in s[1]
  for (int i = 0; i < iter; i++){
    cudaMemcpyAsync(d1, h1, ds*sizeof(d1[0]), cudaMemcpyHostToDevice, s[0]);
    k<<<1,1,0,s[0]>>>();
    cudaStreamSynchronize(s[0]);
    cudaMemcpyAsync(h2, d2, ratio*ds*sizeof(d2[0]), cudaMemcpyDeviceToHost, s[1]);
    }
  cudaDeviceSynchronize();
  t1 = dtime_usec(t1);
  printf("method 2 total time: %lu microseconds\n", t1);
}
$ nvcc -arch=sm_70 -o t1423 t1423.cu
$ ./t1423 2
method 1 total time: 94611 microseconds
method 2 total time: 94961 microseconds
$ ./t1423 3
method 1 total time: 125426 microseconds
method 2 total time: 125457 microseconds
$ ./t1423 4
method 1 total time: 156850 microseconds
method 2 total time: 156973 microseconds
$ ./t1423 10
method 1 total time: 345942 microseconds
method 2 total time: 346086 microseconds
$ ./t1423 20
method 1 total time: 672051 microseconds
method 2 total time: 661013 microseconds
$
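
If you want to see exactly what is (or isn't) overlapping on your machine, a GPU trace from the profiler will show the start time and duration of every kernel and memcpy in both methods, e.g.:

$ nvprof --print-gpu-trace ./t1423 4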

CUDA 10, Tesla V100, CentOS 7

Using Robert Crovella’s code, Quadro P2000 on Win 7 Professional with WDDM driver:

method 1 total time: 1.27161e+006 microseconds
method 2 total time: 1.46316e+006 microseconds

After I reduced the kernel delay by a factor of 10 to account for the much slower GPU, I got:
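
That is, the only change was along the lines of:

const long long delay_t = 4258807LL;  // original 42588075 reduced ~10x for the slower GPU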

C:\Users\Norbert\My Programs>overlap 3
method 1 total time: 378569 microseconds
method 2 total time: 376587 microseconds

C:\Users\Norbert\My Programs>overlap 4
method 1 total time: 499199 microseconds
method 2 total time: 507514 microseconds

C:\Users\Norbert\My Programs>overlap 5
method 1 total time: 626599 microseconds
method 2 total time: 634480 microseconds

C:\Users\Norbert\My Programs>overlap 6
method 1 total time: 747044 microseconds
method 2 total time: 759295 microseconds

C:\Users\Norbert\My Programs>overlap 7
method 1 total time: 862392 microseconds
method 2 total time: 889221 microseconds

C:\Users\Norbert\My Programs>overlap 1
method 1 total time: 125519 microseconds
method 2 total time: 125478 microseconds

C:\Users\Norbert\My Programs>overlap 2
method 1 total time: 245150 microseconds
method 2 total time: 249437 microseconds

C:\Users\Norbert\My Programs>overlap 3
method 1 total time: 366041 microseconds
method 2 total time: 371544 microseconds

C:\Users\Norbert\My Programs>overlap 4
method 1 total time: 487535 microseconds
method 2 total time: 495423 microseconds

C:\Users\Norbert\My Programs>overlap 5
method 1 total time: 611905 microseconds
method 2 total time: 619152 microseconds

C:\Users\Norbert\My Programs>overlap 6
method 1 total time: 734963 microseconds
method 2 total time: 743045 microseconds

C:\Users\Norbert\My Programs>overlap 7
method 1 total time: 857447 microseconds
method 2 total time: 866784 microseconds

C:\Users\Norbert\My Programs>overlap 20
method 1 total time: 2.45959e+006 microseconds
method 2 total time: 2.47688e+006 microseconds