Copy-Compute Overlap Performance

I’ve been testing some code on a GPU and found a surprising performance result when processing data in two CUDA streams on a V100. In every example I’ve seen where two streams are used to overlap compute and copy operations, each stream processes independent data, and my code was originally structured that way. As an experiment, I rearranged the code so there is a dedicated compute stream and a dedicated copy stream. When I did this, my code ran twice as fast.

I have no idea why this would happen. My problem has a typical copy host to device / compute / copy device to host pattern, but the host-to-device copy takes a negligible amount of time and the device-to-host copy takes about the same time as the compute operation.

So my first arrangement, in pseudocode, was:

for i = 1 to num_iterations
    stream = streams[i % num_streams] // num_streams = 2
    copy input data from host to device in stream
    compute in stream
    copy output data from device to host in stream

Each iteration would alternate streams. I did not issue like operations together (all copies first, then all kernels, then all copies back), as some blogs suggest, because I’ve read that this isn’t necessary on newer GPUs.
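To make arrangement 1 concrete, here is a minimal sketch of how I issue the work. The kernel, buffer names, and sizes are placeholders rather than my actual code:

#include <cuda_runtime.h>

__global__ void my_kernel(const float *in, float *out, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;   // stand-in for the real compute
}

void arrangement_1(int num_iterations){
  const int n = 1 << 20;
  const int num_streams = 2;
  cudaStream_t streams[num_streams];
  for (int s = 0; s < num_streams; s++) cudaStreamCreate(&streams[s]);

  // real code uses separate buffers per stream; reused here just to show the issue order
  float *h_in, *h_out, *d_in, *d_out;
  cudaHostAlloc(&h_in,  n * sizeof(float), cudaHostAllocDefault);  // pinned, required for async overlap
  cudaHostAlloc(&h_out, n * sizeof(float), cudaHostAllocDefault);
  cudaMalloc(&d_in,  n * sizeof(float));
  cudaMalloc(&d_out, n * sizeof(float));

  for (int i = 0; i < num_iterations; i++){
    cudaStream_t st = streams[i % num_streams];  // alternate streams each iteration
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, st);    // negligible
    my_kernel<<<(n + 255) / 256, 256, 0, st>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, st);  // ~ kernel duration
  }
  cudaDeviceSynchronize();
}

The "like operations together" order I skipped would instead issue all the host-to-device copies, then all the kernels, then all the device-to-host copies, still alternating streams within each loop, e.g. (same placeholder buffers as above):

  for (int i = 0; i < num_iterations; i++)
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, streams[i % num_streams]);
  for (int i = 0; i < num_iterations; i++)
    my_kernel<<<(n + 255) / 256, 256, 0, streams[i % num_streams]>>>(d_in, d_out, n);
  for (int i = 0; i < num_iterations; i++)
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, streams[i % num_streams]);
  cudaDeviceSynchronize();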

My second arrangement was:

for i = 1 to num_iterations
    copy input data from host to device in compute stream // takes almost no time
    compute in compute stream
    synchronize to prevent the copy stream from starting too soon
    copy output data from device to host in copy stream

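A matching sketch of arrangement 2, reusing the placeholder kernel and buffers from the sketch above (again, not my actual code):

void arrangement_2(int num_iterations, float *h_in, float *h_out, float *d_in, float *d_out, int n){
  cudaStream_t compute_stream, copy_stream;
  cudaStreamCreate(&compute_stream);
  cudaStreamCreate(&copy_stream);

  for (int i = 0; i < num_iterations; i++){
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, compute_stream); // takes almost no time
    my_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_in, d_out, n);
    cudaStreamSynchronize(compute_stream);  // host blocks until the kernel is done
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, copy_stream);  // can overlap the next iteration's kernel
  }
  cudaDeviceSynchronize();
}

(In real code the output would have to be double-buffered so the in-flight device-to-host copy doesn't race the next iteration's kernel writing d_out; I've left that out to keep the sketch short.)
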
Does anyone have good intuition for why the second arrangement would be twice as fast? Is this indicative of a bug or error I haven’t discovered yet?

Basically, I am suspicious because it worked so well and I’ve never seen any examples of streams being organized this way.

Which previous copy, exactly? Define this step precisely. What API call(s) are you using?

I think I am wrong there. There is only synchronization to prevent the copy stream from starting too soon. I am using cudaStreamSynchronize(compute_stream). I updated my previous post.
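
For what it's worth, I assume the same ordering could also be enforced on the GPU side with an event instead of blocking the host, something like this (same placeholder names as in my sketches above; this is not what I'm currently doing):

  cudaEvent_t compute_done;
  cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming);

  for (int i = 0; i < num_iterations; i++){
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, compute_stream);
    my_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_in, d_out, n);
    cudaEventRecord(compute_done, compute_stream);      // marks the end of this iteration's kernel
    cudaStreamWaitEvent(copy_stream, compute_done, 0);  // copy stream waits on the GPU, not the host
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, copy_stream);
  }
  cudaDeviceSynchronize();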

I don’t see anything like a 2x speedup:

$ cat t1423.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const long long delay_t = 42588075LL;
// busy-wait kernel: spins for ~delay_t clock cycles as a stand-in for real compute work
__global__ void k(){

  long long start = clock64();
  while (clock64() < start + delay_t);
}

const int default_iter = 10;
const int ds = 1000000;   // H2D transfer: 1M ints (~4 MB), negligible
const int ratio  = 100;   // D2H transfer is 100x larger (~400 MB), roughly comparable to the kernel time
const int num_streams = 2;

int main(int argc, char *argv[]){

  int iter = default_iter;
  if (argc > 1) iter = atoi(argv[1]);
  cudaStream_t s[num_streams];
  for (int i = 0; i < num_streams; i++)
    cudaStreamCreate(&s[i]);
  int *d1, *d2, *h1, *h2;
  cudaMalloc(&d1, ds*sizeof(d1[0]));
  cudaMalloc(&d2, ratio*ds*sizeof(d2[0]));
  // pinned host buffers so cudaMemcpyAsync can overlap with kernel execution
  cudaHostAlloc(&h1, ds*sizeof(h1[0]), cudaHostAllocDefault);
  cudaHostAlloc(&h2, ratio*ds*sizeof(h2[0]), cudaHostAllocDefault);

  unsigned long long t1 = dtime_usec(0);
  // method 1: issue the whole H2D / kernel / D2H chain round-robin across the two streams
  for (int i = 0; i < iter; i++){
    cudaMemcpyAsync(d1, h1, ds*sizeof(d1[0]), cudaMemcpyHostToDevice, s[i%num_streams]);
    k<<<1,1,0,s[i%num_streams]>>>();
    cudaMemcpyAsync(h2, d2, ratio*ds*sizeof(d2[0]), cudaMemcpyDeviceToHost, s[i%num_streams]);
    }
  cudaDeviceSynchronize();
  t1 = dtime_usec(t1);
  printf("method 1 total time: %lu microseconds\n", t1);

  t1 = dtime_usec(0);
  // method 2: H2D copy + kernel in s[0], host-side sync, then D2H copy in s[1]
  for (int i = 0; i < iter; i++){
    cudaMemcpyAsync(d1, h1, ds*sizeof(d1[0]), cudaMemcpyHostToDevice, s[0]);
    k<<<1,1,0,s[0]>>>();
    cudaStreamSynchronize(s[0]);
    cudaMemcpyAsync(h2, d2, ratio*ds*sizeof(d2[0]), cudaMemcpyDeviceToHost, s[1]);
    }
  cudaDeviceSynchronize();
  t1 = dtime_usec(t1);
  printf("method 2 total time: %lu microseconds\n", t1);
}
$ nvcc -arch=sm_70 -o t1423 t1423.cu
$ ./t1423 2
method 1 total time: 94611 microseconds
method 2 total time: 94961 microseconds
$ ./t1423 3
method 1 total time: 125426 microseconds
method 2 total time: 125457 microseconds
$ ./t1423 4
method 1 total time: 156850 microseconds
method 2 total time: 156973 microseconds
$ ./t1423 10
method 1 total time: 345942 microseconds
method 2 total time: 346086 microseconds
$ ./t1423 20
method 1 total time: 672051 microseconds
method 2 total time: 661013 microseconds
$
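
If you want to see exactly what is (or isn't) overlapping on your machine, a GPU trace from the profiler will show the start time and duration of every kernel and memcpy in both methods, e.g.:

$ nvprof --print-gpu-trace ./t1423 4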

CUDA 10, Tesla V100, CentOS 7

Using Robert Crovella’s code, Quadro P2000 on Win 7 Professional with WDDM driver:

method 1 total time: 1.27161e+006 microseconds
method 2 total time: 1.46316e+006 microseconds

After I reduced the kernel delay by a factor of 10 to account for the much slower GPU, I got:
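
That is, the only change was along the lines of:

const long long delay_t = 4258807LL;  // original 42588075 reduced ~10x for the slower GPU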

C:\Users\Norbert\My Programs>overlap 3
method 1 total time: 378569 microseconds
method 2 total time: 376587 microseconds

C:\Users\Norbert\My Programs>overlap 4
method 1 total time: 499199 microseconds
method 2 total time: 507514 microseconds

C:\Users\Norbert\My Programs>overlap 5
method 1 total time: 626599 microseconds
method 2 total time: 634480 microseconds

C:\Users\Norbert\My Programs>overlap 6
method 1 total time: 747044 microseconds
method 2 total time: 759295 microseconds

C:\Users\Norbert\My Programs>overlap 7
method 1 total time: 862392 microseconds
method 2 total time: 889221 microseconds

C:\Users\Norbert\My Programs>overlap 1
method 1 total time: 125519 microseconds
method 2 total time: 125478 microseconds

C:\Users\Norbert\My Programs>overlap 2
method 1 total time: 245150 microseconds
method 2 total time: 249437 microseconds

C:\Users\Norbert\My Programs>overlap 3
method 1 total time: 366041 microseconds
method 2 total time: 371544 microseconds

C:\Users\Norbert\My Programs>overlap 4
method 1 total time: 487535 microseconds
method 2 total time: 495423 microseconds

C:\Users\Norbert\My Programs>overlap 5
method 1 total time: 611905 microseconds
method 2 total time: 619152 microseconds

C:\Users\Norbert\My Programs>overlap 6
method 1 total time: 734963 microseconds
method 2 total time: 743045 microseconds

C:\Users\Norbert\My Programs>overlap 7
method 1 total time: 857447 microseconds
method 2 total time: 866784 microseconds

C:\Users\Norbert\My Programs>overlap 20
method 1 total time: 2.45959e+006 microseconds
method 2 total time: 2.47688e+006 microseconds