CUDA concurrency problem - multi-GPU vector add

Hi,

I am trying to optimize a CUDA application (multi-stream, multi-GPU) and it would be great if somebody could explain the profile. What troubles me is that the work does not start on all GPUs at the same time, even though I expect it to. Here's a quick summary of what the code does:

cudaSetDevice(0);
cudaStreamCreate(&stream01);
cudaStreamCreate(&stream02);

cudaSetDevice(1);
cudaStreamCreate(&stream11);
cudaStreamCreate(&stream12);

cudaSetDevice(0);
cudaMemcpyAsync(... H2D, stream01); // assume transfer from host pinned mem to dev
kernel <<<grid, block, bytes, stream01>>>();
cudaMemcpyAsync(... D2H, stream01); // device-to-host copy back, same pinned host memory assumption

cudaMemcpyAsync(... H2D, stream02);
kernel <<<grid, block, bytes, stream02>>>();
cudaMemcpyAsync(... D2H, stream02);

cudaSetDevice(1);
cudaMemcpyAsync(... H2D, stream11);
kernel <<<grid, block, bytes, stream11>>>();
cudaMemcpyAsync(... D2H, stream11);

cudaMemcpyAsync(... H2D, stream12);
kernel <<<grid, block, bytes, stream12>>>();
cudaMemcpyAsync(... D2H, stream12);

Assume the kernel just adds 8 vectors into one result vector in the obvious way. I want all GPUs to start simultaneously, but they do not once there are "many streams per device". See the profiles below for 4, 16, and 64 streams respectively.
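For reference, here is a minimal sketch of what I mean by the kernel and the per-stream issue sequence (names like addEight and issueWork are placeholders I made up here, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: element-wise sum of 8 input vectors into out.
__global__ void addEight(const float *a, const float *b, const float *c,
                         const float *d, const float *e, const float *f,
                         const float *g, const float *h, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + b[i] + c[i] + d[i] + e[i] + f[i] + g[i] + h[i];
}

// Per-stream issue sequence, as in the summary above: H2D copies,
// kernel launch, D2H copy, all asynchronous on the same stream.
// hIn/hOut must be pinned (cudaMallocHost / cudaHostAlloc) for the
// copies to actually be asynchronous with respect to the host.
void issueWork(cudaStream_t s, float **dIn, float *dOut,
               float **hIn, float *hOut, int n)
{
    for (int v = 0; v < 8; ++v)
        cudaMemcpyAsync(dIn[v], hIn[v], n * sizeof(float),
                        cudaMemcpyHostToDevice, s);
    addEight<<<(n + 255) / 256, 256, 0, s>>>(
        dIn[0], dIn[1], dIn[2], dIn[3],
        dIn[4], dIn[5], dIn[6], dIn[7], dOut, n);
    cudaMemcpyAsync(hOut, dOut, n * sizeof(float),
                    cudaMemcpyDeviceToHost, s);
}
```

The real code calls a function like this once per stream, with cudaSetDevice() switched to the stream's device first, before any synchronization.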



P.S. I'll gladly give more info if anybody wants it; I have summarized the code for convenience. All comments appreciated. Thanks in advance!