I’m trying to get concurrent copy and execute to work consistently with the Quadro 2000M, the 280.26 driver and CUDA 4.0 (the 285 drivers are much worse at the moment). Can anyone tell me what I’m doing wrong, what I’m missing, or why this is the expected behavior?
On each of 4 streams I issue memcpyHtoD, kernel, memcpyDtoH.
As a baseline, running all the work on stream 0 (so nothing overlaps) takes 6.6 ms.
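For reference, the setup behind all of the snippets looks roughly like this (a sketch; BUFSZ and the float element type are placeholders). The host buffers are allocated with cudaHostAlloc, since cudaMemcpyAsync on pageable memory wouldn’t be asynchronous at all:

cudaStream_t streams[4];
float *hdata1, *ddata1;   // ...and likewise hdata2-4, ddata2-4

for (int i = 0; i < 4; i++)
    cudaStreamCreate(&streams[i]);

// pinned host memory, required for async copies to overlap with GPU work
cudaHostAlloc((void **)&hdata1, BUFSZ, cudaHostAllocDefault);
cudaMalloc((void **)&ddata1, BUFSZ);
// ...same for buffers 2-4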
The following code runs in 7.93 ms, and Nsight shows no overlap at all:
// stream 0: upload, kernel, download, all issued back to back
cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);
kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);
cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);

// stream 1
cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);
kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);
cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);

// stream 2
cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);
kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);
cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);

// stream 3
cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);
kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);
cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);
The following code shows (almost) full overlap and runs in 5 ms:
// upload all four buffers first
cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);
cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);
cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);
cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);

// then kernel + download, stream by stream
kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);
cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);
kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);
cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);
kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);
cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);
kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);
cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);
The following code shows write overlap (memcpyHtoD) but no read overlap (memcpyDtoH) and runs in 6.17 ms:
// upload all four buffers
cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);
cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);
cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);
cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);

// launch all four kernels
kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);
kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);
kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);
kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);

// download all four buffers
cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);
cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);
cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);
cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);
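In case anyone wants to reproduce the numbers, here is a minimal sketch of how a whole batch can be timed with CUDA events; recording the stop event on stream 0 (the default stream) makes it wait for all prior work in every stream:

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ...issue the copies and kernel launches from one of the versions above...
cudaEventRecord(stop, 0);     // default stream: waits on all other streams
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);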
The way I see it, all three versions should behave the same, and the first one is what I’d actually want to do in a real program.
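In other words, what I’d really write is a loop like this (a sketch, with the buffers gathered into arrays):

for (int i = 0; i < 4; i++) {
    // issue all of one stream's work before moving to the next
    cudaMemcpyAsync(ddata[i], hdata[i], BUFSZ, cudaMemcpyHostToDevice, streams[i]);
    kernel <<< grid, block, 0, streams[i] >>> (ddata[i], ddata[i]);
    cudaMemcpyAsync(hdata[i], ddata[i], BUFSZ, cudaMemcpyDeviceToHost, streams[i]);
}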
Thanks