Asynchronous kernel execution and memory not overlapping using CUDA stream!

andyleow · July 6, 2017, 6:19pm

Hi,

I’m trying to use CUDA stream to build a processing pipeline that overlaps my kernel launch and memory transfer from and to GPU. Although the kernel has been distributed to various stream, all the processing was actually serialised and the memcpy is not overlapping the kernel execution at all. I wonder if someone could have a look and advice me on this? I’m using CUDA 8, gtx 980, and have added the --default-stream per thread command before compiling using visual studio.

Below is the profiling results and code to overlap the kernel execution and memory transfer.

Thanks!
Andy

https://drive.google.com/open?id=0B4VtavMHEP2LTENqQ3VwOTRXSzg

for (h = 0; h < numFrame; h++) {
		for (i = 0; i < numAngle; i++) {
			//cudaStreamSynchronize(stream[sID]);
			printf("Recon image %d", i + h*numAngle);
			offSetIn = (h*numAngle + i)*numSample*numChannel;
			offSetOut = (h*numAngle + i)*imageHeight*imageWidth;

			h_t0_wl = h_t0[i] * h_fs + 0.5f;

			myparms.srcPtr = make_cudaPitchedPtr(d_RFIn + offSetIn, numSample * sizeof(float2), numSample, 1);
			myparms.dstPos = make_cudaPos(0, 0, numChannel*(i%nLayered));
			cudaMemcpy3DAsync(&myparms, stream[i]);

			kernel_DAS_DynApo8 << <dimGrid, dimBlock, 0, stream[i] >> > (d_LRI + offSetOut, d_pixelMapX, d_pixelMapZ, h_angle[i], imageHeight, imageWidth, numChannel, h_t0_wl, i%nLayered);
			cudaMemcpy2DAsync(h_LRI_r + offSetOut, sizeof(float), d_LRI + offSetOut, sizeof(float2), sizeof(float), imageHeight*imageWidth, cudaMemcpyDeviceToHost, stream[i]);
			cudaMemcpy2DAsync(h_LRI_i + offSetOut, sizeof(float), &(d_LRI[offSetOut].y), sizeof(float2), sizeof(float), imageHeight*imageWidth, cudaMemcpyDeviceToHost, stream[i]);

		}
	}
	cudaDeviceSynchronize();

Robert_Crovella · July 6, 2017, 6:54pm

on windows, WDDM command batching can interfere with seeing proper overlap/concurrency

andyleow · July 7, 2017, 10:18am

Thanks txbob. Do you think there might be other method other than using cudaStreamQuery to flush the command queue? It seem there is some overlapping in the memory transfer out but the memory transfer to GPU is still preventing the kernel from running ascynchronously.

External Media

for (h = 0; h < numFrame; h++) {
		for (i = 0; i < numAngle; i++) {
			//cudaStreamSynchronize(stream[sID]);
			printf("Recon image %d", i + h*numAngle);
			offSetIn = (h*numAngle + i)*numSample*numChannel;
			offSetOut = (h*numAngle + i)*imageHeight*imageWidth;

			h_t0_wl = h_t0[i] * h_fs + 0.5f;

			myparms.srcPtr = make_cudaPitchedPtr(d_RFIn + offSetIn, numSample * sizeof(float2), numSample, 1);
			myparms.dstPos = make_cudaPos(0, 0, numChannel*(i%nLayered));
			cudaMemcpy3DAsync(&myparms, stream[i]);

			kernel_DAS_DynApo8 << <dimGrid, dimBlock, 0, stream[i] >> > (d_LRI + offSetOut, d_pixelMapX, d_pixelMapZ, h_angle[i], imageHeight, imageWidth, numChannel, h_t0_wl, i%nLayered);
			cudaMemcpy2DAsync(h_LRI_r + offSetOut, sizeof(float), d_LRI + offSetOut, sizeof(float2), sizeof(float), imageHeight*imageWidth, cudaMemcpyDeviceToHost, stream[i]);
			cudaMemcpy2DAsync(h_LRI_i + offSetOut, sizeof(float), &(d_LRI[offSetOut].y), sizeof(float2), sizeof(float), imageHeight*imageWidth, cudaMemcpyDeviceToHost, stream[i]);
			cudaStreamQuery(stream[i]);
		}
	}
	cudaDeviceSynchronize();

Thanks!
Andy

Robert_Crovella · July 7, 2017, 1:17pm

The cudaMemcpy3DAsync is launched into the same stream as the kernel call, so clearly the kernel call within that loop iteration cannot run concurrently with that particular transfer to the device.

I don’t know of other methods to flush the command queue, but I’m sure they exist.

Topic		Replies	Views
Memory copy/set async to kernel execution in different stream CUDA Programming and Performance	5	1099	December 15, 2022
async memcopy/kernel from different contexts overlaping operations from different contexts.. CUDA Programming and Performance	9	2949	December 18, 2008
Cannot force kernels to concurrent execution CUDA Programming and Performance	8	5555	April 28, 2012
Why kernel executions in different streams are not parallel? CUDA Programming and Performance	4	2810	April 29, 2019
Concurrent Kernel executions & Data Transfers CUDA Programming and Performance cuda	3	640	March 8, 2023
Overlapping kernel execution and data transfer CUDA Programming and Performance	9	3446	May 10, 2017
cudaMemcpyAsync CUDA Programming and Performance	10	21082	October 16, 2015
Cuda Streams for Concurrent Kernel Calls CUDA Programming and Performance	1	2249	October 26, 2016
Concurrent copy & execution problem Device to host memory copy is not overlapped with kernel exe CUDA Programming and Performance	1	1770	June 23, 2010
Conditions for CUDA streams to overlap CUDA Programming and Performance	5	4386	June 9, 2013

Asynchronous kernel execution and memory not overlapping using CUDA stream!

Related topics