Can't achieve cudaMemcpyAsync and kernel concurrency

hi folks,

I’m writing a 360 video stitching application, and I’m running into a concurrency issue. The attached pic shows the serial version.

My plan is to assign the odd frames to one stream and the even frames to another, so that the memcpys and kernels can overlap between the two streams.

My code is below. The problem is that I can’t see the expected concurrency: all the memcpys (both H2D and D2H) and the kernels run serially. What did I miss?

cudaStream_t stream[2];

for (int i = 0; i < 2; ++i)
	cudaStreamCreate(&stream[i]);

...

// pin each source plane, then queue its async H2D copy on the current stream
for (int i = 0; i < src_n; ++i)
{
	h_arg.iimg[i] = inbuf[stream_idx][i];
	cudaHostRegister(src[i].y.data, src[i].y.width * src[i].y.height, cudaHostRegisterDefault);
	checkCudaErrors(cudaMemcpyAsync(h_arg.iimg[i].y.data, src[i].y.data, src[i].y.width * src[i].y.height, cudaMemcpyHostToDevice, stream[stream_idx]));
	cudaHostRegister(src[i].u.data, src[i].u.width * src[i].u.height, cudaHostRegisterDefault);
	checkCudaErrors(cudaMemcpyAsync(h_arg.iimg[i].u.data, src[i].u.data, src[i].u.width * src[i].u.height, cudaMemcpyHostToDevice, stream[stream_idx]));
	cudaHostRegister(src[i].v.data, src[i].v.width * src[i].v.height, cudaHostRegisterDefault);
	checkCudaErrors(cudaMemcpyAsync(h_arg.iimg[i].v.data, src[i].v.data, src[i].v.width * src[i].v.height, cudaMemcpyHostToDevice, stream[stream_idx]));
}
h_arg.oimg = outbuf[stream_idx];
cudaHostRegister(dst->y.data, dst->y.width * dst->y.height, cudaHostRegisterDefault);
cudaHostRegister(dst->u.data, dst->u.width * dst->u.height, cudaHostRegisterDefault);
cudaHostRegister(dst->v.data, dst->v.width * dst->v.height, cudaHostRegisterDefault);

kernel_stitch<<<dim_grid, dim_block, 0, stream[stream_idx]>>>(h_arg);

checkCudaErrors(cudaMemcpyAsync(dst->y.data, h_arg.oimg.y.data, dst->y.width * dst->y.height, cudaMemcpyDeviceToHost, stream[stream_idx]));
checkCudaErrors(cudaMemcpyAsync(dst->u.data, h_arg.oimg.u.data, dst->u.width * dst->u.height, cudaMemcpyDeviceToHost, stream[stream_idx]));
checkCudaErrors(cudaMemcpyAsync(dst->v.data, h_arg.oimg.v.data, dst->v.width * dst->v.height, cudaMemcpyDeviceToHost, stream[stream_idx]));

// sync and output previous frame (in previous stream)
uint32_t prev_idx = (stream_idx + stream_num - 1) % stream_num;
checkCudaErrors(cudaStreamSynchronize(stream[prev_idx]));

for (int i = 0; i < src_n; ++i)
{
	if (prev_inbuf[i].y.data)
		cudaHostUnregister(prev_inbuf[i].y.data);
	if (prev_inbuf[i].u.data)
		cudaHostUnregister(prev_inbuf[i].u.data);
	if (prev_inbuf[i].v.data)
		cudaHostUnregister(prev_inbuf[i].v.data);
	prev_inbuf[i] = src[i];
}
if (prev_outbuf.y.data)
	cudaHostUnregister(prev_outbuf.y.data);
if (prev_outbuf.u.data)
	cudaHostUnregister(prev_outbuf.u.data);
if (prev_outbuf.v.data)
	cudaHostUnregister(prev_outbuf.v.data);
prev_outbuf = *dst;

stream_idx = (stream_idx + 1) % stream_num;

I finally figured it out.

The TX1 has only one async copy engine, so the order in which work is issued to the streams is a little trickier than on devices with multiple copy engines.
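
You can confirm the number of copy engines at runtime by querying the device properties; a minimal sketch, assuming device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);
	// 1 means a single copy engine shared by all streams (the TX1 case):
	// only one async memcpy can be in flight at a time, in either direction
	printf("asyncEngineCount = %d\n", prop.asyncEngineCount);
	return 0;
}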

The trick is well covered in this post:
https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
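
In short: with a single copy engine, my original depth-first order (H2D, kernel, D2H per frame) puts frame N’s D2H in the one copy queue ahead of frame N+1’s H2D; that D2H can’t start until frame N’s kernel finishes, so everything serializes. Issuing breadth-first avoids this. A simplified sketch of the idea (process_frame, N_STREAMS, and the flat buffers are placeholders, not my real stitching code):

#include <cuda_runtime.h>

// placeholder kernel standing in for the real stitch kernel
__global__ void process_frame(const unsigned char* in, unsigned char* out, size_t n)
{
	size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = in[i];
}

int main()
{
	const int    N_STREAMS = 2;
	const size_t bytes     = 1920 * 1080;
	const int    threads   = 256;
	const int    blocks    = (int)((bytes + threads - 1) / threads);

	cudaStream_t  stream[N_STREAMS];
	unsigned char *h_in[N_STREAMS], *h_out[N_STREAMS];
	unsigned char *d_in[N_STREAMS], *d_out[N_STREAMS];

	for (int s = 0; s < N_STREAMS; ++s) {
		cudaStreamCreate(&stream[s]);
		cudaMallocHost(&h_in[s],  bytes);  // pinned host memory, required for
		cudaMallocHost(&h_out[s], bytes);  // cudaMemcpyAsync to be truly async
		cudaMalloc(&d_in[s],  bytes);
		cudaMalloc(&d_out[s], bytes);
	}

	// breadth-first issue order for a device with ONE copy engine:
	// all H2D copies first, then all kernels, then all D2H copies, so no
	// D2H sits in the single copy queue blocking a later stream's H2D
	for (int s = 0; s < N_STREAMS; ++s)
		cudaMemcpyAsync(d_in[s], h_in[s], bytes, cudaMemcpyHostToDevice, stream[s]);
	for (int s = 0; s < N_STREAMS; ++s)
		process_frame<<<blocks, threads, 0, stream[s]>>>(d_in[s], d_out[s], bytes);
	for (int s = 0; s < N_STREAMS; ++s)
		cudaMemcpyAsync(h_out[s], d_out[s], bytes, cudaMemcpyDeviceToHost, stream[s]);

	for (int s = 0; s < N_STREAMS; ++s)
		cudaStreamSynchronize(stream[s]);
	return 0;
}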

Hi,

Here is our CUDA documentation on concurrent kernel execution for your reference:
Programming Guide :: CUDA Toolkit Documentation
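
For illustration, here is a minimal sketch of the pattern that section describes (toy spin kernel, all names hypothetical): two independent kernels issued to different streams may run concurrently on devices that report concurrentKernels == 1.

#include <cuda_runtime.h>

// toy kernel that runs long enough for overlap to show up in a profiler
__global__ void spin(float* x, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		for (int k = 0; k < 10000; ++k)
			x[i] = x[i] * 0.999f + 0.001f;
}

int main()
{
	const int n = 1 << 16;
	float *a, *b;
	cudaMalloc(&a, n * sizeof(float));
	cudaMalloc(&b, n * sizeof(float));

	cudaStream_t s0, s1;
	cudaStreamCreate(&s0);
	cudaStreamCreate(&s1);

	// no dependency between the streams, so the two launches are free to
	// execute concurrently on devices reporting concurrentKernels == 1
	spin<<<(n + 255) / 256, 256, 0, s0>>>(a, n);
	spin<<<(n + 255) / 256, 256, 0, s1>>>(b, n);

	cudaDeviceSynchronize();
	return 0;
}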