multi task parallelization with cuda streams ?

hlei · September 13, 2017, 3:01pm

main description:
I have got a parallelization application to process one data,there are about 6 kernels in the whole application.And I can get the correct result with this application. The question is the following:

Now assume that I got 10 datasets, generally, I will deal these data with a for loop,but to reach a higher speedup,I try to do this with cuda streams. And I want to get a effect as the following description:

Assume that dealing one dataset spend 10s, then deal these ten datasets will spend 10s×10(may be less). I want to get a 10 speedup with cuda streams,which is dealing 10 datasets will spend 10s(although I think this is impossible).But compared with the for loop version, the cuda streams version have seen no improvements.(I think there should be some improvements more or less)

The following is the main code with streams:

for (int i = 0; i < nstreams; i++)
		{
			checkCudaErrors(cudaMemcpyAsync(dev_X + i * m1 * n, h_X + i * m1 * n, m1*n * sizeof(float), cudaMemcpyHostToDevice, streams[i]));
			
			dim3 sumGrid(4, m1);
			dim3 sumBlock(1024, 1);

			int sharedSize = sumBlock.x * sizeof(float);

			sumReduction_kernel << <sumGrid, sumBlock, sharedSize, streams[i] >> > (dev_Xmean + i * m1, dev_X + i * m1 * n, m1, n);
			printf("%s\n", cudaGetErrorString(cudaGetLastError()));

			sub1_kernel << <sumGrid, sumBlock, 0, streams[i] >> > (dev_XFinal + i * m1 * n, dev_X + i * m1 * n, dev_Xmean + i * m1, m1, n);
			printf("%s\n", cudaGetErrorString(cudaGetLastError()));
			checkCudaErrors(cudaMemcpyAsync(h_Xfinal + i * m1 * n, dev_XFinal + i * m1 * n, sizeof(float) * m1 * n, cudaMemcpyDeviceToHost, streams[i]));
			
			status = culaDeviceSgemm(
				'T',
				'N',
				m1, m1, n,
				CNSTn,
				dev_XFinal + i * m1 * n, n,
				dev_XFinal + i * m1 * n, n,
				CNST0,
				dev_sigma + i * m1 * m1, m1
			);
			checkStatus(status);

			status = culaDeviceSgetrf(m1, m1, dev_sigma + i * m1 * m1, m1, (culaDeviceInt*)dev_ipiv + i * m1);
			checkStatus(status);
			status = culaDeviceSgetri(m1, dev_sigma + i * m1 * m1, m1, (culaDeviceInt*)dev_ipiv + i * m1);
			checkStatus(status);
			printf("%s\n", "CULA inverse had done!");
			status = culaDeviceSgemm(
				'N',
				'T',
				n, m1, m1,
				CNST1,
				dev_XFinal + i * m1 * n, n,
				dev_sigma + i * m1 * m1, m1,
				CNST0,
				dev_buffer + i * m1 * n, n
			);
			checkStatus(status);
			dist_kernel << <4, 1024, 0, streams[i] >> > (dev_buffer + i * m1 * n, dev_XFinal +i * m1 * n, dev_dist + i * n, n, m1);
			printf("%s\n", cudaGetErrorString(cudaGetLastError()));

			checkCudaErrors(cudaMemcpyAsync(host_dist + i * n, dev_dist + i * n, sizeof(float) * n, cudaMemcpyDeviceToHost, streams[i]));

		}

I did use CULA library to do the matrix multiplication and inversion,and in Nsight I have seen all the CULA operation in the default stream.

Hope you can give some analysis and suggestions about this code about why I can’t get an obvious improvements.

(And any other way to do a muti task parallelization ?)

device : GTX1060 6GB

thank you very much

Robert_Crovella · September 13, 2017, 5:40pm

Why do you expect obvious improvements by using streams?

Streams provide obvious benefits in one or both of two areas:

1 increasing the “exposed parallelism” in the case when a kernel launch is too small to occupy the GPU (concurrent kernels)
2 overlap of copy and compute operations, to shorten the overall application timeline

If you are going after the first item, your efforts may be fruitless. Your kernel launch is probably already large enough to fully occupy the GPU, resulting in little or no opportunity for kernel concurrency

If you are going after the second item, a straightforward process would begin by using the profiler to ascertain the expected benefit of overlap, and then to also confirm that the code methods used (e.g. cudaMemcpyAsync, etc.) actually do result in the desired overlap pattern.

It’s quite possible that using a library like CULA will hamper your efforts. Certainly having a library call issue work to the default stream is not conducive to these goals.

BulatZiganshin · September 13, 2017, 6:45pm

on cpu, you need to run multiple threads to load all cores with work. on gpus, each kernel call is parallel for loop in essence. it represents many independent jobs, and thousands of gpu hardware threads are busy executing these jobs

hlei · September 14, 2017, 1:51am

Hi, txbob
First thank you for your reply

By using streams, I want to let the several different streams executing in parallel to reduce the overall execution time at some extent.

Maybe I am going after the second item, I have seen a overlap of dist_kernel and sumReduction_kernel between two consequent streams in nsight. And there are no other overlap except this.(I don’t know why profiler always stop at ‘generating timeline’,and it can’t generate it.)

So if there are other way to do a multi task on one GPU like what I have said,now I just want to do this to get a higher speedup.

thanks

hlei · September 14, 2017, 1:58am

Hi, BulatZiganshin
Also thanks for your attention

I almost got what you have said, but there are really no way to do a task parallelization?
Even if there are little improvements on the total execution time.

Thank you

BulatZiganshin · September 14, 2017, 9:19am

well, if you are interested, i can give more detailed explanation:

call to a CUDA kernel looks like kernel<<>>(params). It’s equivalent to the loop

for (i=0; i<N; i++) kernel(i, params…)

Now you are looking at this loop and say to yourself “Hey, we can split this work into 4 parts, and run each part on separate core”. And yes, it’s good optimization strategy. For CPU

GPU has thousands of cores, and entire idea behind CUDA kernel calls is that you give GPU many thousands of similar jobs and GPU splits the work between ALL its cores

Now you can see that you are trying to repeat very basic, fundamental CUDA functionality and of course this doesn’t work

So, unlike CPU, you don’t need to implement parallel execution yourself - it’s already here. There are some more complex situations, which were described by txbob, but in most situations streams can’t improve performance since each kernel call is already utilize all GPU cores

hlei · September 14, 2017, 1:04pm

BulatZiganshin,

yes,now I am hungering everything about CUDA. I feel I’m not so clear about some basic working principle.

txbob had explained many confusions for me.

It’s just very kind of you. many thanks

hlei · September 14, 2017, 1:05pm

BulatZiganshin,

yes,now I am hungering everything about CUDA. I feel I’m not so clear about some basic working principle.

txbob had explained many confusions for me.

It’s just very kind of you. many thanks

Topic		Replies	Views
confusions about CUDA streams CUDA Programming and Performance	5	788	July 30, 2017
My streams are not running concurrently CUDA Programming and Performance	7	1717	March 6, 2018
Question on Stream, Connection and Performance CUDA Programming and Performance hw , cuda	6	880	February 23, 2024
Using CUDA to run many instances CUDA Programming and Performance	10	3186	April 1, 2012
How to effectively parallelize cuda kernel launches on CPU CUDA Programming and Performance	9	2975	January 19, 2018
Cannot get any stream parallelism. CUDA Programming and Performance	13	1226	December 31, 2019
Why kernel executions in different streams are not parallel? CUDA Programming and Performance	4	2316	April 29, 2019
cudaMemcpyAsync CUDA Programming and Performance	10	20049	October 16, 2015
Memset/memcpyDtoD implicitly synchronizes all streams -- a way to disable it? CUDA Programming and Performance	5	515	August 23, 2023
streams in Multi-gpu system CUDA Programming and Performance	7	5881	May 23, 2017

multi task parallelization with cuda streams ?

Related topics