Multi-task parallelization with CUDA streams?

main description:
I have a parallelized application that processes one dataset; the whole application uses about 6 kernels, and it produces the correct result. My question is the following:

Now assume I have 10 datasets. Normally I would process them with a for loop, but to get a higher speedup I am trying to use CUDA streams. The effect I want is the following:

Assume that processing one dataset takes 10 s; processing the ten datasets in a loop then takes 10 s × 10 (maybe a little less). With CUDA streams I would like a 10× speedup, i.e. processing all 10 datasets in 10 s (although I suspect this is impossible). But compared with the for-loop version, the CUDA streams version shows no improvement at all. (I think there should be at least some improvement.)

The following is the main code with streams:

for (int i = 0; i < nstreams; i++)
{
    checkCudaErrors(cudaMemcpyAsync(dev_X + i * m1 * n, h_X + i * m1 * n, m1 * n * sizeof(float), cudaMemcpyHostToDevice, streams[i]));

    dim3 sumGrid(4, m1);
    dim3 sumBlock(1024, 1);

    int sharedSize = sumBlock.x * sizeof(float);

    sumReduction_kernel<<<sumGrid, sumBlock, sharedSize, streams[i]>>>(dev_Xmean + i * m1, dev_X + i * m1 * n, m1, n);
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    sub1_kernel<<<sumGrid, sumBlock, 0, streams[i]>>>(dev_XFinal + i * m1 * n, dev_X + i * m1 * n, dev_Xmean + i * m1, m1, n);
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    checkCudaErrors(cudaMemcpyAsync(h_Xfinal + i * m1 * n, dev_XFinal + i * m1 * n, sizeof(float) * m1 * n, cudaMemcpyDeviceToHost, streams[i]));

    status = culaDeviceSgemm('T', 'N',
                             m1, m1, n,
                             CNSTn,
                             dev_XFinal + i * m1 * n, n,
                             dev_XFinal + i * m1 * n, n,
                             CNST0,
                             dev_sigma + i * m1 * m1, m1);
    checkStatus(status);

    status = culaDeviceSgetrf(m1, m1, dev_sigma + i * m1 * m1, m1, (culaDeviceInt*)dev_ipiv + i * m1);
    checkStatus(status);
    status = culaDeviceSgetri(m1, dev_sigma + i * m1 * m1, m1, (culaDeviceInt*)dev_ipiv + i * m1);
    checkStatus(status);
    printf("%s\n", "CULA inverse done!");

    status = culaDeviceSgemm('N', 'T',
                             n, m1, m1,
                             CNST1,
                             dev_XFinal + i * m1 * n, n,
                             dev_sigma + i * m1 * m1, m1,
                             CNST0,
                             dev_buffer + i * m1 * n, n);
    checkStatus(status);

    dist_kernel<<<4, 1024, 0, streams[i]>>>(dev_buffer + i * m1 * n, dev_XFinal + i * m1 * n, dev_dist + i * n, n, m1);
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    checkCudaErrors(cudaMemcpyAsync(host_dist + i * n, dev_dist + i * n, sizeof(float) * n, cudaMemcpyDeviceToHost, streams[i]));
}

I used the CULA library for the matrix multiplication and inversion, and in Nsight I can see that all the CULA operations run in the default stream.

I hope you can give some analysis and suggestions on why this code doesn't show an obvious improvement.

(Is there any other way to do multi-task parallelization?)

Device: GTX 1060 6 GB

thank you very much

Why do you expect obvious improvements by using streams?

Streams provide obvious benefits in one or both of two areas:

1. increasing the “exposed parallelism” in the case when a kernel launch is too small to occupy the GPU (concurrent kernels)
2. overlap of copy and compute operations, to shorten the overall application timeline

If you are going after the first item, your efforts may be fruitless. Your kernel launches are probably already large enough to fully occupy the GPU, leaving little or no opportunity for kernel concurrency.

If you are going after the second item, a straightforward process would begin by using the profiler to ascertain the expected benefit of overlap, and then to also confirm that the code methods used (e.g. cudaMemcpyAsync, etc.) actually do result in the desired overlap pattern.
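As a sketch of the copy/compute overlap pattern described above (illustrative names: `chunk`, `myKernel`, the buffer pointers; this is an assumed restructuring, not the poster's actual code), the usual recipe is pinned host memory plus breadth-first issue order:

```cuda
// Sketch of copy/compute overlap. Host buffers MUST be pinned
// (cudaMallocHost); otherwise cudaMemcpyAsync degrades to an effectively
// synchronous copy and nothing overlaps.
const int nstreams = 10, chunk = 1 << 20;
float *h_in, *h_out, *d_in, *d_out;
cudaMallocHost(&h_in,  nstreams * chunk * sizeof(float));   // pinned
cudaMallocHost(&h_out, nstreams * chunk * sizeof(float));   // pinned
cudaMalloc(&d_in,  nstreams * chunk * sizeof(float));
cudaMalloc(&d_out, nstreams * chunk * sizeof(float));

cudaStream_t streams[nstreams];
for (int i = 0; i < nstreams; i++) cudaStreamCreate(&streams[i]);

// Breadth-first issue order: all H2D copies, then all kernels, then all
// D2H copies, so work from different streams has a chance to overlap.
for (int i = 0; i < nstreams; i++)
    cudaMemcpyAsync(d_in + i * chunk, h_in + i * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, streams[i]);
for (int i = 0; i < nstreams; i++)
    myKernel<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_out + i * chunk,
                                                          d_in + i * chunk, chunk);
for (int i = 0; i < nstreams; i++)
    cudaMemcpyAsync(h_out + i * chunk, d_out + i * chunk,
                    chunk * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
cudaDeviceSynchronize();
```

Note that if `h_X` in the posted code was allocated with plain malloc/new, the cudaMemcpyAsync calls cannot overlap with anything, regardless of streams.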

It’s quite possible that using a library like CULA will hamper your efforts. Certainly having a library call issue work to the default stream is not conducive to these goals.
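One possible way around the default-stream serialization (a sketch, assuming the CULA GEMMs can be swapped for cuBLAS; parameter names mirror the question) is cuBLAS, whose cublasSetStream lets each call be issued to your per-iteration stream:

```cuda
// Sketch: stream-aware GEMM via cuBLAS instead of culaDeviceSgemm.
// handle, alpha, beta and the device pointers are assumed/illustrative.
#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);

for (int i = 0; i < nstreams; i++) {
    cublasSetStream(handle, streams[i]);   // later calls go to streams[i]
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                m1, m1, n,
                &alpha,
                dev_XFinal + i * m1 * n, n,
                dev_XFinal + i * m1 * n, n,
                &beta,
                dev_sigma + i * m1 * m1, m1);
}
cublasDestroy(handle);
```

For the LU factorization and inversion, cuBLAS also offers cublasSgetrfBatched/cublasSgetriBatched, which handle many small matrices in a single call; that may fit this many-datasets pattern better than per-dataset CULA calls.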

On a CPU, you need to run multiple threads to load all cores with work. On a GPU, each kernel call is in essence a parallel for loop: it represents many independent jobs, and thousands of GPU hardware threads are busy executing those jobs.

Hi, txbob
First thank you for your reply

By using streams, I want the several different streams to execute in parallel, to reduce the overall execution time to some extent.

Maybe I am going after the second item. In Nsight I have seen an overlap of dist_kernel and sumReduction_kernel between two consecutive streams, and there is no other overlap besides this. (I don’t know why the profiler always stops at ‘generating timeline’ and can’t generate it.)

So, is there another way to run multiple tasks on one GPU, as I described? I just want to get a higher speedup.

thanks

Hi, BulatZiganshin
Also thanks for your attention

I mostly understand what you said, but is there really no way to do task parallelization?
Even if it only gives a small improvement in total execution time?

Thank you

well, if you are interested, i can give more detailed explanation:

A call to a CUDA kernel looks like kernel<<<grid, block>>>(params). It’s equivalent to the loop

for (i=0; i<N; i++) kernel(i, params…)

Now you look at this loop and say to yourself, “Hey, we can split this work into 4 parts and run each part on a separate core.” And yes, that’s a good optimization strategy. For a CPU.

A GPU has thousands of cores, and the entire idea behind CUDA kernel calls is that you give the GPU many thousands of similar jobs and the GPU splits the work between ALL its cores.

Now you can see that you are trying to reimplement very basic, fundamental CUDA functionality, and of course this doesn’t help.

So, unlike on a CPU, you don’t need to implement parallel execution yourself: it’s already there. There are some more complex situations, which were described by txbob, but in most cases streams can’t improve performance, since each kernel call already utilizes all GPU cores.
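The loop-to-kernel correspondence above can be sketched like this (illustrative names; the per-element job is just an example):

```cuda
// CPU version: one job per loop iteration.
// for (int i = 0; i < N; i++) process(i, data);

// GPU version: the hardware runs the iterations in parallel; a grid-stride
// loop also covers the case where N exceeds the total number of threads.
__global__ void process_kernel(float *data, int N)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += gridDim.x * blockDim.x)
        data[i] = data[i] * 2.0f;   // illustrative per-element job
}

// Launch: thousands of threads pick up the N jobs, filling the whole GPU.
// process_kernel<<<(N + 255) / 256, 256>>>(d_data, N);
```

A single launch like this is already the GPU’s form of “multi-task” parallelism, which is why adding streams on top rarely helps when the launch is large.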

BulatZiganshin,

Yes, I am now eager to learn everything about CUDA. I feel I’m not yet clear on some of the basic working principles.

txbob has cleared up many confusions for me.

It’s very kind of you. Many thanks.
