Streams in a multi-GPU system

abde · May 22, 2017, 3:43pm

I just started implementing an application on a multi-GPU architecture. When I use one stream per GPU the application works very well, but I don't know how to use multiple streams per GPU in order to improve occupancy.

bha4395 · May 22, 2017, 5:22pm

Hi abde, first, this Stack Overflow question might help answer yours: "CUDA: do I need different streams on multiple GPUs to execute in parallel?"

Secondly, why do you need multiple streams per GPU?

Assuming you have 2 GPUs, your data transfers would need to keep up with the rate at which your kernel works through one data set.

The only reason you would need multiple streams is if each data set is small enough that the transfers can keep every stream supplied with data, but at that point it might be better to just run multiple streams on 1 GPU.

Obviously, I don’t know your use case but that link should help provide some information for you.

This IEEE article might also be closer to what you are looking for:
"Effective multi-GPU communication using multiple CUDA streams and threads."
If you google it you should be able to find a copy without needing an IEEE subscription (I found one easily).

abde · May 22, 2017, 6:24pm

Hi bha4395, I want to overlap kernel execution and data transfers across different GPUs, for example by executing 4 kernels in parallel, 2 per GPU.

bha4395 · May 22, 2017, 7:29pm

I understand that, but will your transfer speeds actually let you do so?

I guess it makes sense if your transfer time is roughly 1/6 of your kernel run time, since you likely also have to account for the transfer time back to the CPU.

Unfortunately, I don’t know that I can help you at all. Hopefully someone else will click on your post and be of more assistance.

Maybe a quick question: have you used multiple streams on one device before?
My limited understanding of multi-GPU streams is that each stream is tied to the GPU that was active when it was created.
Assuming so, could you not just create, say, 3 streams per GPU by iterating through all of the GPUs and creating 3 streams on each one? You could even choose the same 3 streams across each GPU, if I'm not mistaken.

abde · May 22, 2017, 7:58pm

Yes, I have used multiple streams on one GPU and it works, but when I split the kernels across two GPUs they do not overlap. I used OpenMP to create one thread per GPU and created the streams inside each thread, but it is not working for me.

bha4395 · May 22, 2017, 9:29pm

Have you checked whether they aren't overlapping because you cannot transfer data fast enough to feed both GPUs, and their multiple streams, with enough data sets?
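One rough way to check is to time a single copy and a single kernel run with CUDA events. This is only a sketch and none of it is taken from abde's code: `scale` is a hypothetical stand-in for the real kernel, and `time_copy_and_kernel` and the `TRY` macro are names made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

/* Hypothetical stand-in kernel; the real my_kernel is not shown in the thread. */
__global__ void scale(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

/* Error paths return early and leak resources, to keep the sketch short. */
#define TRY(call) do { if ((call) != cudaSuccess) return -1; } while (0)

/* Times one host-to-device copy and one kernel run on the current GPU.
 * Uses the default stream on purpose: the two steps are measured back to
 * back, not overlapped. n must be positive.
 * Returns 0 on success, -1 if a CUDA call fails. */
int time_copy_and_kernel(int n, float *copy_ms, float *kernel_ms)
{
    float *h = NULL, *d = NULL;
    cudaEvent_t start, copied, done;
    size_t bytes = (size_t)n * sizeof(float);

    TRY(cudaMallocHost((void **)&h, bytes));   /* pinned host memory */
    TRY(cudaMalloc((void **)&d, bytes));
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    TRY(cudaEventCreate(&start));
    TRY(cudaEventCreate(&copied));
    TRY(cudaEventCreate(&done));

    TRY(cudaEventRecord(start, 0));
    TRY(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, 0));
    TRY(cudaEventRecord(copied, 0));
    scale<<<(n + 255) / 256, 256>>>(d, n);
    TRY(cudaGetLastError());                   /* catches launch errors */
    TRY(cudaEventRecord(done, 0));
    TRY(cudaEventSynchronize(done));           /* wait for copy and kernel */

    TRY(cudaEventElapsedTime(copy_ms, start, copied));
    TRY(cudaEventElapsedTime(kernel_ms, copied, done));

    TRY(cudaEventDestroy(start));
    TRY(cudaEventDestroy(copied));
    TRY(cudaEventDestroy(done));
    TRY(cudaFree(d));
    TRY(cudaFreeHost(h));
    return 0;
}
```

If the copy time comes out close to or larger than the kernel time, adding more streams probably will not hide the transfers. Call `cudaSetDevice` first to measure a particular GPU.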

Another background question: I'm assuming you have gotten your program to the point where, while not overlapping execution, it is at least executing on both GPUs? That is easier to do than what you are suggesting, because it still works when you rely on the default streams. With multiple streams you need to explicitly create new streams for each device, making sure that you switch devices before creating each set of streams, and you also need to avoid using the default stream.
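As a rough sketch of that pattern from a single host thread (no OpenMP), under some loud assumptions: `add_one` is a hypothetical stand-in for the real kernel, and `MAX_GPUS`, `STREAMS_PER_GPU`, `run_on_all_gpus` and `TRY` are names invented for this example. A stream belongs to the device that was current when it was created, so each GPU gets its own stream handles.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define MAX_GPUS 8          /* hypothetical upper bound for this sketch */
#define STREAMS_PER_GPU 2   /* hypothetical stream count per GPU */

/* Hypothetical stand-in kernel; the real my_kernel is not shown in the thread. */
__global__ void add_one(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

/* Error paths return early and leak resources, to keep the sketch short. */
#define TRY(call) do { if ((call) != cudaSuccess) return -1; } while (0)

/* Runs add_one on n floats (n > 0) in every stream of every GPU.
 * Returns the number of GPUs used, or -1 on a CUDA error or wrong result. */
int run_on_all_gpus(int n)
{
    cudaStream_t stream[MAX_GPUS][STREAMS_PER_GPU];
    float *h[MAX_GPUS][STREAMS_PER_GPU];
    float *d[MAX_GPUS][STREAMS_PER_GPU];
    size_t bytes = (size_t)n * sizeof(float);
    int num_gpus = 0;

    TRY(cudaGetDeviceCount(&num_gpus));
    if (num_gpus > MAX_GPUS) num_gpus = MAX_GPUS;

    /* 1. Set everything up first: allocations can synchronize the device. */
    for (int g = 0; g < num_gpus; g++) {
        TRY(cudaSetDevice(g));
        for (int s = 0; s < STREAMS_PER_GPU; s++) {
            TRY(cudaStreamCreate(&stream[g][s]));
            TRY(cudaMallocHost((void **)&h[g][s], bytes));  /* pinned */
            TRY(cudaMalloc((void **)&d[g][s], bytes));
            for (int i = 0; i < n; i++) h[g][s][i] = 1.0f;
        }
    }

    /* 2. Queue all work asynchronously; nothing uses the default stream. */
    for (int g = 0; g < num_gpus; g++) {
        TRY(cudaSetDevice(g));
        for (int s = 0; s < STREAMS_PER_GPU; s++) {
            TRY(cudaMemcpyAsync(d[g][s], h[g][s], bytes,
                                cudaMemcpyHostToDevice, stream[g][s]));
            add_one<<<(n + 255) / 256, 256, 0, stream[g][s]>>>(d[g][s], n);
            TRY(cudaGetLastError());
            TRY(cudaMemcpyAsync(h[g][s], d[g][s], bytes,
                                cudaMemcpyDeviceToHost, stream[g][s]));
        }
    }

    /* 3. Only now wait for each stream, check one value and clean up. */
    for (int g = 0; g < num_gpus; g++) {
        TRY(cudaSetDevice(g));
        for (int s = 0; s < STREAMS_PER_GPU; s++) {
            TRY(cudaStreamSynchronize(stream[g][s]));
            if (h[g][s][0] != 2.0f) return -1;
            TRY(cudaFreeHost(h[g][s]));
            TRY(cudaFree(d[g][s]));
            TRY(cudaStreamDestroy(stream[g][s]));
        }
    }
    return num_gpus;
}
```

The point of the three separate phases is that no allocation or synchronization call sits between the asynchronous calls, so the host can queue work for the second GPU while the first is still busy. Whether the work actually overlaps still depends on the hardware and on the transfer sizes.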

abde · May 22, 2017, 10:07pm

bha4395, thank you very much for taking the time to answer my question. Here is my code:

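#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <cuda_runtime.h>

// N_x, N_y, N_z, dimGrid, dimBlock, init_temp, checkCuda and my_kernel are defined elsewhere (not shown).

int main()
{
	const size_t size_float = (size_t)N_x * N_y * N_z * sizeof(float);  // bytes per buffer
	const int SIZE = 2;   // streams (and buffer pairs) per GPU
	omp_set_dynamic(0);   // keep the number of threads fixed

	// One OpenMP thread per GPU
#pragma omp parallel num_threads(2)
	{
		int i = omp_get_thread_num();   // thread index doubles as the device index
		int j;
		printf("%d\n", i);
			cudaSetDevice(i);
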
			float **tmp_d = (float **)malloc(sizeof(float *) * SIZE);
			float **wb_d = (float **)malloc(sizeof(float *) * SIZE);

			float **tmp_h = (float **)malloc(sizeof(float *) * SIZE);
			float **wb_h = (float **)malloc(sizeof(float *) * SIZE);

			cudaStream_t *stream = (cudaStream_t *)malloc(sizeof(cudaStream_t) * SIZE);

			for (j = 0; j < SIZE; j++)
			{
				cudaSetDevice(i);
				cudaMalloc((void**)&tmp_d[j], size_float);
				cudaMalloc((void**)&wb_d[j], size_float);

				cudaMallocHost((void **)&tmp_h[j], size_float);
				cudaMallocHost((void **)&wb_h[j], size_float);
				cudaStreamCreate(&stream[j]);
			}

			for (j = 0; j < SIZE; j++)
				init_temp(tmp_h[j]);

			for (j = 0; j < SIZE; j++){
				cudaSetDevice(i);
				checkCuda(cudaMemcpyAsync((void *)wb_d[j], (void *)wb_h[j], size_float, cudaMemcpyHostToDevice, stream[j]));
				checkCuda(cudaMemcpyAsync((void *)tmp_d[j], (void *)tmp_h[j], size_float, cudaMemcpyHostToDevice, stream[j]));
			}

			for (j = 0; j < SIZE; j++){
				my_kernel<<<dimGrid, dimBlock, 0, stream[j]>>>(tmp_d[j], wb_d[j]);
			}

			for (j = 0; j < SIZE; j++){
				checkCuda(cudaMemcpyAsync((void *)tmp_h[j], (void *)tmp_d[j], size_float, cudaMemcpyDeviceToHost, stream[j]));
			}

			for (j = 0; j < SIZE; j++){
				cudaSetDevice(i);
				checkCuda(cudaStreamSynchronize(stream[j]));
			}

			for (j = 0; j < SIZE; j++)
			{
				cudaSetDevice(i);
				cudaFreeHost(tmp_h[j]);
				cudaFreeHost(wb_h[j]);
				cudaFree(tmp_d[j]);
				cudaFree(wb_d[j]);
			}

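			// Destroy every stream this thread created, then free the host-side arrays
			for (j = 0; j < SIZE; j++)
				cudaStreamDestroy(stream[j]);

			free(stream);
			free(tmp_d);
			free(wb_d);
			free(tmp_h);
			free(wb_h);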
	}

	cudaDeviceReset();
	system("pause");
}

And these are the results I obtained in the Visual Profiler:

bha4395 · May 23, 2017, 1:55pm

You’ll have to excuse me as I do not know OpenMP and haven’t used it before.

I can't see the second set of streams for the second GPU, but I'll assume they exist and got cropped off.

First, I do not think you are properly cleaning up after yourself, as you call cudaStreamDestroy only once per OpenMP thread. It should be called twice, once for each stream.

Second, although I don't know OpenMP, it looks like there is a blocking call somewhere that is holding up the data transfer to the GTX 750 Ti. I say that because the GPU does no work until the data has been transferred over. I would look for blocking calls, and make sure the cudaStreamSynchronize isn't blocking other parts of your program.
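It may also be worth ruling out a hardware limit before hunting for blocking calls. A minimal sketch, assuming a made-up helper name `print_overlap_properties`, that prints what each GPU reports about copy engines and concurrent kernels:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

/* Prints the overlap-related properties of every visible GPU.
 * Returns the number of GPUs, or -1 if a CUDA call fails. */
int print_overlap_properties(void)
{
    int num_gpus = 0;
    if (cudaGetDeviceCount(&num_gpus) != cudaSuccess)
        return -1;

    for (int g = 0; g < num_gpus; g++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, g) != cudaSuccess)
            return -1;
        /* asyncEngineCount: 0 = no copy/kernel overlap, 1 = a copy can
         * overlap a kernel, 2 = copies in both directions can also overlap.
         * concurrentKernels: nonzero if kernels can run concurrently. */
        printf("GPU %d (%s): asyncEngineCount=%d, concurrentKernels=%d\n",
               g, prop.name, prop.asyncEngineCount, prop.concurrentKernels);
    }
    return num_gpus;
}
```

If a device reports an `asyncEngineCount` of 0, copies and kernels on that device are not expected to overlap no matter how the streams are set up.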
