Streams in a multi-GPU system

I just started implementing an application on a multi-GPU architecture. When I use one stream per GPU the application works very well, but I don't know how to use multiple streams per GPU in order to improve occupancy.

Hi abde, first this link might help answer your question.

Secondly, why do you require multiple streams per GPU?

Assuming you have 2 GPUs, your data transfers would need to keep pace with your kernels, i.e. you would need to transfer one data set in roughly the time it takes one data set to run through your kernel.

The only reason you would need multiple streams per GPU is the case where each stream only needs a small amount of data transferred to keep it fed, but at that point it might be better to just run multiple streams on 1 GPU.

Obviously, I don’t know your use case but that link should help provide some information for you.

This article in IEEE might also be more of what you are looking for.
“Effective multi-GPU communication using multiple CUDA streams and threads.”
If you Google it you should be able to find a copy without requiring an IEEE subscription (I found one easily).

Hi bha4395, I want to overlap kernel execution and data transfers across different GPUs, for example executing 4 kernels, 2 per GPU, in parallel.

I understand that, but will you actually be able to do so given your transfer speeds?

I guess it makes sense if your transfer time is roughly 1/6 of your kernel run-time, since you likely also have to account for the transfer time back to the CPU.

Unfortunately, I don’t know that I can help you at all. Hopefully someone else will click on your post and be of more assistance.

A quick question: have you used multiple streams on a single device before?
My understanding of multi-GPU streams is that each stream is specifically bound to the GPU that was active when the stream was created.
Assuming so, could you not create, say, 3 streams per GPU by iterating through all of the GPUs and creating 3 streams on each? If I'm not mistaken, you could even use the same launch pattern across each GPU.
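If it helps, that idea can be sketched like this (a minimal sketch; `MAX_GPUS` and `STREAMS_PER_GPU` are placeholder names I made up, and the device count is queried at runtime):

```cuda
#include <cuda_runtime.h>

#define MAX_GPUS 8          // placeholder upper bound on devices
#define STREAMS_PER_GPU 3   // placeholder stream count per device

int main()
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    cudaStream_t streams[MAX_GPUS][STREAMS_PER_GPU];

    // A stream belongs to the device that is active when it is created,
    // so select each device before creating its streams.
    for (int dev = 0; dev < num_gpus; dev++) {
        cudaSetDevice(dev);
        for (int s = 0; s < STREAMS_PER_GPU; s++)
            cudaStreamCreate(&streams[dev][s]);
    }

    // ... issue work into streams[dev][s], after cudaSetDevice(dev) ...

    for (int dev = 0; dev < num_gpus; dev++) {
        cudaSetDevice(dev);
        for (int s = 0; s < STREAMS_PER_GPU; s++)
            cudaStreamDestroy(streams[dev][s]);
    }
    return 0;
}
```

The key point is that `cudaSetDevice` must come before the creation loop for each GPU, because `cudaStreamCreate` binds the new stream to whichever device is currently active.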

Yes, I did multiple streams on one GPU and it works, but when I split the kernels across two GPUs they do not overlap. I used OpenMP to create one thread per GPU and create the streams inside each thread, but it is not working for me.

Have you checked whether the reason they aren't overlapping is that you cannot transfer data fast enough to feed both GPUs, plus their multiple streams, with enough data sets?

Another background question: I'm assuming you have gotten your program to the point where it executes on both GPUs, just without overlapping execution? That is easier to achieve than what you are suggesting, because the default streams still allow that type of operation to work. With multiple streams you need to explicitly create new streams for each device, making sure that you switch devices before creating them, while also not using the default stream.

bha4395, thank you very much for taking the time to answer my question. Here is my code:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <omp.h>

// N_x, N_y, N_z, dimGrid, dimBlock, my_kernel and checkCuda are
// defined elsewhere in the program.

int main()
{
	const int size_float = N_x * N_y * N_z * sizeof(float);
	const int SIZE = 2;   // streams per GPU

#pragma omp parallel num_threads(2)
	{
		int i = omp_get_thread_num();   // one OpenMP thread per GPU
		int j;
		printf("%d\n", i);

		cudaSetDevice(i);   // bind this thread to GPU i; the streams and
		                    // allocations below belong to this device

		float **tmp_d = (float **)malloc(sizeof(float *) * SIZE);
		float **wb_d = (float **)malloc(sizeof(float *) * SIZE);

		float **tmp_h = (float **)malloc(sizeof(float *) * SIZE);
		float **wb_h = (float **)malloc(sizeof(float *) * SIZE);

		cudaStream_t *stream = (cudaStream_t *)malloc(sizeof(cudaStream_t) * SIZE);

		for (j = 0; j < SIZE; j++) {
			cudaMalloc((void **)&tmp_d[j], size_float);
			cudaMalloc((void **)&wb_d[j], size_float);

			// pinned host memory, required for truly asynchronous copies
			cudaMallocHost((void **)&tmp_h[j], size_float);
			cudaMallocHost((void **)&wb_h[j], size_float);
		}

		for (j = 0; j < SIZE; j++)
			cudaStreamCreate(&stream[j]);

		for (j = 0; j < SIZE; j++) {
			checkCuda(cudaMemcpyAsync((void *)wb_d[j], (void *)wb_h[j], size_float, cudaMemcpyHostToDevice, stream[j]));
			checkCuda(cudaMemcpyAsync((void *)tmp_d[j], (void *)tmp_h[j], size_float, cudaMemcpyHostToDevice, stream[j]));
		}

		for (j = 0; j < SIZE; j++)
			my_kernel<<<dimGrid, dimBlock, 0, stream[j]>>>(tmp_d[j], wb_d[j]);

		for (j = 0; j < SIZE; j++)
			checkCuda(cudaMemcpyAsync((void *)tmp_h[j], (void *)tmp_d[j], size_float, cudaMemcpyDeviceToHost, stream[j]));

		for (j = 0; j < SIZE; j++)
			cudaStreamSynchronize(stream[j]);

		cudaStreamDestroy(stream[0]);
	}

	return 0;
}

And these are the results obtained in the Visual Profiler:

You’ll have to excuse me as I do not know OpenMP and haven’t used it before.

I can’t see the second set of streams for the second GPU but I’ll assume they exist but got cropped off.

First, I do not think you are cleaning up properly, as you call cudaStreamDestroy only once per OpenMP thread. It should be called twice, once for each stream.

Second, since I don't know OpenMP, it seems to me that there is a blocking call somewhere preventing the data transfer to the GTX 750 Ti. This is apparent because that GPU does no work until the data has been transferred over. I would look for blocking calls that might exist and make sure the cudaStreamSynchronize isn't blocking other parts of your program.
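One way to see where the stall is (a sketch, assuming the same per-thread `stream` array as in your code; `report_stream_progress` is a helper name I made up) is to poll with cudaStreamQuery, which returns immediately instead of blocking, and log each stream's state:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Non-blocking progress check: cudaStreamQuery returns cudaSuccess if all
// work queued in the stream has finished, or cudaErrorNotReady if work is
// still pending, without ever stalling the host thread.
void report_stream_progress(cudaStream_t *stream, int num_streams)
{
    for (int j = 0; j < num_streams; j++) {
        cudaError_t status = cudaStreamQuery(stream[j]);
        if (status == cudaSuccess)
            printf("stream %d: done\n", j);
        else if (status == cudaErrorNotReady)
            printf("stream %d: still running\n", j);
        else
            printf("stream %d: error %s\n", j, cudaGetErrorString(status));
    }
}
```

Calling something like this from each OpenMP thread right after the launches, and again before the synchronize, should show whether work for the second GPU was ever actually enqueued or whether the host thread got blocked before issuing it.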