Problematic multi-GPU execution

Hi all,

I’m trying to run a fairly complex kernel on multiple GPUs in parallel. Each run is independent, and ideally I need them to run concurrently. So far, I haven’t been able to do that. Perhaps I’m missing something in the setup and launch, so I’m attaching a very simple example that reproduces the problem. The kernel start and stop printouts show that device 1 doesn’t start executing until device 0 is done. According to Nvidia’s “Multi-GPU Programming” presentation, the syntax I’m using “should” work. The CUDA Programming Guide doesn’t have much on this subject, except a very basic syntax that doesn’t seem to work for me.

I use a server with six Tesla C2050 cards and CUDA 4.2.

The Nvidia supplied multi-GPU example (simpleMultiGPU) also runs sequentially, which makes me very suspicious. Running a kernel on multiple GPUs like this defeats the purpose of having several GPUs in the first place. Only one device is executing at any time and the rest are idle, which is clearly not ideal.

I’m hoping that the problem is on my end, and someone could suggest a way to get this to work.

Thanks in advance


/* main.cpp */

#include "gpu.h"

int main(int argc, char* argv[]){

	compute();

	return 0;
}

/* gpu.h */

void compute();
/* gpu.cu */

#include <iostream>

#include "gpu.h"

#include <cuda_runtime.h>

#include <cstdio> // for in-kernel printf (requires compute capability 2.0+)

#define THREADS 32

#define BLOCKS 1

#define DATA_SIZE 1600

__global__ void kernel(int i, int device){

	// Shared memory must be declared inside the kernel, not at file scope;
	// this array just gives the kernel something to do.
	__shared__ double data[DATA_SIZE];

	if (blockIdx.x == 0 && threadIdx.x == 0) printf("KERNEL %d STARTED ON DEVICE %d\n", i, device);

	int pass = 0;

	while (1){

		int index = pass * THREADS * BLOCKS + threadIdx.x;

		if (index >= DATA_SIZE) break;

		data[index] = index * 2;

		pass++;
	}

	if (blockIdx.x == 0 && threadIdx.x == 0) printf("KERNEL %d FINISHED ON DEVICE %d\n", i, device);
}


void compute(){

	int devices = 0;

	cudaGetDeviceCount(&devices);

	cudaError_t error = cudaGetLastError();

	if (cudaSuccess != error){

		std::cout << "Error in kernel = " << cudaGetErrorString(error) << std::endl;
	}

	printf("Detected %d devices\n", devices);

	cudaStream_t* streams = new cudaStream_t[devices * 2];

	int stream_counter = 0;

	for (int i = 0; i < devices; i++){

		cudaSetDevice(i);

		for (int j = 0; j < 2; j++){

			cudaStream_t s;

			cudaStreamCreate(&s);

			streams[stream_counter] = s;

			std::cout << "Device " << i << ": Created stream " << s << " at index " << stream_counter << std::endl;

			stream_counter++;
		}
	}

	// Launch two kernels per device on the first two devices
	cudaSetDevice(0);

	kernel<<<BLOCKS, THREADS, 0, streams[0]>>>(0, 0);

	kernel<<<BLOCKS, THREADS, 0, streams[1]>>>(1, 0);

	cudaSetDevice(1);

	kernel<<<BLOCKS, THREADS, 0, streams[2]>>>(0, 1);

	kernel<<<BLOCKS, THREADS, 0, streams[3]>>>(1, 1);

	for (int i = 0; i < 4; i++){

		cudaStreamSynchronize(streams[i]);
	}

	delete[] streams;
}

What OS, and how do you know that they aren’t running in parallel?

It’s Linux x64 (CentOS 5.5).
First, I use printf statements at the beginning and end of the kernel (as in the attached source). With parallel execution (say, two streams on the same device), I get two start statements followed by two end flags. With 2+ devices, until device 0 finishes, device 1 doesn’t start.
Also, the execution time: even the Nvidia example (simpleMultiGPU) takes the same time on 6 Teslas as it does on a single GTX 480. No speedup whatsoever. Which, along with the print statements, makes for a pretty convincing serial-execution case…

printf ordering is not a guarantee of anything; output appears whenever the driver decides to flush the printf buffer, so it has no relation to whether or not something is actually running.

the SDK samples are not intended to be benchmarks and are not necessarily indicative of anything performance-wise; that example is almost certainly dominated by startup time, for example.

I’d say just do all the timing yourself–time all the cudaStreamSynchronize calls, then launch everything in such a way that you’d expect it to be serial and time those.
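Following tmurray’s suggestion, here is a hedged sketch of timing each device independently with CUDA events instead of trusting printf ordering (the kernel `busy_kernel` and its launch configuration are placeholders, not the original code). If the devices really do overlap, the sum of per-device times should clearly exceed the total wall-clock time:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real workload.
__global__ void busy_kernel(int device){
	// ... some work ...
}

int main(){
	int devices = 0;
	cudaGetDeviceCount(&devices);

	// One event pair per device, created on that device.
	cudaEvent_t* start = new cudaEvent_t[devices];
	cudaEvent_t* stop  = new cudaEvent_t[devices];

	for (int i = 0; i < devices; i++){
		cudaSetDevice(i);
		cudaEventCreate(&start[i]);
		cudaEventCreate(&stop[i]);
		cudaEventRecord(start[i]);   // enqueued on device i's default stream
		busy_kernel<<<1, 32>>>(i);
		cudaEventRecord(stop[i]);
	}

	// Only now block on each device and report its elapsed time.
	for (int i = 0; i < devices; i++){
		cudaSetDevice(i);
		cudaEventSynchronize(stop[i]);
		float ms = 0.0f;
		cudaEventElapsedTime(&ms, start[i], stop[i]);
		printf("device %d: %.3f ms\n", i, ms);
		cudaEventDestroy(start[i]);
		cudaEventDestroy(stop[i]);
	}

	delete[] start;
	delete[] stop;
	return 0;
}
```

The key point is that all launches are queued before any synchronization call; a cudaEventSynchronize (or cudaStreamSynchronize) inside the first loop would serialize the devices.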

Did you use nvidia-smi to see if they are running concurrently?

If you want some COTS software to do all the multi-GPU synchronization for you robustly, you might check out ArrayFire Pro. Manually coordinating a bunch of streams can be complicated.

tmurray, you were right. I guess the startup overhead ate up most of the performance benefit in the Nvidia example. I got it to work with my own code and more computation per kernel launch. There were a few issues with copying results back to the host.
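For anyone hitting the same copy-back issues: one pattern that keeps the devices overlapped is to queue the device-to-host copy asynchronously on the same stream as the kernel, using pinned host memory, and only synchronize after everything is queued on all devices. A minimal sketch, with a made-up kernel `produce` and made-up sizes (cleanup of the allocations is omitted for brevity):

```cuda
#include <cuda_runtime.h>

// Hypothetical result-producing kernel.
__global__ void produce(double* out, int n){
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < n) out[idx] = idx * 2.0;
}

void run_on_all_devices(int devices, int n){
	for (int i = 0; i < devices; i++){
		cudaSetDevice(i);

		double* d_out;
		cudaMalloc(&d_out, n * sizeof(double));

		// Pinned host memory lets the copy run asynchronously.
		double* h_out;
		cudaMallocHost(&h_out, n * sizeof(double));

		cudaStream_t s;
		cudaStreamCreate(&s);

		produce<<<(n + 255) / 256, 256, 0, s>>>(d_out, n);
		cudaMemcpyAsync(h_out, d_out, n * sizeof(double),
		                cudaMemcpyDeviceToHost, s);
		// Do NOT synchronize here; that would serialize the devices.
	}

	// Synchronize all devices only after everything is queued.
	for (int i = 0; i < devices; i++){
		cudaSetDevice(i);
		cudaDeviceSynchronize();
	}
}
```

With a pageable host buffer (plain new/malloc) the async copy silently falls back to a synchronous one, which is one way the copy-back step can end up serializing an otherwise concurrent setup.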

ymc, the last check with nvidia-smi showed full load on all cards.

I’ll definitely take a look at ArrayFire, since I expect the setup to become even more complicated than it is now.

Thanks for the tips