CUDA/OpenCL runs multiple GPUs sequentially


I have the following problem. I’ve been trying to use multiple GPUs concurrently with both CUDA and OpenCL on my GTX 590 (dual-GPU). The SDK sample simpleMultiGPU runs fine, but when I increase the kernel runtime (by adding an outer for loop with a large iteration count inside the kernel) and set the GPU_N parameter in the main program to use first a single GPU and then both GPUs, both configurations take the same execution time. This essentially means that the GPUs run sequentially, not concurrently. Exactly the same happens with the CUDA and the OpenCL sample. Here is the OpenCL output with profiling enabled:

# time oclSimpleMultiGPU
[oclSimpleMultiGPU] starting...
oclSimpleMultiGPU Starting, Array = 25165824 float values...
Setting up OpenCL on the Host...
OpenCL Profiling is enabled...

 Device 0: GeForce GTX 590
 Device 1: GeForce GTX 590

clCreateBuffer (Page-locked Host)
clCreateBuffer (Input)          Dev 0
clEnqueueCopyBuffer (Input)     Dev 0
clCreateBuffer (Output)         Dev 0
clCreateKernel                  Dev 0
clSetKernelArg                  Dev 0
clCreateBuffer (Input)          Dev 1
clEnqueueCopyBuffer (Input)     Dev 1
clCreateBuffer (Output)         Dev 1
clCreateKernel                  Dev 1
clSetKernelArg                  Dev 1
Launching Kernels on GPU(s)...
clWaitForEvents complete...

Profiling Information for GPU Processing:
Device 0 : GeForce GTX 590
  Reduce Kernel     : 21.75045 s
  Copy Device->Host : 0.00001 s
Device 1 : GeForce GTX 590
  Reduce Kernel     : 21.79162 s
  Copy Device->Host : 0.00001 s

Launching Host/CPU C++ Computation...
Comparing against Host/C++ computation...
 GPU sum: 12582409.571259
 CPU sum: 12582409.524807
 Relative Error 100.0 * Error / Golden) = 0.000000 

real	0m44.528s
user	0m28.010s
sys	0m16.397s

As you can see, the real (wall-clock) time of the example is more than twice the time of one kernel run, meaning that the two kernels run in series. (The host computation takes only a fraction of a second.)

I tried the same with CUDA 3.2 on a dual Tesla M2090 system. There the CUDA sample actually runs concurrently (2 GPUs are much faster than 1 GPU), but the OpenCL sample does not: it runs for the same time no matter whether 1 or 2 GPUs are used.

I wonder if I need to do some configuration somewhere in the driver or is there something else I am missing? So the questions boil down to:

    How to use two GPUs concurrently on CUDA and OpenCL on a GTX 590 with the simpleMultiGPU sample code?

    Why does the CUDA sample work concurrently on a dual Tesla M2090 system, but not the corresponding OpenCL sample?

My configuration:

  • OpenSUSE 11.4 64bit

  • NVIDIA driver 270.41.19

  • CUDA SDK 3.2 (Tesla system) and 4.0 (GTX 590 system)


I am seeing similar behaviour on a dual GTX 460 system running Scientific Linux 6. A simple (OpenCL) addition of ~800MB of integer values takes around 0.4s on a single device, but around 0.6s if I try to use two devices concurrently. Each device is controlled by its own thread, using one OpenCL context per device. Have you had any luck yet figuring out this behaviour?

I did not find a solution to that problem yet.

But you are already doing better than I am: your two-device run takes less than twice the single-device time. You will never get a true 2x speedup, because the memory transfers between host and device share the same PCI bus and are therefore not fully concurrent.
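To put a rough number on that limit, here is a back-of-the-envelope sketch. It assumes (as a simplification, not something measured in this thread) that all host/device transfers are fully serialized on the shared bus while the kernel work splits evenly across the GPUs. The name `estimatedSpeedup` and its parameters are made up for illustration.

```cpp
#include <cassert>
#include <stdexcept>

// Crude model: transfers run back to back on one shared bus, while the
// kernel work is divided evenly and runs concurrently on every GPU.
// Both are simplifying assumptions, not measured behavior.
double estimatedSpeedup(double transferSeconds, double computeSeconds, int gpuCount) {
    if (gpuCount < 1 || transferSeconds < 0.0 || computeSeconds <= 0.0)
        throw std::invalid_argument("need gpuCount >= 1 and a positive compute time");

    const double singleGpuTime = transferSeconds + computeSeconds;
    const double multiGpuTime  = transferSeconds + computeSeconds / gpuCount;
    assert(multiGpuTime > 0.0);  // follows from the argument checks above

    return singleGpuTime / multiGpuTime;
}
```

Under this model, if the transfers take as long as the kernel itself, `estimatedSpeedup(1.0, 1.0, 2)` comes out at about 1.33 rather than 2; only with negligible transfer time does it approach 2.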

Did you profile the code with the OpenCL built-in profiling, to separate the time spent moving data from the time spent on actual computation? In my case, the data transfer for the second GPU always finishes after the first GPU’s execution time plus the first GPU’s transfer time, so the second GPU is definitely waiting for the first to finish before it goes ahead. In your case you seem to have at least some concurrency. Would you mind attaching some of your source code so that I can repeat the test with my GTX 590? Otherwise, would you be able to test the speed with the NVIDIA SDK example for OpenCL multi-GPU as described in my previous post, to see if you get similar results?
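One way to turn the profiling timestamps into a yes/no answer is to compute how much two commands overlap. This is only a sketch: `Interval`, `overlapNs` and `overlapFraction` are hypothetical names, and it assumes the start/end values of both devices are on a comparable clock, which the OpenCL specification does not guarantee across devices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// One profiled command: start and end timestamps in nanoseconds, as queried
// via CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
struct Interval {
    std::uint64_t startNs;
    std::uint64_t endNs;
};

// Nanoseconds during which both commands were executing at once (0 if none).
std::uint64_t overlapNs(const Interval& a, const Interval& b) {
    assert(a.endNs >= a.startNs && b.endNs >= b.startNs);
    const std::uint64_t latestStart = std::max(a.startNs, b.startNs);
    const std::uint64_t earliestEnd = std::min(a.endNs, b.endNs);
    return earliestEnd > latestStart ? earliestEnd - latestStart : 0;
}

// Fraction of the shorter command that ran concurrently with the other:
// 0.0 means fully serialized, 1.0 means fully overlapped.
double overlapFraction(const Interval& a, const Interval& b) {
    const std::uint64_t shorter =
        std::min(a.endNs - a.startNs, b.endNs - b.startNs);
    if (shorter == 0) return 0.0;
    return static_cast<double>(overlapNs(a, b)) / static_cast<double>(shorter);
}
```

A result near 0.0 for the two kernel events would match the "second GPU waits for the first" pattern described above.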


I can give you my code example, but I am beginning to think that there might be an issue with the NVIDIA OpenCL implementation. Since I have two GTX 460 graphics cards (not two GPUs sharing one card), I would also expect the transfer time to be reduced (probably not to 50%, but I’d at least expect around 60-70%). As you suggested, I ran the oclSimpleMultiGPU example (CUDA 4.0 SDK) with GPU_PROFILING enabled and an extended kernel runtime. It gave me a result similar to yours and to what I got with my demo code: total runtime = 2 times the single-device runtime. Here is the detailed output:

time ./oclSimpleMultiGPU
[oclSimpleMultiGPU] starting...
./oclSimpleMultiGPU Starting, Array = 25165824 float values...
Setting up OpenCL on the Host...
OpenCL Profiling is enabled...

 Device 0: GeForce GTX 460
 Device 1: GeForce GTX 460

clCreateBuffer (Page-locked Host)
clCreateBuffer (Input)          Dev 0
clEnqueueCopyBuffer (Input)     Dev 0
clCreateBuffer (Output)         Dev 0
clCreateKernel                  Dev 0
clSetKernelArg                  Dev 0
clCreateBuffer (Input)          Dev 1
clEnqueueCopyBuffer (Input)     Dev 1
clCreateBuffer (Output)         Dev 1
clCreateKernel                  Dev 1
clSetKernelArg                  Dev 1
Launching Kernels on GPU(s)...
clWaitForEvents complete...

Profiling Information for GPU Processing:
Device 0 : GeForce GTX 460
  Reduce Kernel     : 43.74597 s
  Copy Device->Host : 0.00000 s
Device 1 : GeForce GTX 460
  Reduce Kernel     : 44.22565 s
  Copy Device->Host : 0.00000 s

real    1m29.532s
user    0m39.995s
sys     0m49.214s


I think so too: there is an issue with the NVIDIA OpenCL implementation. They don’t seem to focus on OpenCL at all. They still support only OpenCL 1.0, even though 1.1 has been around for a long time now. They have a beta driver with OpenCL 1.1 support if you have a developer account with them, but it is quite old and doesn’t support my graphics card. The GTX 460 is probably not supported either, but you might give it a try.

Another test you could run is the CUDA version of the multi-GPU example. You can do the same thing there, but it doesn’t give you individual timings for the GPUs. So you need to hard-code the GPU_N variable in the source to 1 and then 2, run it with both, and compare the runtimes. The funny thing is, I get the same issue with CUDA on my GTX 590, but it actually works concurrently on a system with two separate Tesla M2090 cards. Can you give that a try and see what you get?


I ran the CUDA multi-GPU example and it actually showed parallel execution, at least to some degree. The GPU runtime dropped from 390ms on a single device to 270ms on two devices. The OpenCL multi-GPU example, however, behaved really strangely: the GPU runtime increased from ~500ms on one device to 1.8s on two. This really seems to be an issue in the Nvidia OpenCL implementation.
I must say I am really disappointed with how Nvidia supports OpenCL. I often get strange behaviour (kernels inexplicably stop working only to start working again after some time, profiling information doesn’t turn up in the profiling logs), and Nvidia has now waited almost a year without releasing a working OCL 1.1 driver. Since switching to CUDA is not an option, I am considering switching to ATI, since their OCL support seems far better.

Thank you all for pointing this out to us. We are looking into this.


Hi guys, I would appreciate it if you could shed some light on this.
I am hesitating between getting the GTX580 and the GTX590. My only concern is the number of CUDA cores I could use for parallel computations with OpenCL. So my question is: if I get the GTX590, will I be able to use 2x 512 CUDA cores in parallel, or only 1x 512 cores? Is it better for me to get two GTX580s in SLI or one GTX590?

The new NVIDIA drivers fixed this problem. Now all GPUs are running properly in parallel. Just make sure you are using the most recent drivers from NVIDIA and you are fine.

Also keep in mind that the GTX590 is reported as two separate GPUs to OpenCL and CUDA. So from a programming perspective, it is the same as getting two separate cards. Each GPU has separate global memory, etc.
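Because each GPU has its own global memory, the input has to be divided between the devices on the host side. Below is a minimal sketch of an even split; `Chunk` and `splitAcrossDevices` are illustrative names, not part of CUDA, OpenCL or the SDK samples.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Slice of the input assigned to one device.
struct Chunk {
    std::size_t offset;  // index of the first element
    std::size_t count;   // number of elements
};

// Splits totalElements into deviceCount contiguous chunks whose sizes
// differ by at most one element; earlier devices get the extra elements.
std::vector<Chunk> splitAcrossDevices(std::size_t totalElements,
                                      std::size_t deviceCount) {
    assert(deviceCount > 0);
    std::vector<Chunk> chunks;
    chunks.reserve(deviceCount);

    const std::size_t base = totalElements / deviceCount;
    const std::size_t remainder = totalElements % deviceCount;

    std::size_t offset = 0;
    for (std::size_t d = 0; d < deviceCount; ++d) {
        const std::size_t count = base + (d < remainder ? 1 : 0);
        chunks.push_back({offset, count});
        offset += count;
    }
    return chunks;
}
```

Each chunk would then be copied into a buffer on its own device, and the partial results combined on the host afterwards.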


I have 2 kernels and I am trying to run each kernel on a different GPU, but the kernels are serialized. Should I use a different context for each of them, or doesn’t it matter? I am using OpenCL.

You shouldn’t need to. Do you use multiple host threads for that? Or non-blocking kernel launches? I tried it with multiple host threads and that worked fine. Non-blocking kernel calls should also work, and you should wait on both events afterwards to see the full time taken.

No, I don’t use multiple host threads, and I think the kernel launch is non-blocking. Please have a look at my code below and tell me if I did something wrong. The Fermi card is using driver 290.10; should it be updated, or is that OK?

void setEnviGPU()
{
    vector<Platform> allPlatforms;
    Platform targetPlatform;

    // Query the available platforms
    Platform::get(&allPlatforms);
    if (!(allPlatforms.size() > 0))
        throw (std::string("InitCL()::Error: No platforms found (cl::Platform::get())"));

    // Select the target platform. Default: first platform
    targetPlatform = allPlatforms[GPU_PLATFORM];         // gpu platform

    // Create an OpenCL context containing all GPU devices of that platform
    cl_context_properties cprops[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)targetPlatform(), 0 };
    Context context = Context(CL_DEVICE_TYPE_GPU, cprops);

    // Detect OpenCL devices
    vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
    cout << "Number of devices: " << devices.size() << endl;

    // Create one OpenCL command queue per device
    CommandQueue queue = CommandQueue(context, devices[0], CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
    CommandQueue queue2 = CommandQueue(context, devices[1], CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);

    // Load CL file, build CL program object, create CL kernel objects
    std::string sourceStr = FileToString("");
    Program::Sources sources(1, std::make_pair(sourceStr.c_str(), sourceStr.length()));
    Program program = Program(context, sources);
    program.build(devices);
    Kernel kernel = Kernel(program, "fun1");         // -- test reliability
    Kernel kernel2 = Kernel(program, "fun2");        // -- test schedulability

    // Create OpenCL memory buffers
    // kernel1 buffers ...
    Buffer* bufferA = new Buffer(context, CL_MEM_READ_ONLY, sizeof(cl_int) * col * row);
    Buffer* bufferT = new Buffer(context, CL_MEM_READ_ONLY, sizeof(cl_float) * col);
    Buffer* bufferO = new Buffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_int) * row);
    queue.enqueueWriteBuffer(*bufferA, CL_FALSE, 0, sizeof(int) * col * row, a, NULL, NULL);
    queue.enqueueWriteBuffer(*bufferT, CL_FALSE, 0, sizeof(float) * col, T, NULL, NULL);

    // kernel2 output buffer; the inputs are the same buffers, written again via queue2
    Buffer* bufferO2 = new Buffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_int) * col * row);
    queue2.enqueueWriteBuffer(*bufferA, CL_FALSE, 0, sizeof(int) * col * row, a, NULL, NULL);
    queue2.enqueueWriteBuffer(*bufferT, CL_FALSE, 0, sizeof(float) * col, T, NULL, NULL);

    // set the kernel1 parameters
    kernel.setArg(0, *bufferA);
    kernel.setArg(1, *bufferT);
    kernel.setArg(2, *bufferO);

    // -- set up kernel2 parameters
    kernel2.setArg(0, *bufferA);
    kernel2.setArg(1, *bufferT);
    kernel2.setArg(2, *bufferO2);

    NDRange globalNDRange(1024);
    NDRange localNDRange(128);
    NDRange globalNDRange2(1024);        // Total number of work items
    NDRange localNDRange2(128);          // Work items in each work-group

    double time1 = 0, time2 = 0;
    Event event1, event2;                // -- to measure the execution time of the kernels

    // Launch one kernel per device; these calls are expected to return immediately
    queue.enqueueNDRangeKernel(kernel, NDRange(), globalNDRange, localNDRange, NULL, &event1);
    queue2.enqueueNDRangeKernel(kernel2, NDRange(), globalNDRange2, localNDRange2, NULL, &event2);

    // Profiling info is only available once the commands have completed
    event1.wait();
    event2.wait();

    cout << "kernel 1 start:   " << event1.getProfilingInfo<CL_PROFILING_COMMAND_START>() << endl;
    cout << "kernel 1 end:     " << event1.getProfilingInfo<CL_PROFILING_COMMAND_END>() << endl;
    cout << "kernel 2 start:   " << event2.getProfilingInfo<CL_PROFILING_COMMAND_START>() << endl;
    cout << "kernel 2 end:     " << event2.getProfilingInfo<CL_PROFILING_COMMAND_END>() << endl;
}

It should work (in theory), but the way I had it working was by using multiple host threads… I never tried it your way, so I don’t know if it works.

Can you please give me some hints about using multiple host threads, since I haven’t tried that before? Thanks.
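For what it’s worth, the host-thread pattern can be sketched in plain C++ without any OpenCL calls. This is not the code used earlier in the thread: `runOnAllDevices` and `perDeviceWork` are hypothetical names, and the per-device callback stands in for "enqueue on this device’s queue and block until it finishes". In a real program each thread would presumably use its own kernel object, since the OpenCL specification only allows concurrent clSetKernelArg calls when they operate on different cl_kernel objects.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs perDeviceWork(deviceIndex) on its own host thread for every device
// and waits for all of them. The callback is a placeholder for the real
// per-device work (write buffers, launch the kernel, wait, read back).
std::vector<double> runOnAllDevices(
        std::size_t deviceCount,
        const std::function<double(std::size_t)>& perDeviceWork) {
    assert(deviceCount > 0);
    std::vector<double> results(deviceCount);
    std::vector<std::thread> workers;
    workers.reserve(deviceCount);

    for (std::size_t d = 0; d < deviceCount; ++d) {
        // Each thread writes only its own slot, so no locking is needed.
        workers.emplace_back([&results, &perDeviceWork, d] {
            results[d] = perDeviceWork(d);
        });
    }
    for (std::thread& worker : workers) {
        worker.join();  // wait until every device has finished
    }
    return results;
}
```

Since a blocking call inside one thread cannot stall the thread that drives the other device, this pattern sidesteps a launch call that unexpectedly blocks; whether the driver then really overlaps the kernels still has to be checked with a timer or the profiler.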

Well this is an old topic but maybe that means someone has found a solution. I’m running an OpenCL kernel on multiple devices, with one command queue per device. All AMD devices I’ve tested (Opteron, FX, A6, R9 270x, Fury X) work in parallel but on NVidia hardware the kernels only execute sequentially. I’m testing on a dual GTX 980 Ti configuration.

On AMD I am able to call clEnqueueNDRangeKernel() sequentially for each device, but it seems to be a blocking call on NVidia. I don’t know how to call this function from multiple threads, given that I’m using a single kernel and I would think the calls to setArg() would interfere. Or would they? Or do I have to use (and presumably build) multiple kernels?

I’m reading the output data (several MB) from multiple host threads and I’ve tried the blocking/non-blocking versions of clEnqueueReadBuffer(), but to no avail.

Using CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE doesn’t seem to fix anything.

I would share the program, except that I don’t have time at the moment to whittle it down for posting. I guess I’m looking for more general suggestions about NVidia-specific behaviors. Note that the kernel runs in parallel and with correct results on AMD hardware, both CPUs and GPUs, and I’ve read the OpenCL spec, so I’m fairly convinced this is an NVidia thing. Both NVidia GPUs are contributing, but they seem to trade work back and forth.

By the way, I think the spec says that buffers may be shared among devices. I haven’t seen this work, at least on AMD, so I’m using entirely different input and output buffers for each device, each a few MB. Haven’t tested buffer sharing on NVidia (obviously have had bigger issues).

Any insight would be appreciated.

P.S. I haven’t hooked up an SLI connector. Could this be the solution? I did not need my CrossFire connector for the AMD GPUs to run in parallel (even the R9 270x and the HD7850).

P.P.S. I’ve tried to use the Visual Profiler but after running the program and selecting all events, etc… there is no timeline or results displayed.

---- UPDATE ----

To whom it may concern:

I was able to get two devices working together by creating multiple contexts, one for each device. As I understand the OpenCL spec this should not be necessary, and it is verifiably not required on AMD, but it seems to work in practice on nVidia. The catch is that I now seem to run out of memory, presumably because each context created adds significant overhead…

Hope this helps someone out there.

FYI we’ve observed several bugs that prevented asynchronous GPU operations from behaving correctly.
In one case clEnqueueNDRangeKernel() blocked, and in another clEnqueueWriteBuffer() blocked when passed non-pinned memory.

Bugs were reported, but I’m not sure if any of these got fixed.

The profiling support is known to have been, with an elegant move, simply discontinued. You can still get some profiling info using the command-line profiler and (perhaps with a bit of hacking of the CSV outputs) you should even be able to load the traces into nvvp!

This is the sad state of OpenCL on NVIDIA. Sad is even an understatement.