I have the following problem. I’ve been trying to run computations concurrently on multiple GPUs with both CUDA and OpenCL on my GTX 590 (a dual-GPU card). The SDK sample simpleMultiGPU runs fine, but when I lengthen the kernel runtime (by adding an outer for loop with a large iteration count inside the kernel) and change the GPU_N parameter in the main program to use first a single GPU and then both GPUs, both configurations take the same execution time. This suggests that the two GPUs run sequentially, not concurrently. Exactly the same happens with the CUDA and the OpenCL sample. Here is the OpenCL output with profiling enabled:
    # time oclSimpleMultiGPU
    [oclSimpleMultiGPU] starting...
    oclSimpleMultiGPU Starting, Array = 25165824 float values...
    Setting up OpenCL on the Host...
    OpenCL Profiling is enabled...
    clGetPlatformID...
    clGetDeviceIDs...
    clCreateContext...
    Device 0: GeForce GTX 590
    clCreateCommandQueue
    Device 1: GeForce GTX 590
    clCreateCommandQueue
    oclLoadProgSource
    clCreateProgramWithSource
    clBuildProgram
    clCreateBuffer (Page-locked Host)
    clCreateBuffer (Input) Dev 0
    clEnqueueCopyBuffer (Input) Dev 0
    clCreateBuffer (Output) Dev 0
    clCreateKernel Dev 0
    clSetKernelArg Dev 0
    clCreateBuffer (Input) Dev 1
    clEnqueueCopyBuffer (Input) Dev 1
    clCreateBuffer (Output) Dev 1
    clCreateKernel Dev 1
    clSetKernelArg Dev 1
    Launching Kernels on GPU(s)...
    clWaitForEvents complete...
    Profiling Information for GPU Processing:
    Device 0 : GeForce GTX 590
    Reduce Kernel : 21.75045 s
    Copy Device->Host : 0.00001 s
    Device 1 : GeForce GTX 590
    Reduce Kernel : 21.79162 s
    Copy Device->Host : 0.00001 s
    Launching Host/CPU C++ Computation...
    Comparing against Host/C++ computation...
    GPU sum: 12582409.571259
    CPU sum: 12582409.524807
    Relative Error 100.0 * Error / Golden) = 0.000000
    real 0m44.528s
    user 0m28.010s
    sys 0m16.397s
As you can see, the real (wall-clock) time of the run is slightly more than twice the time of one kernel run, which means that the two kernels ran one after the other. (The host computation takes only a fraction of a second.)
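For anyone who wants to reproduce the measurement outside the SDK sample, the experiment can be sketched roughly as follows. This is not the simpleMultiGPU code: busyKernel, runOnDevices, the problem size and the iteration count are all made up for illustration, and the sketch assumes a toolkit where one host thread may switch devices with cudaSetDevice (CUDA 4.0 or later) and where nvcc accepts C++11. The work is split evenly across the devices, so if the GPUs overlap, the two-GPU time should be about half the one-GPU time; if both times come out roughly equal, the launches were effectively serialized.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(err_));                     \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical stand-in for the modified sample kernel: an outer loop
// repeats a cheap computation so that one launch runs long enough to time.
__global__ void busyKernel(float *out, int n, int outerIters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < outerIters; ++k)
        acc += sinf(acc + (float)i);  // each step depends on the previous one
    out[i] = acc;
}

// Split n elements evenly over the first gpuCount devices, launch the kernel
// on each device from this single host thread, and return the wall-clock
// time in seconds until every device has finished.
static double runOnDevices(int gpuCount, int n, int outerIters)
{
    const int perDevice = n / gpuCount;
    std::vector<float *> buffers(gpuCount, nullptr);

    // Allocate before timing so context creation is not measured.
    for (int d = 0; d < gpuCount; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMalloc(&buffers[d], perDevice * sizeof(float)));
    }

    auto start = std::chrono::steady_clock::now();

    // Kernel launches are asynchronous, so this loop should return quickly.
    for (int d = 0; d < gpuCount; ++d) {
        CHECK(cudaSetDevice(d));
        busyKernel<<<(perDevice + 255) / 256, 256>>>(buffers[d], perDevice,
                                                     outerIters);
        CHECK(cudaGetLastError());
    }

    // Wait for all devices only after everything has been launched.
    for (int d = 0; d < gpuCount; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaDeviceSynchronize());
    }

    auto stop = std::chrono::steady_clock::now();

    for (int d = 0; d < gpuCount; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaFree(buffers[d]));
    }
    return std::chrono::duration<double>(stop - start).count();
}

int main()
{
    int available = 0;
    CHECK(cudaGetDeviceCount(&available));

    const int n = 1 << 22;        // arbitrary problem size
    const int outerIters = 2000;  // arbitrary; tune for a measurable runtime

    for (int gpus = 1; gpus <= available && gpus <= 2; ++gpus)
        std::printf("%d GPU(s): %.3f s\n", gpus,
                    runOnDevices(gpus, n, outerIters));
    return 0;
}
```

Measuring with a host clock around launch and synchronize, rather than with per-device profiling events, is deliberate here: per-device kernel times stay the same whether the devices overlap or not, so only the wall-clock total can show serialization.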
I tried the same with CUDA 3.2 on a dual Tesla M2090 system. There the CUDA sample does run concurrently (2 GPUs are much faster than 1 GPU), but the OpenCL sample does not: it takes the same time whether 1 or 2 GPUs are used.
Do I need to configure something in the driver, or is there something else I am missing? My questions boil down to:
How can I use two GPUs concurrently with CUDA and OpenCL on a GTX 590 with the simpleMultiGPU sample code?
Why does the CUDA sample work concurrently on a dual Tesla M2090 system, but not the corresponding OpenCL sample?
OpenSUSE 11.4 64bit
NVIDIA driver 270.41.19
CUDA SDK 3.2 (Tesla system) and 4.0 (GTX 590 system)