Problem with multiple GPUs: the GPUs are not working in parallel

There are two kernels in my program, the second one more complicated than the first. I am trying to run them on 3 GPUs in parallel. From the results, the first kernel runs as expected: the total time is only slightly longer than each individual GPU's time. However, for the second kernel the total execution time is longer than the sum of the times of all 3 GPUs. Can anyone help explain what may cause this? Here are the results:

First Kernel…

v = 224.06217957
Elapsed time (without Greeks): 0.140319 sec

Profiling Information for GPU Processing:

Device 0 : Tesla T10 Processor
Reduce Kernel : 0.13262 s

Device 1 : Tesla T10 Processor
Reduce Kernel : 0.13319 s

Device 2 : Tesla T10 Processor
Reduce Kernel : 0.13462 s

Second Kernel…

v = 224.06217957
Lb = 21.34053040
Elapsed time (with Greeks): 1.558295 sec

Profiling Information for GPU Processing:

Device 0 : Tesla T10 Processor
Reduce Kernel : 0.51363 s

Device 1 : Tesla T10 Processor
Reduce Kernel : 0.51602 s

Device 2 : Tesla T10 Processor
Reduce Kernel : 0.51722 s

By the way, I am using Tesla C1060 cards, and I modeled my program on the "simpleMultiGPU" sample provided in the NVIDIA OpenCL SDK. I am working under Linux. Thanks,

Do the devices share a single context or do they use separate ones?

They share a single context. The strange part is that the first kernel has no problem, but with the second one the GPUs do not seem to be running in parallel.
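For reference, a common cause of exactly this symptom (total time ≈ sum of per-device times) is a blocking call inside the per-device loop, e.g. clFinish or a blocking clEnqueueReadBuffer after each launch, which stalls the host before the next GPU is started. Below is a minimal sketch of the simpleMultiGPU-style launch pattern; all names (queues, kernels, bufs, results) are hypothetical placeholders, not from the original program.

```c
/* Sketch of the asynchronous multi-GPU launch pattern.
 * Assumes one cl_command_queue and one cl_kernel object per device
 * (all placeholder names), created in a single shared context. */
#include <CL/cl.h>

void launch_on_all_devices(cl_command_queue queues[3], cl_kernel kernels[3],
                           cl_mem bufs[3], float results[3],
                           size_t global, size_t local)
{
    /* Phase 1: enqueue work on every device WITHOUT blocking. */
    for (int i = 0; i < 3; i++) {
        clEnqueueNDRangeKernel(queues[i], kernels[i], 1, NULL,
                               &global, &local, 0, NULL, NULL);
        /* Non-blocking read (CL_FALSE) so the host can immediately
         * move on and start the next device. */
        clEnqueueReadBuffer(queues[i], bufs[i], CL_FALSE, 0,
                            sizeof(float), &results[i], 0, NULL, NULL);
    }

    /* Phase 2: synchronize only after ALL devices have work queued.
     * Calling clFinish (or a blocking read) inside the loop above
     * would serialize the GPUs: total time would become roughly the
     * sum of the per-device times. */
    for (int i = 0; i < 3; i++)
        clFinish(queues[i]);
}
```

It may be worth checking whether the second-kernel path does extra host-side work (e.g. additional clSetKernelArg calls or data preparation for the Greeks) between launches, or issues a blocking call per device, while the first-kernel path does not.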

http://forums.nvidia.com/index.php?showtopic=176628