My program has two kernels, the second more complicated than the first, and I am trying to run them on 3 GPUs in parallel. The first kernel behaves as expected: the total elapsed time is only slightly longer than the time reported for each individual GPU. For the second kernel, however, the total elapsed time is longer than the sum of the times reported for all 3 GPUs. Can anyone help explain what might cause this? Here are the results:
First Kernel…
v = 224.06217957
Elapsed time (without Greeks): 0.140319 sec
Profiling Information for GPU Processing:
Device 0 : Tesla T10 Processor
Reduce Kernel : 0.13262 s
Device 1 : Tesla T10 Processor
Reduce Kernel : 0.13319 s
Device 2 : Tesla T10 Processor
Reduce Kernel : 0.13462 s
Second Kernel…
v = 224.06217957
Lb = 21.34053040
Elapsed time (with Greeks): 1.558295 sec
Profiling Information for GPU Processing:
Device 0 : Tesla T10 Processor
Reduce Kernel : 0.51363 s
Device 1 : Tesla T10 Processor
Reduce Kernel : 0.51602 s
Device 2 : Tesla T10 Processor
Reduce Kernel : 0.51722 s
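To make the comparison concrete, here is a minimal sketch in Python that checks the arithmetic on the numbers above. The timings are copied from the output; the function name `compare` and the dictionary keys are only illustrative and are not part of my program.

```python
def compare(elapsed, per_device):
    """Compare wall-clock time with the slowest device and with the sum of all devices."""
    slowest = max(per_device)
    total = sum(per_device)
    return {
        "slowest": slowest,
        "sum": total,
        # Close to 1 if the devices overlap, close to the device count if they run back to back.
        "elapsed_over_slowest": elapsed / slowest,
        # Negative if the run finished faster than a purely sequential run would.
        "elapsed_minus_sum": elapsed - total,
    }

# Values taken from the "Elapsed time" and "Reduce Kernel" lines above.
first = compare(0.140319, [0.13262, 0.13319, 0.13462])
second = compare(1.558295, [0.51363, 0.51602, 0.51722])

for name, r in (("first", first), ("second", second)):
    print(f"{name}: slowest={r['slowest']:.5f} s, sum={r['sum']:.5f} s, "
          f"elapsed/slowest={r['elapsed_over_slowest']:.2f}, "
          f"elapsed-sum={r['elapsed_minus_sum']:+.5f} s")
```

For the first kernel the elapsed time stays close to the slowest device. For the second kernel it is about three times the slowest device and slightly above the sum, which would be consistent with the three devices running one after another rather than concurrently, although these numbers alone cannot confirm that.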
By the way, I am using Tesla C1060 GPUs, and I based my program on the “simpleMultiGPU” sample from the NVIDIA OpenCL SDK. I am working under Linux. Thanks,