pthread+CUDA for MultiGPUs

I tested simpleMultiGPU last week and got a puzzling result. I ran it with 8, 4, and 1 GPUs, but the execution time increased as the number of GPUs increased. These are the test results:
• 8 GPUs Time: 15166 ms, GPU sum: 16777304
• 4 GPUs Time: 7974 ms, GPU sum: 16777304
• 1 GPU Time: 2236 ms, GPU sum: 26777296.
I changed nothing in the code except setting MAX_GPU_COUNT to 8, 4, and 1, respectively.

My system is a single node with two S2050 units (8 GPUs), running Linux, with pthreads + CUDA and GCC 4.1.3.

It looks as if the GPUs are running serially instead of in parallel. Why?

Those runtimes (even with 1 GPU) seem extremely long. I think you are observing the overhead of setting up a CUDA context on a device that is idle.
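As a rough check on that explanation, dividing each total reported in the question by its GPU count gives about 1896 ms, 1994 ms, and 2236 ms per GPU. A roughly constant cost of about 2 s per device is what one would expect if a fixed per-device startup cost were being paid one device after another, although the numbers alone do not prove that. A minimal C sketch of the arithmetic (the function and type names are made up for illustration):

```c
#include <assert.h>
#include <stdio.h>

/* One timing result as reported in the question. */
typedef struct {
    int gpu_count;
    double total_ms;
} Run;

/* Average wall-clock time per GPU, in milliseconds. */
double ms_per_gpu(double total_ms, int gpu_count)
{
    assert(gpu_count > 0);
    return total_ms / gpu_count;
}

/* Prints the per-GPU cost for the three runs reported in the question. */
void print_per_gpu_costs(void)
{
    /* Timings copied from the post above. */
    static const Run runs[3] = {
        {8, 15166.0}, {4, 7974.0}, {1, 2236.0}
    };
    int i;

    for (i = 0; i < 3; i++) {
        printf("%d GPU(s): %.2f ms per GPU\n",
               runs[i].gpu_count,
               ms_per_gpu(runs[i].total_ms, runs[i].gpu_count));
    }
}
```

If the GPUs were doing their setup and work fully in parallel, the total time would stay near the 1-GPU figure instead of growing with the device count.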

Can you try the suggestion in this thread to run nvidia-smi in the background to speed up GPU initialization?:

Even once that is fixed, simpleMultiGPU is not a good multi-GPU benchmark as the overhead of host thread creation tends to dominate the runtime.
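To see how much of the runtime is setup rather than compute, one option is to time the phases separately inside each host thread. The sketch below is plain C with pthreads and contains no CUDA calls: init_device and run_work are hypothetical placeholders marking where the per-device setup and the kernel work would go, and ThreadTiming and time_run are likewise names invented here, not part of the SDK sample.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

#define MAX_THREADS 8 /* one host thread per GPU on the 8-GPU system */

typedef struct {
    int device;     /* device index this thread would drive */
    double init_ms; /* time spent in the per-device setup step */
    double work_ms; /* time spent in the compute step */
} ThreadTiming;

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* Hypothetical placeholder: selecting the device and making the first
   CUDA call (which creates the context) would go here. */
static void init_device(int device) { (void)device; }

/* Hypothetical placeholder: copies and kernel launches would go here. */
static void run_work(int device) { (void)device; }

static void *worker(void *arg)
{
    ThreadTiming *t = (ThreadTiming *)arg;
    double t0, t1, t2;

    t0 = now_ms();
    init_device(t->device);
    t1 = now_ms();
    run_work(t->device);
    t2 = now_ms();

    t->init_ms = t1 - t0;
    t->work_ms = t2 - t1;
    return NULL;
}

/* Starts n threads, waits for all of them, and returns the wall-clock
   time in ms for the whole create/join cycle, or -1.0 on error.
   out must have room for n entries. */
double time_run(int n, ThreadTiming *out)
{
    pthread_t threads[MAX_THREADS];
    double start;
    int i, started = 0, failed = 0;

    assert(out != NULL);
    if (n < 1 || n > MAX_THREADS)
        return -1.0;

    start = now_ms();
    for (i = 0; i < n; i++) {
        out[i].device = i;
        if (pthread_create(&threads[i], NULL, worker, &out[i]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    /* Join whatever was started, even if a later create failed. */
    for (i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return failed ? -1.0 : now_ms() - start;
}
```

Calling time_run(8, timings) and comparing the returned wall-clock time with the per-thread init_ms and work_ms values would show whether setup or compute dominates. Build with something like gcc -pthread; older glibc versions may also need -lrt for clock_gettime.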

Thank you for your answer. Regarding the command

nvidia-smi -l --interval=59 -f /var/log/nvidia-smi.log &

do I just type it in a terminal once, or does it need to be written into a file? If so, which file?

For testing, you can just run it in a separate terminal. As a long term solution, you might want to have it run automatically at startup. The best mechanism for this depends on your Linux distribution and your interest in writing init scripts or just hacking rc.local. :)

Hi, thanks a lot. As you said, simpleMultiGPU is not a good program for testing multiple GPUs. Do you have a benchmark I could test with? I followed your suggestion, and the total time is now half of what it was before. However, the time still increases with the number of GPUs used :(