Multiple GPU speed problem

I tried the simplemultipleGPU program provided with the CUDA SDK and found that using 4 GPUs is slower than using only 1 GPU. Does anybody know why?

That project is designed only to show how to use multiple GPUs; it is not meant to be a good benchmark. I explain more in this post (though you should probably scan the whole thread): …mp;#entry455131

I have changed DATA_N to 1048576*128, but the GPU times are:
1 GPU: 388 ms
4 GPUs: 1093 ms
Do you have a better multi-GPU example?

Huh, there must be a lot of thread creation overhead (are you on Windows, out of curiosity?). I don’t know of any good multi-GPU benchmarks, aside from major applications like HOOMD-Blue:

You can see from the tests that they get everything from no improvement (that’s odd) to double speed with two GPUs.
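
If you want to check whether thread start-up is really what dominates, one rough approach is to time the creation and joining of the host threads with almost no work inside them. The sketch below is only an illustration under assumptions: it uses std::thread rather than whatever threading helpers the SDK sample uses, it makes no CUDA calls, and placeholder_gpu_work and spawn_and_join_ms are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-GPU work. A real test would select a
// device and launch the kernel here; this only records that it ran.
void placeholder_gpu_work(int device_index, std::vector<int>& done) {
    done[device_index] = 1;  // each thread writes its own element
}

// Returns the wall-clock milliseconds spent creating and joining
// num_devices host threads that do almost no work. This approximates the
// overhead that ends up inside a timing that wraps thread start-up.
double spawn_and_join_ms(int num_devices, std::vector<int>& done) {
    assert(num_devices > 0);
    done.assign(num_devices, 0);

    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    threads.reserve(num_devices);
    for (int i = 0; i < num_devices; ++i) {
        threads.emplace_back(placeholder_gpu_work, i, std::ref(done));
    }
    for (auto& t : threads) {
        t.join();
    }

    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling spawn_and_join_ms(4, done) and comparing the result with the measured GPU time gives a rough idea of how much of the 4-GPU figure could be host-side thread overhead rather than kernel time. It does not account for CUDA context creation, which would have to be measured separately.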

I am on Windows Vista. I have now tried using a thread pool. The times are:
1 GPU: 388 ms
4 GPUs: 528 ms
That is faster than before, though 4 GPUs are still slower than 1.
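
For anyone reading later, the thread pool idea can be sketched roughly as follows. This is not the code used for the timings above; it is a minimal illustration in standard C++, and WorkerPool, submit, and wait_idle are hypothetical names made up for this example. The point is that the worker threads are created once and reused, so repeated rounds of work do not pay the thread start-up cost again.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size pool: threads are started in the constructor and
// reused for every submitted task until the pool is destroyed.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t num_workers) {
        assert(num_workers > 0);
        for (std::size_t i = 0; i < num_workers; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_workers_.notify_all();
        for (auto& w : workers_) {
            w.join();
        }
    }

    // Queues a task; any idle worker may pick it up.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_workers_.notify_one();
    }

    // Blocks until every task submitted so far has finished.
    void wait_idle() {
        std::unique_lock<std::mutex> lock(mutex_);
        all_done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_workers_.wait(lock, [this] {
                    return stopping_ || !tasks_.empty();
                });
                if (tasks_.empty()) {
                    return;  // stopping and no work left
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so workers overlap
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (--pending_ == 0) {
                    all_done_.notify_all();
                }
            }
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_workers_;
    std::condition_variable all_done_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
};
```

A typical use would be to create one pool with as many workers as GPUs, submit one task per GPU, call wait_idle(), and repeat for each round of work. Note that this sketch uses a single shared queue and does not pin a task to a particular thread, so a real multi-GPU version would probably need one worker (with its own queue) per device if each host thread has to stay bound to one GPU.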