WaitForMultipleObjects takes too much time

Hello all,
I am trying to use two GPUs to multiply two vectors (I want to learn how to use multiple GPUs). I took NVIDIA's simpleMultiGPU project and changed it to multiply two vectors. I found that the call to WaitForMultipleObjects takes 2800 ms, while all of the GPU commands together take just 1 ms! I have attached the project as a zipped file. Could anybody please take a look at these files and let me know what the problem is?
I should add that when I run the original simpleMultiGPU project, the whole computation doesn't take more than 200 ms. I don't know why it takes so much longer after my changes.
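One way to narrow this down, sketched below under some assumptions: if each GPU is driven by its own CPU thread, as in simpleMultiGPU, then the wait in the main thread only returns once those threads have finished, so the 2800 ms may include everything the threads do (device setup, allocations, copies), not just the kernel. The sketch times each phase inside the worker and compares it with the time the main thread spends until all workers are joined. It uses portable std::thread instead of the Win32 calls, the sleeps are stand-ins for the real GPU work, and the names worker, run_workers, WorkerTimes and Report are hypothetical, not part of the sample.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed since `start`.
static double ms_since(Clock::time_point start) {
    return std::chrono::duration<double, std::milli>(Clock::now() - start).count();
}

// Per-thread timings, filled in by the worker itself.
struct WorkerTimes {
    double setup_ms = 0.0;    // stand-in for device selection, allocation, host-to-device copies
    double compute_ms = 0.0;  // stand-in for the kernel launch plus synchronization
};

// Hypothetical worker: the sleeps stand in for the real per-GPU work.
static void worker(int setup_ms, int compute_ms, WorkerTimes* out) {
    assert(setup_ms >= 0 && compute_ms >= 0);
    auto t0 = Clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(setup_ms));
    out->setup_ms = ms_since(t0);

    auto t1 = Clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(compute_ms));
    out->compute_ms = ms_since(t1);
}

struct Report {
    std::vector<WorkerTimes> workers;
    double wall_ms = 0.0;  // from just before thread launch until every thread is joined
};

// Launches n workers and measures how long the main thread waits for all of them.
static Report run_workers(std::size_t n, int setup_ms, int compute_ms) {
    Report r;
    r.workers.resize(n);
    std::vector<std::thread> threads;

    auto t0 = Clock::now();
    for (std::size_t i = 0; i < n; ++i)
        threads.emplace_back(worker, setup_ms, compute_ms, &r.workers[i]);
    for (auto& t : threads)
        t.join();  // plays the role of the wait in the main thread
    r.wall_ms = ms_since(t0);
    return r;
}
```

Calling run_workers(2, 50, 5) and printing the fields should show a wall time close to setup plus compute for the slowest worker, even though the compute phase alone is short. Applying the same kind of per-phase timing inside the real GPU threads would show which step the wait is actually covering.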
I am using two GTX 280 cards and Windows XP.
Any help or suggestion is really appreciated!
Thank you.
cppIntegration_multiGPU_cleaned.rar (2.16 MB)

Same problem here. Have you solved the problem with WaitForMultipleObjects?