Multi-GPU runs show latencies on Linux

I am currently testing software that works with several CUDA GPUs.
The code supports both Linux and Windows.
What has been baffling me is the contradictory results between Linux and Windows.

I ran the CUDA profiler on both systems (output uploaded in the attachments).
On Windows it runs just as expected (the second CUDA device is slower because it’s a slower card):
each thread's calls are nicely packed together, which increases the overall efficiency as expected.

However, on Linux, the GPUs seem to have a hard time running threads in parallel:

  1. They are executed serially, as in the first three blocks.
  2. They are called in parallel, but they are spread out, and the overall calculation time of the threads is about the same as in serial, as in the six blocks.
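
One way to put numbers on the two cases above is to compare the wall-clock span of a group of kernels with the sum of their individual durations. The following is only a minimal sketch: the function name `timeline_summary` and all timestamps are hypothetical and are not taken from the attached profiles.

```python
def timeline_summary(intervals):
    """Return (span, busy, ratio) for a list of (start, end) kernel intervals.

    span  = wall-clock time from the earliest start to the latest end
    busy  = sum of the individual kernel durations
    ratio = busy / span (1.0 for back-to-back kernels that never overlap)
    """
    if not intervals:
        raise ValueError("need at least one interval")
    span = max(end for _, end in intervals) - min(start for start, _ in intervals)
    busy = sum(end - start for start, end in intervals)
    return span, busy, busy / span


# Hypothetical timestamps in milliseconds (illustration only).
serial = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]  # case 1: one after another
spread = [(0.0, 30.0), (0.0, 30.0), (0.0, 30.0)]    # case 2: concurrent but stretched
packed = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]    # concurrent with no slowdown

for name, data in (("serial", serial), ("spread", spread), ("packed", packed)):
    span, busy, ratio = timeline_summary(data)
    print(f"{name}: span={span:.1f} ms busy={busy:.1f} ms ratio={ratio:.2f}")
```

Note that the ratio alone cannot separate case 2 from a healthy run, since both show overlapping kernels. The span is what gives it away: in the stretched case it stays as long as in the serial case.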

Has anyone else experienced a similar problem? If so, is there anything I could do about this?

Caffeine 2 GPU Windows CudaProf Output.jpg

Caffeine 3 GPU Linux CudaProf Output.jpg

I have multi-GPU code working on Linux. We used the command-line profiler to reconstruct the timeline, and everything was fine, including the overlap of memory copies and compute kernels. After that I used the Visual Profiler and noticed exactly your problem: the launches are serialized, and the profiler says there is no overlap between memory transfers and kernels. At this point I suppose it is a bug in the Visual Profiler.
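
Once a timeline has been reconstructed, checking for copy/kernel overlap comes down to intersecting intervals. This is a rough sketch under stated assumptions, not the script used above: the function name `overlap_time` and the timestamps are hypothetical, and the intervals within each list are assumed not to overlap one another (for example, one copy stream and one compute stream).

```python
def overlap_time(copies, kernels):
    """Total time during which a memory copy and a kernel run at the same time.

    Both arguments are lists of (start, end) tuples in the same time unit.
    Assumes intervals within each list do not overlap one another; otherwise
    shared time would be counted more than once.
    """
    total = 0.0
    for c_start, c_end in copies:
        for k_start, k_end in kernels:
            # Length of the intersection, or 0 if the two intervals are disjoint.
            total += max(0.0, min(c_end, k_end) - max(c_start, k_start))
    return total


# Hypothetical timestamps in milliseconds (illustration only).
overlapped = overlap_time(copies=[(0.0, 4.0), (10.0, 14.0)],
                          kernels=[(2.0, 8.0), (12.0, 20.0)])
serialized = overlap_time(copies=[(0.0, 4.0)], kernels=[(4.0, 10.0)])
print(overlapped, serialized)
```

A result of zero on a run that is supposed to overlap transfers with compute would match what the Visual Profiler reported, while a positive value would support the command-line timeline.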

Can you update to 4.1 and try using nvvp? NVVP will also show you the API calls, which might reveal where the extra synchronization is coming from.

I gave it a try with 4.1 RC2, and that solved my kernel and memory transfer overlap issue: the profiler now shows overlapped transfers and kernels as expected.
By the way, I have another issue: “Event/metric collect failed”, with kernels behaving differently between runs …
I will open another post for that.

I’m facing something similar.

On my single-GPU system (CUDA 4.1, driver 295.20, gcc 4.4.6), the driver call, memcpy, etc. take 0.070 seconds.

On my multi-GPU system (CUDA 4.2, driver 295.45, gcc 4.5.3), they take 1.070 seconds!

Does anybody know whether there is a driver issue or something like that in