5x speedup when using 4 GPUs?

I am developing software on a Tesla C2075 in a single workstation for an embarrassingly parallel problem. On an Ubuntu 12.04 workstation with 1 C2075 (CUDA 4.2), the program runs in 4n seconds. On a CentOS 5 workstation with 4 C2075s (CUDA 4.0), the same program using

  • 4 C2075s needs ~n seconds
  • 1 C2075 needs ~5n seconds

It seems strange to me that

  • we get a 5x speedup when we use 4 GPUs
  • the time required for 1 C2075 on CentOS is not the same as on Ubuntu

Is there a problem with the CUDA driver or the operating system here? Thanks.

The OS, driver, and CUDA version can all play a role here. It is very hard to tell which one is responsible without a process of elimination.
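
As a first step in that elimination, it may help to check which single-GPU baseline the speedup is measured against. The sketch below only redoes the arithmetic from the timings in the question (4n, 5n and n seconds). The function names `speedup` and `efficiency` are made up for illustration, and `n` is set to an arbitrary unit of 1.0.

```python
def speedup(t_single, t_multi):
    """Ratio of single-GPU run time to multi-GPU run time."""
    return t_single / t_multi


def efficiency(t_single, t_multi, n_gpus):
    """Speedup divided by GPU count; 1.0 means perfectly linear scaling."""
    return speedup(t_single, t_multi) / n_gpus


n = 1.0                 # arbitrary time unit (assumption, only ratios matter)
ubuntu_1gpu = 4 * n     # Ubuntu 12.04, CUDA 4.2, 1 x C2075
centos_1gpu = 5 * n     # CentOS 5, CUDA 4.0, 1 x C2075
centos_4gpu = 1 * n     # CentOS 5, CUDA 4.0, 4 x C2075
N_GPUS = 4

baselines = [
    ("CentOS 1-GPU baseline", centos_1gpu),
    ("Ubuntu 1-GPU baseline", ubuntu_1gpu),
]
for label, baseline in baselines:
    s = speedup(baseline, centos_4gpu)
    e = efficiency(baseline, centos_4gpu, N_GPUS)
    print(f"{label}: speedup {s:.2f}x, efficiency {e:.2f}")
```

With the approximate timings from the question, the 4-GPU run is 4x faster than the Ubuntu single-GPU run (efficiency 1.00) and 5x faster than the CentOS single-GPU run (efficiency 1.25). Read that way, the 4-GPU result looks like ordinary linear scaling, and the number that stands out is the single-GPU CentOS run, which is roughly 25% slower than the same card on Ubuntu. This is only a reading of the reported numbers, not a diagnosis of the cause.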