Multi-GPU initialization is slow with OpenMPI on a cluster

I have a program that uses multiple GPUs (2 to 8) with OpenMPI. However, I find that the initialization time grows with the number of GPUs. I record MPI_Wtime() at the beginning of a C++ class constructor (which includes some cudaMalloc and cudaMemset calls) and again at the end of the constructor. The timings for the different GPU counts are as follows:

2 GPUs:
GPU 0: start 1490014320.378046, end 1490014321.272212
GPU 1: start 1490014320.378051, end 1490014321.272682

4 GPUs:
GPU 0: start 1490019627.263328, end 1490019628.883257
GPU 1: start 1490019627.263316, end 1490019628.897836
GPU 2: start 1490019627.263319, end 1490019628.908624
GPU 3: start 1490019627.263327, end 1490019628.917646

6 GPUs:
GPU 0: start 1490020885.638772, end 1490020888.309326
GPU 1: start 1490020885.638832, end 1490020888.310065
GPU 2: start 1490020885.638828, end 1490020888.309444
GPU 3: start 1490020885.638832, end 1490020888.309797
GPU 4: start 1490020885.638702, end 1490020888.309907
GPU 5: start 1490020885.638772, end 1490020888.310181

8 GPUs:
GPU 0: start 1490029784.260837, end 1490029788.039831
GPU 1: start 1490029784.260856, end 1490029788.039424
GPU 2: start 1490029784.260854, end 1490029788.040519
GPU 3: start 1490029784.260851, end 1490029788.039787
GPU 4: start 1490029784.260858, end 1490029788.040336
GPU 5: start 1490029784.260879, end 1490029788.039929
GPU 6: start 1490029784.260881, end 1490029788.040165
GPU 7: start 1490029784.260939, end 1490029788.040401

From these results we can see that:
2 GPUs: the total time is around 0.9 s
4 GPUs: the total time is around 1.6 s
6 GPUs: the total time is around 2.7 s
8 GPUs: the total time is around 3.8 s

I use exactly the same code for the different GPU counts, with CUDA 7.5 and OpenMPI 1.8.5. I’m very curious about this behaviour, since my program only runs for 10-20 seconds and in some cases initialization takes about 1/3 of the total time.
Moreover, on my own PC (1 GPU) the initialization time is quite short (around 0.01 s). Why do the GPUs on NCI take so long to initialize?
Thanks