Caffe on a single GPU is faster than on multiple GPUs with a small batch size

I followed this guide, http://www.nvidia.com/object/caffe-installation.html, to set up Caffe with 2 GPUs (P100s). The installation was successful and Caffe ran on both of my GPUs. I quickly ran the MNIST example with a single GPU and with both GPUs.

With the small default batch sizes (train batch size 64, test batch size 100), the single GPU ran faster and processed more images per second than the two GPUs. I did not like this result, so I increased the batch sizes to a train batch of 512 and a test batch of 800. With these large batch sizes, the two GPUs ran slightly faster and processed more images per second than the single GPU.

I think a single- vs. multi-GPU comparison on MNIST is not a good example because the dataset is so small; I don't even need 2 GPUs to train this model. But I would like to see my two GPUs run faster than one, so I increased the batch size until the dual-GPU run outperformed the single-GPU run, which it did. Do you think my multi-GPU Caffe is running correctly?
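For what it's worth, the logs below suggest why this happens: Caffe splits the train batch across devices (the 2-GPU log shows "406.3 * 32", i.e. 64 / 2 = 32 images per GPU per iteration), so at small batches each GPU does very little compute per step while the gradient-synchronization cost stays roughly fixed. A small sketch of the scaling arithmetic, in plain Python, using the img/sec figures from my logs:

```python
# Overall throughput figures taken from the Caffe logs below (img/sec).
single_small = 3.098e4    # 1 GPU,  train batch 64
multi_small  = 25997.6    # 2 GPUs, train batch 64 (32 per GPU)
single_large = 9.348e4    # 1 GPU,  train batch 512
multi_large  = 128343.0   # 2 GPUs, train batch 512 (256 per GPU)

n_gpus = 2
train_batch = 64
per_gpu_batch = train_batch // n_gpus   # Caffe splits the batch across GPUs

# Speedup vs. one GPU, and scaling efficiency (1.0 = perfect linear scaling).
speedup_small = multi_small / single_small          # < 1: 2 GPUs are SLOWER
speedup_large = multi_large / single_large          # > 1: 2 GPUs are faster
eff_small = multi_small / (n_gpus * single_small)   # roughly 0.42
eff_large = multi_large / (n_gpus * single_large)   # roughly 0.69

print(per_gpu_batch, round(speedup_small, 2), round(speedup_large, 2))
```

So even in the "good" large-batch case the two GPUs reach only about 69% of ideal 2x scaling, which looks like normal synchronization overhead rather than a broken installation.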

Here are the small-batch results.
1 GPU with train batch size 64, test batch size 100:
I0531 18:55:08.205492 2381 solver.cpp:418] Iteration 10000, loss = 0.0027109
I0531 18:55:08.205526 2381 solver.cpp:441] Iteration 10000, Testing net (#0)
I0531 18:55:08.285313 2421 data_reader.cpp:128] Restarting data pre-fetching
I0531 18:55:08.287940 2381 solver.cpp:526] Test net output #0: accuracy = 0.991
I0531 18:55:08.287976 2381 solver.cpp:526] Test net output #1: loss = 0.0278788 (* 1 = 0.0278788 loss)
I0531 18:55:08.288008 2381 caffe.cpp:231] Solver performance on device 0: 484 * 64 = 3.098e+04 img/sec
I0531 18:55:08.288030 2381 caffe.cpp:234] Optimization Done in 22s

real 0m22.811s
user 0m28.461s
sys 0m7.427s

2 GPUs with train batch size 64, test batch size 100:
I0531 18:54:08.060204 2355 solver.cpp:418] Iteration 10000, loss = 0.00512686
I0531 18:54:08.060246 2355 solver.cpp:441] Iteration 10000, Testing net (#0)
I0531 18:54:08.134591 2353 data_reader.cpp:128] Restarting data pre-fetching
I0531 18:54:08.137965 2355 solver.cpp:526] Test net output #0: accuracy = 0.9896
I0531 18:54:08.138015 2355 solver.cpp:526] Test net output #1: loss = 0.0330716 (* 1 = 0.0330716 loss)
I0531 18:54:08.142132 2313 parallel.cpp:77] Root Solver performance on device 0: 406.3 * 32 = 1.3e+04 img/sec
I0531 18:54:08.142172 2313 parallel.cpp:82] Solver performance on device 1: 406.1 * 32 = 1.3e+04 img/sec
I0531 18:54:08.142192 2313 parallel.cpp:85] Overall multi-GPU performance: 25997.6 img/sec
I0531 18:54:08.174340 2313 caffe.cpp:234] Optimization Done in 27s

real 0m28.253s
user 0m58.389s
sys 0m14.847s

And here are the large-batch results.
1 GPU with train batch size 512, test batch size 800:
I0531 19:09:27.588884 2709 solver.cpp:418] Iteration 10000, loss = 0.00504461
I0531 19:09:27.588932 2709 solver.cpp:441] Iteration 10000, Testing net (#0)
I0531 19:09:27.628216 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.673048 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.716518 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.749955 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.782929 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.815544 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.848457 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.886387 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.894071 2709 solver.cpp:526] Test net output #0: accuracy = 0.9901
I0531 19:09:27.894125 2709 solver.cpp:526] Test net output #1: loss = 0.0304051 (* 1 = 0.0304051 loss)
I0531 19:09:27.894165 2709 caffe.cpp:231] Solver performance on device 0: 182.6 * 512 = 9.348e+04 img/sec
I0531 19:09:27.894201 2709 caffe.cpp:234] Optimization Done in 56s

real 0m57.042s
user 1m28.784s
sys 0m17.242s

2 GPUs with train batch size 512, test batch size 800:
I0531 19:10:36.834173 2805 solver.cpp:418] Iteration 10000, loss = 0.00573948
I0531 19:10:36.834216 2805 solver.cpp:441] Iteration 10000, Testing net (#0)
I0531 19:10:36.855373 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.876559 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.899631 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.921191 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.944612 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.962541 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.980353 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.997534 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:37.001943 2805 solver.cpp:526] Test net output #0: accuracy = 0.9904
I0531 19:10:37.002003 2805 solver.cpp:526] Test net output #1: loss = 0.0319385 (* 1 = 0.0319385 loss)
I0531 19:10:37.007083 2762 parallel.cpp:77] Root Solver performance on device 0: 250.7 * 256 = 6.418e+04 img/sec
I0531 19:10:37.007123 2762 parallel.cpp:82] Solver performance on device 1: 250.6 * 256 = 6.416e+04 img/sec
I0531 19:10:37.007136 2762 parallel.cpp:85] Overall multi-GPU performance: 128343 img/sec
I0531 19:10:37.042526 2762 caffe.cpp:234] Optimization Done in 43s

real 0m44.052s
user 1m57.430s
sys 0m23.410s