Caffe on a single GPU is faster than on multiple GPUs with a small batch size

I followed this guide, http://www.nvidia.com/object/caffe-installation.html, to set up Caffe with 2 GPUs (P100s). The installation was successful and Caffe ran on both of my GPUs. I quickly ran the MNIST example with a single GPU and with both GPUs.

With the small default batch sizes (train batch size 64, test batch size 100), the single GPU ran faster and processed more images per second than the two GPUs. I did not like this result, so I increased the batch sizes to a train batch of 512 and a test batch of 800. With these large batch sizes, the two GPUs ran slightly faster and processed more images per second than the single GPU.

I think a single- vs. multi-GPU comparison on MNIST is not a good example because the dataset is so small; I don't even need 2 GPUs to train this model. But I would like to see my two GPUs run faster than one, so I increased the batch size until the dual-GPU run outperformed the single-GPU run, which it did. Do you think my multi-GPU Caffe is running correctly?
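For what it's worth, the logs below suggest why this happens: Caffe splits the train batch across devices (the 2-GPU log shows "406.3 * 32", i.e. 64 / 2 = 32 images per GPU per iteration), so at small batches each GPU does very little compute per step while the gradient-synchronization cost stays roughly fixed. A small sketch of the scaling arithmetic, in plain Python, using the img/sec figures from my logs:

```python
# Overall throughput figures taken from the Caffe logs below (img/sec).
single_small = 3.098e4    # 1 GPU,  train batch 64
multi_small  = 25997.6    # 2 GPUs, train batch 64 (32 per GPU)
single_large = 9.348e4    # 1 GPU,  train batch 512
multi_large  = 128343.0   # 2 GPUs, train batch 512 (256 per GPU)

n_gpus = 2
train_batch = 64
per_gpu_batch = train_batch // n_gpus   # Caffe splits the batch across GPUs

# Speedup vs. one GPU, and scaling efficiency (1.0 = perfect linear scaling).
speedup_small = multi_small / single_small          # < 1: 2 GPUs are SLOWER
speedup_large = multi_large / single_large          # > 1: 2 GPUs are faster
eff_small = multi_small / (n_gpus * single_small)   # roughly 0.42
eff_large = multi_large / (n_gpus * single_large)   # roughly 0.69

print(per_gpu_batch, round(speedup_small, 2), round(speedup_large, 2))
```

So even in the "good" large-batch case the two GPUs reach only about 69% of ideal 2x scaling, which looks like normal synchronization overhead rather than a broken installation.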

Here are the small-batch results.
1 GPU with train batch size 64, test batch size 100:
I0531 18:55:08.205492 2381 solver.cpp:418] Iteration 10000, loss = 0.0027109
I0531 18:55:08.205526 2381 solver.cpp:441] Iteration 10000, Testing net (#0)
I0531 18:55:08.285313 2421 data_reader.cpp:128] Restarting data pre-fetching
I0531 18:55:08.287940 2381 solver.cpp:526] Test net output #0: accuracy = 0.991
I0531 18:55:08.287976 2381 solver.cpp:526] Test net output #1: loss = 0.0278788 (* 1 = 0.0278788 loss)
I0531 18:55:08.288008 2381 caffe.cpp:231] Solver performance on device 0: 484 * 64 = 3.098e+04 img/sec
I0531 18:55:08.288030 2381 caffe.cpp:234] Optimization Done in 22s

real 0m22.811s
user 0m28.461s
sys 0m7.427s

2 GPUs with train batch size 64, test batch size 100:
I0531 18:54:08.060204 2355 solver.cpp:418] Iteration 10000, loss = 0.00512686
I0531 18:54:08.060246 2355 solver.cpp:441] Iteration 10000, Testing net (#0)
I0531 18:54:08.134591 2353 data_reader.cpp:128] Restarting data pre-fetching
I0531 18:54:08.137965 2355 solver.cpp:526] Test net output #0: accuracy = 0.9896
I0531 18:54:08.138015 2355 solver.cpp:526] Test net output #1: loss = 0.0330716 (* 1 = 0.0330716 loss)
I0531 18:54:08.142132 2313 parallel.cpp:77] Root Solver performance on device 0: 406.3 * 32 = 1.3e+04 img/sec
I0531 18:54:08.142172 2313 parallel.cpp:82] Solver performance on device 1: 406.1 * 32 = 1.3e+04 img/sec
I0531 18:54:08.142192 2313 parallel.cpp:85] Overall multi-GPU performance: 25997.6 img/sec
I0531 18:54:08.174340 2313 caffe.cpp:234] Optimization Done in 27s

real 0m28.253s
user 0m58.389s
sys 0m14.847s

And here are the large-batch results.
1 GPU with train batch size 512, test batch size 800:
I0531 19:09:27.588884 2709 solver.cpp:418] Iteration 10000, loss = 0.00504461
I0531 19:09:27.588932 2709 solver.cpp:441] Iteration 10000, Testing net (#0)
I0531 19:09:27.628216 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.673048 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.716518 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.749955 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.782929 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.815544 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.848457 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.886387 2748 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:09:27.894071 2709 solver.cpp:526] Test net output #0: accuracy = 0.9901
I0531 19:09:27.894125 2709 solver.cpp:526] Test net output #1: loss = 0.0304051 (* 1 = 0.0304051 loss)
I0531 19:09:27.894165 2709 caffe.cpp:231] Solver performance on device 0: 182.6 * 512 = 9.348e+04 img/sec
I0531 19:09:27.894201 2709 caffe.cpp:234] Optimization Done in 56s

real 0m57.042s
user 1m28.784s
sys 0m17.242s

2 GPUs with train batch size 512, test batch size 800:
I0531 19:10:36.834173 2805 solver.cpp:418] Iteration 10000, loss = 0.00573948
I0531 19:10:36.834216 2805 solver.cpp:441] Iteration 10000, Testing net (#0)
I0531 19:10:36.855373 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.876559 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.899631 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.921191 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.944612 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.962541 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.980353 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:36.997534 2803 data_reader.cpp:128] Restarting data pre-fetching
I0531 19:10:37.001943 2805 solver.cpp:526] Test net output #0: accuracy = 0.9904
I0531 19:10:37.002003 2805 solver.cpp:526] Test net output #1: loss = 0.0319385 (* 1 = 0.0319385 loss)
I0531 19:10:37.007083 2762 parallel.cpp:77] Root Solver performance on device 0: 250.7 * 256 = 6.418e+04 img/sec
I0531 19:10:37.007123 2762 parallel.cpp:82] Solver performance on device 1: 250.6 * 256 = 6.416e+04 img/sec
I0531 19:10:37.007136 2762 parallel.cpp:85] Overall multi-GPU performance: 128343 img/sec
I0531 19:10:37.042526 2762 caffe.cpp:234] Optimization Done in 43s

real 0m44.052s
user 1m57.430s
sys 0m23.410s