Thank you, njuffa; it is an interesting paper.
However, the paper actually says that the Caffe implementation involves a lot of GPU-to-GPU communication, which I am not seeing in my experiments. Or at least, the majority of the data transfer in my profile is Host-to-Device.
One reason might be that the paper runs Caffe over MPI, while I am running it on a single machine, without MPI.
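For context on what I would expect to see if the GPUs were talking to each other directly: as far as I understand, an explicit peer copy like the sketch below (my own test code, not anything from Caffe) shows up in the profile as a device-to-device / peer-to-peer memcpy rather than as Host-to-Device traffic.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch (my own test code, not Caffe's): push a buffer straight
// from GPU 0 to GPU 1. With peer access enabled, the profiler reports this
// as a peer/device-to-device memcpy instead of a Host-to-Device one.
int main() {
    const size_t bytes = 64 << 20;  // arbitrary 64 MiB test buffer
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU 0 reach GPU 1?

    float *buf0 = NULL, *buf1 = NULL;
    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    if (canAccess)
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Explicit GPU-to-GPU copy: dst on device 1, src on device 0. Without
    // peer access the runtime silently stages this through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    printf("peer access 0 -> 1 possible: %d\n", canAccess);
    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```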
For reference, here is the full nvprof output for my run:
==29838== Profiling application: ./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt -gpu 0,1
==29838== Profiling result:
Time(%) Time Calls Avg Min Max Name
71.57% 34.6426s 129456 267.60us 228.86us 481.62us sgemm_sm35_ldg_nn_64x16x64x16x16
11.30% 5.46827s 80928 67.569us 29.600us 141.08us void caffe::im2col_gpu_kernel<float>(int, float const *, int, int, int, int, int, int, int, int, int, int, int, int, float*)
5.25% 2.54033s 969 2.6216ms 830.44us 5.4433ms sgemm_sm35_ldg_tn_32x16x64x8x16
3.31% 1.60216s 3628 441.61us 1.3120us 22.329ms [CUDA memcpy HtoD]
2.15% 1.03878s 81897 12.684us 7.1680us 26.975us void gemmk1_kernel<float, int=256, int=5, bool=0, bool=0, bool=0, bool=0>(cublasGemmk1Params<float>, float const *, float const *, float)
1.99% 961.20ms 971 989.91us 266.42us 1.5735ms void caffe::MaxPoolForward<float>(int, float const *, int, int, int, int, int, int, int, int, int, int, int, int, float*, int*, float*)
1.54% 746.81ms 2264 329.86us 13.823us 1.0193ms void caffe::ReLUForward<float>(int, float const *, float*, float)
0.57% 276.63ms 648 426.89us 326.39us 528.98us void caffe::LRNComputeOutput<float>(int, float const *, float const *, float, float*)
0.56% 272.58ms 323 843.90us 843.50us 844.81us void caffe::kernel_channel_max<float>(int, int, int, float const *, float*)
0.56% 272.57ms 323 843.86us 843.18us 857.86us void caffe::kernel_channel_sum<float>(int, int, int, float const *, float*)
0.49% 237.80ms 648 366.98us 321.98us 453.97us void caffe::LRNFillScale<float>(int, float const *, int, int, int, int, int, float, float, float*)
0.36% 175.70ms 1649 106.55us 2.9760us 19.898ms [CUDA memcpy DtoH]
0.29% 142.69ms 971 146.96us 2.3360us 438.00us [CUDA memcpy DtoD]
0.01% 4.2295ms 62 68.217us 1.1840us 1.5499ms [CUDA memset]
0.01% 4.0577ms 646 6.2810us 3.9040us 9.1510us void dot_kernel<float, float, float, int=128, int=0, int=0>(cublasDotParams<float, float>)
0.01% 3.6119ms 323 11.182us 10.752us 12.703us void caffe::kernel_channel_div<float>(int, int, int, int, float const *, float*)
0.01% 3.3833ms 646 5.2370us 4.6710us 6.1760us void asum_kernel<float, float, int=0>(cublasAsumParams<float, float>)
0.01% 3.1036ms 323 9.6080us 9.2800us 9.8230us void caffe::kernel_channel_subtract<float>(int, int, int, int, float const *, float*)
0.01% 2.5553ms 646 3.9550us 2.9760us 5.1840us void reduce_1Block_kernel<float, float, float, int=128, int=7>(float, int, float*)
0.00% 2.2668ms 323 7.0180us 6.8790us 8.4480us void caffe::kernel_exp<float>(int, float const *, float*)
0.00% 1.9116ms 323 5.9180us 5.8230us 6.1760us void caffe::SoftmaxLossForwardGPU<float>(int, float const *, float const *, float*, int, int, int, bool, int, float const *)
==29838== API calls:
Time(%) Time Calls Avg Min Max Name
54.92% 41.1956s 4946 8.3291ms 12.313us 28.897ms cudaMemcpy
16.96% 12.7241s 23 553.22ms 1.5630us 12.3990s cudaFree
16.39% 12.2905s 86 142.91ms 8.6410us 12.2668s cudaMalloc
4.99% 3.73945s 301657 12.396us 9.6850us 27.883ms cudaLaunch
2.28% 1.70803s 3417287 499ns 373ns 9.5152ms cudaSetupArgument
1.89% 1.41735s 1626 871.68us 4.4300us 22.349ms cudaStreamSynchronize
0.81% 609.10ms 1302 467.81us 21.051us 28.922ms cudaMemcpyAsync
0.77% 578.33ms 1348 429.03us 1.5000us 28.901ms cudaEventDestroy
0.43% 319.13ms 498 640.81us 717ns 32.000ms cudaMallocHost
0.24% 181.31ms 301658 601ns 401ns 92.971us cudaConfigureCall
0.15% 113.50ms 215552 526ns 369ns 213.86us cudaGetLastError
0.06% 46.345ms 85459 542ns 395ns 67.213us cudaPeekAtLastError
0.03% 23.014ms 1332 17.277us 1.5550us 9.4235ms cudaEventCreate
0.02% 14.048ms 16 878.02us 47.450us 6.2861ms cudaFreeHost
0.02% 12.400ms 12 1.0333ms 739.84us 1.8946ms cudaGetDeviceProperties
0.01% 9.5804ms 1060 9.0380us 242ns 337.73us cuDeviceGetAttribute
0.01% 6.8362ms 969 7.0540us 5.3460us 192.72us cudaFuncGetAttributes
0.01% 6.7220ms 12 560.16us 553.32us 569.49us cuDeviceTotalMem
0.01% 4.2484ms 969 4.3840us 3.2600us 159.22us cudaEventQuery
0.00% 2.7236ms 1 2.7236ms 2.7236ms 2.7236ms cudaDeviceEnablePeerAccess
0.00% 2.4173ms 969 2.4940us 1.8120us 50.789us cudaEventRecord
0.00% 1.2061ms 62 19.453us 12.218us 83.530us cudaMemset
0.00% 843.32us 128 6.5880us 1.1760us 316.44us cudaEventCreateWithFlags
0.00% 770.11us 12 64.175us 58.939us 72.246us cuDeviceGetName
0.00% 242.33us 3 80.777us 46.544us 132.24us cudaStreamCreateWithFlags
0.00% 162.79us 85 1.9150us 815ns 45.469us cudaGetDevice
0.00% 94.456us 88 1.0730us 600ns 2.8990us cudaDeviceGetAttribute
0.00% 91.937us 17 5.4080us 990ns 18.910us cudaSetDevice
0.00% 25.502us 1 25.502us 25.502us 25.502us cudaDeviceCanAccessPeer
0.00% 18.858us 2 9.4290us 8.6680us 10.190us cudaThreadSynchronize
0.00% 9.4160us 20 470ns 266ns 2.0450us cuDeviceGet
0.00% 6.1370us 5 1.2270us 357ns 4.2030us cuDeviceGetCount
0.00% 2.3950us 2 1.1970us 907ns 1.4880us cuInit
0.00% 1.2410us 2 620ns 439ns 802ns cuDriverGetVersion
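One more thing I notice in the API trace: cudaDeviceCanAccessPeer and cudaDeviceEnablePeerAccess are each called once, so Caffe does set up peer access between the two boards, yet [CUDA memcpy DtoD] accounts for only 0.29% of the GPU time while Host-to-Device copies take the bulk of the transfer time (3.31%). Running nvprof with --print-gpu-trace would also show the size and direction of every individual copy. To sanity-check the copy paths themselves, I would time HtoD and DtoD transfers separately with CUDA events, along these lines (a rough sketch of my own; the buffer size and iteration count are arbitrary):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Rough sketch (my own test, not Caffe code): average copy time for a
// given direction, measured with CUDA events on the default stream.
static float avgCopyMs(void *dst, const void *src, size_t bytes,
                       cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i)
        cudaMemcpy(dst, src, bytes, kind);  // stream-ordered copies
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 10.0f;  // average per copy
}

int main() {
    const size_t bytes = 64 << 20;  // arbitrary 64 MiB buffer
    float *host = NULL, *devA = NULL, *devB = NULL;
    cudaMallocHost(&host, bytes);  // pinned, matching the cudaMallocHost calls in the trace
    cudaMalloc(&devA, bytes);
    cudaMalloc(&devB, bytes);
    printf("HtoD: %.3f ms\n",
           avgCopyMs(devA, host, bytes, cudaMemcpyHostToDevice));
    printf("DtoD: %.3f ms\n",
           avgCopyMs(devB, devA, bytes, cudaMemcpyDeviceToDevice));
    cudaFreeHost(host);
    cudaFree(devA);
    cudaFree(devB);
    return 0;
}
```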