P2P communication from Multi-GPUs for real applications

Hi All,

I am investigating the impact of PCIe/NVLink topology on deep learning applications on Power8 machines.

From the literature, and via the NVIDIA samples simpleP2P and p2pBandwidthLatencyTest, it is possible to verify the bandwidth and latency differences between intra-socket and inter-socket GPU communication. However, I want to investigate this with real applications.
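For reference, the kind of measurement those samples perform can be sketched in a few lines of CUDA: check and enable peer access between devices 0 and 1, then time a cudaMemcpyPeer with events. This is only a minimal sketch (the buffer size and device IDs are arbitrary, and error handling is reduced to a single macro), not a substitute for p2pBandwidthLatencyTest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); \
    if (e != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

int main() {
    // Peer access must be possible in both directions before enabling it.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);
    if (!can01 || !can10) return 0;

    const size_t bytes = 64 << 20;  // 64 MiB test buffer (arbitrary size)
    float *buf0, *buf1;
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaMalloc(&buf0, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));
    CHECK(cudaMalloc(&buf1, bytes));

    // Time a direct device-to-device copy; with peer access enabled it goes
    // over NVLink/PCIe P2P instead of being staged through host memory.
    cudaEvent_t start, stop;
    CHECK(cudaSetDevice(0));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpyPeer(buf1, 1, buf0, 0, bytes));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("P2P copy: %.2f GB/s\n", bytes / ms / 1e6);
    return 0;
}
```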

As far as I can tell, Caffe AlexNet, contrary to what I was expecting, actually benefits from having its GPUs on different processor sockets, since the bottleneck is in the HtoD and DtoH transfers and not in DtoD (or P2P) communication.

My question is: am I missing something? Has anyone seen deep learning applications, or any other kind of real applications (benchmarks), that are sensitive to GPU-to-GPU communication?

Thank you in advance.


This is outside my area of expertise, but I note that the following paper mentions use of P2P:

Linnan Wang, Wei Wu, George Bosilca, Richard Vuduc, Zenglin Xu, “Efficient Communications in Training Large Scale Neural Networks”, arXiv preprint, Nov. 2016 (https://arxiv.org/pdf/1611.04255.pdf).

Thank you njuffa, it is an interesting paper.

However, the paper actually says that in the Caffe implementation there is a lot of GPU-to-GPU communication, which I am not seeing in my experiments. Or at least, the majority of my data transfer time is host-to-device.

One reason might be that the paper runs the application over MPI, while I am using Caffe on a single machine without MPI.

For more information, here is the nvprof profile of my application:
==29838== Profiling application: ./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt -gpu 0,1
==29838== Profiling result:
Time(%) Time Calls Avg Min Max Name
71.57% 34.6426s 129456 267.60us 228.86us 481.62us sgemm_sm35_ldg_nn_64x16x64x16x16
11.30% 5.46827s 80928 67.569us 29.600us 141.08us void caffe::im2col_gpu_kernel(int, float const , int, int, int, int, int, int, int, int, int, int, int, int, caffe::im2col_gpu_kernel)
5.25% 2.54033s 969 2.6216ms 830.44us 5.4433ms sgemm_sm35_ldg_tn_32x16x64x8x16
3.31% 1.60216s 3628 441.61us 1.3120us 22.329ms [CUDA memcpy HtoD]
2.15% 1.03878s 81897 12.684us 7.1680us 26.975us void gemmk1_kernel<float, int=256, int=5, bool=0, bool=0, bool=0, bool=0>(cublasGemmk1Params, float const , float const , float)
1.99% 961.20ms 971 989.91us 266.42us 1.5735ms void caffe::MaxPoolForward(int, float const , int, int, int, int, int, int, int, int, int, int, int, int, caffe::MaxPoolForward, int, float const )
1.54% 746.81ms 2264 329.86us 13.823us 1.0193ms void caffe::ReLUForward(int, float const , caffe::ReLUForward, caffe::ReLUForward)
0.57% 276.63ms 648 426.89us 326.39us 528.98us void caffe::LRNComputeOutput(int, float const , float const , caffe::LRNComputeOutput, caffe::LRNComputeOutput)
0.56% 272.58ms 323 843.90us 843.50us 844.81us void caffe::kernel_channel_max(int, int, int, float const , caffe::kernel_channel_max)
0.56% 272.57ms 323 843.86us 843.18us 857.86us void caffe::kernel_channel_sum(int, int, int, float const , caffe::kernel_channel_sum)
0.49% 237.80ms 648 366.98us 321.98us 453.97us void caffe::LRNFillScale(int, float const , int, int, int, int, int, caffe::LRNFillScale, caffe::LRNFillScale, caffe::LRNFillScale)
0.36% 175.70ms 1649 106.55us 2.9760us 19.898ms [CUDA memcpy DtoH]
0.29% 142.69ms 971 146.96us 2.3360us 438.00us [CUDA memcpy DtoD]
0.01% 4.2295ms 62 68.217us 1.1840us 1.5499ms [CUDA memset]
0.01% 4.0577ms 646 6.2810us 3.9040us 9.1510us void dot_kernel<float, float, float, int=128, int=0, int=0>(cublasDotParams<float, float>)
0.01% 3.6119ms 323 11.182us 10.752us 12.703us void caffe::kernel_channel_div(int, int, int, int, float const , caffe::kernel_channel_div)
0.01% 3.3833ms 646 5.2370us 4.6710us 6.1760us void asum_kernel<float, float, int=0>(cublasAsumParams<float, float>)
0.01% 3.1036ms 323 9.6080us 9.2800us 9.8230us void caffe::kernel_channel_subtract(int, int, int, int, float const , caffe::kernel_channel_subtract)
0.01% 2.5553ms 646 3.9550us 2.9760us 5.1840us void reduce_1Block_kernel<float, float, float, int=128, int=7>(float, int, float*)
0.00% 2.2668ms 323 7.0180us 6.8790us 8.4480us void caffe::kernel_exp(int, float const , caffe::kernel_exp)
0.00% 1.9116ms 323 5.9180us 5.8230us 6.1760us void caffe::SoftmaxLossForwardGPU(int, float const , float const , caffe::SoftmaxLossForwardGPU, int, int, int, bool, int, float const *)

==29838== API calls:
Time(%) Time Calls Avg Min Max Name
54.92% 41.1956s 4946 8.3291ms 12.313us 28.897ms cudaMemcpy
16.96% 12.7241s 23 553.22ms 1.5630us 12.3990s cudaFree
16.39% 12.2905s 86 142.91ms 8.6410us 12.2668s cudaMalloc
4.99% 3.73945s 301657 12.396us 9.6850us 27.883ms cudaLaunch
2.28% 1.70803s 3417287 499ns 373ns 9.5152ms cudaSetupArgument
1.89% 1.41735s 1626 871.68us 4.4300us 22.349ms cudaStreamSynchronize
0.81% 609.10ms 1302 467.81us 21.051us 28.922ms cudaMemcpyAsync
0.77% 578.33ms 1348 429.03us 1.5000us 28.901ms cudaEventDestroy
0.43% 319.13ms 498 640.81us 717ns 32.000ms cudaMallocHost
0.24% 181.31ms 301658 601ns 401ns 92.971us cudaConfigureCall
0.15% 113.50ms 215552 526ns 369ns 213.86us cudaGetLastError
0.06% 46.345ms 85459 542ns 395ns 67.213us cudaPeekAtLastError
0.03% 23.014ms 1332 17.277us 1.5550us 9.4235ms cudaEventCreate
0.02% 14.048ms 16 878.02us 47.450us 6.2861ms cudaFreeHost
0.02% 12.400ms 12 1.0333ms 739.84us 1.8946ms cudaGetDeviceProperties
0.01% 9.5804ms 1060 9.0380us 242ns 337.73us cuDeviceGetAttribute
0.01% 6.8362ms 969 7.0540us 5.3460us 192.72us cudaFuncGetAttributes
0.01% 6.7220ms 12 560.16us 553.32us 569.49us cuDeviceTotalMem
0.01% 4.2484ms 969 4.3840us 3.2600us 159.22us cudaEventQuery
0.00% 2.7236ms 1 2.7236ms 2.7236ms 2.7236ms cudaDeviceEnablePeerAccess
0.00% 2.4173ms 969 2.4940us 1.8120us 50.789us cudaEventRecord
0.00% 1.2061ms 62 19.453us 12.218us 83.530us cudaMemset
0.00% 843.32us 128 6.5880us 1.1760us 316.44us cudaEventCreateWithFlags
0.00% 770.11us 12 64.175us 58.939us 72.246us cuDeviceGetName
0.00% 242.33us 3 80.777us 46.544us 132.24us cudaStreamCreateWithFlags
0.00% 162.79us 85 1.9150us 815ns 45.469us cudaGetDevice
0.00% 94.456us 88 1.0730us 600ns 2.8990us cudaDeviceGetAttribute
0.00% 91.937us 17 5.4080us 990ns 18.910us cudaSetDevice
0.00% 25.502us 1 25.502us 25.502us 25.502us cudaDeviceCanAccessPeer
0.00% 18.858us 2 9.4290us 8.6680us 10.190us cudaThreadSynchronize
0.00% 9.4160us 20 470ns 266ns 2.0450us cuDeviceGet
0.00% 6.1370us 5 1.2270us 357ns 4.2030us cuDeviceGetCount
0.00% 2.3950us 2 1.1970us 907ns 1.4880us cuInit
0.00% 1.2410us 2 620ns 439ns 802ns cuDriverGetVersion

As I said, this is not my area of expertise. But as a hypothesis, the amount of GPU-to-GPU communication may depend on the kind and size of network being trained.

You could try contacting the corresponding author of the paper to ask about the discrepancy in your observations and theirs.

As for MPI, you would need to use an MPI version that supports the GPU peer-to-peer mechanism and is configured to use it to see any benefits from P2P. Since MPI is a well-established and frequently used parallelization technology, it stands to reason that applications would typically rely on that to exploit GPU P2P, rather than trying to set up P2P themselves; why reinvent the wheel?
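For what it's worth, the CUDA-aware MPI path alluded to above looks roughly like the sketch below (assuming an MPI built with CUDA support, e.g. Open MPI configured with --with-cuda; the one-GPU-per-rank mapping is a simplification). The application passes device pointers straight to MPI_Send/MPI_Recv, and the library decides internally whether to use P2P/GPUDirect or to stage through host memory.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank);          // simplification: one GPU per MPI rank

    const int n = 1 << 20;
    float *dbuf;                  // device buffer, never copied to host explicitly
    cudaMalloc(&dbuf, n * sizeof(float));

    // With a CUDA-aware MPI, device pointers can be passed directly;
    // the library picks P2P/NVLink or host staging on its own.
    if (rank == 0)
        MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```

A non-CUDA-aware MPI given the same device pointers would either fail or silently fall back to host staging, which is one way such comparisons can end up showing no P2P traffic at all.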

If you do a quick literature search with Google Scholar, you will find other real-life applications that make use of GPU P2P: I recall seeing an oil & gas application and a molecular dynamics application, for example. This is how I found the paper I mentioned earlier in just a couple of minutes of perusing search results.