NCCL and D2D data movement across GPU devices

I'm evaluating NCCL but am confused about how it moves data (AllReduce, etc.) across GPUs; the profile doesn't seem to match what I expected.
Here's my setup:

  • 4 x P100: the first two GPUs are connected via a PCIe switch to socket 0; the other two are on a different PCIe switch on socket 1.
  • CUDA 8, cuDNN 6
  • NCCL 1.0 for the micro-benchmarks, NCCL 2.0 in TensorFlow (1.3).

My quick questions:
I ran and profiled the NCCL (1.0) built-in benchmarks (e.g., all_reduce_test, broadcast_test), but I didn't find any memcpy DtoD or P2P entries in the nvprof output. Does NCCL use some new internal CUDA driver API to send/receive data? And given a complex third-party app, if I want to profile the traffic across devices within a single node, are there any tools or recommended nvprof parameters for that purpose?

Similarly, in the TensorFlow 1.3 + NCCL 2 case (VGG16, for example, using 2 or 4 GPUs), it looks like I can't correctly profile or collect the data-movement stats. Or am I missing something?

Please help and advise. Thanks.

Frank

Below is the nvprof output for the NCCL all-reduce test on 2 GPUs. No D2D/P2P transfers appear, so is there a new driver API that moves the data? How can I profile it?

/# nvprof -o %p --profile-all-processes --unified-memory-profiling per-process-device --print-gpu-trace
======== Profiling all processes launched by user "root"
======== Type "Ctrl-c" to exit
==38973== NVPROF is profiling process 38973, command: ./all_reduce_test 1000000 2
==38973== Profiling application: ./all_reduce_test 1000000 2
==38973== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
588.92ms 1.1200us - - - - - 2.0625MB 1798.4GB/s Tesla P100-PCIE 1 7 [CUDA memset]
954.40ms 1.1200us - - - - - 2.0625MB 1798.4GB/s Tesla P100-PCIE 2 15 [CUDA memset]
957.67ms 1.8240us - - - - - 8B 4.1828MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
957.68ms 1.2480us - - - - - 208B 158.95MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
957.96ms 1.5360us - - - - - 8B 4.9671MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
957.97ms 1.1840us - - - - - 208B 167.54MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
957.99ms 1.2160us - - - - - 216B 169.40MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
958.00ms 1.1840us - - - - - 8B 6.4437MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
958.02ms 1.2160us - - - - - 216B 169.40MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
958.03ms 1.1840us - - - - - 8B 6.4437MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
960.95ms 1.1200us - - - - - 976.56KB 831.54GB/s Tesla P100-PCIE 1 7 [CUDA memset]
978.61ms 1.8240us - - - - - 800B 418.28MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
978.62ms 1.3120us - - - - - 800B 581.51MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
978.63ms 1.3440us - - - - - 800B 567.66MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
978.64ms 2.6880us - - - - - 12.500KB 4.4349GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
978.65ms 2.7200us - - - - - 12.500KB 4.3827GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
978.66ms 2.7200us - - - - - 12.500KB 4.3827GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
978.67ms 1.2160us - - - - - 4B 3.1371MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
979.43ms 23.391us - - - - - 257.50KB 10.499GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
979.58ms 23.264us - - - - - 257.50KB 10.556GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
979.64ms 27.552us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 1 7 void gen_mtgp<curandStateMtgp32, float, int, operator&(float curand_uniform_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, float*, unsigned long, unsigned long, int) [836]
979.67ms 3.7440us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 1 7 void gen_mtgp<curandStateMtgp32, unsigned int, int, operator&(unsigned int curand_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, unsigned int*, unsigned long, unsigned long, int) [843]
979.67ms 1.8880us (64 1 1) (256 1 1) 8 0B 0B - - Tesla P100-PCIE 1 7 void cpy_mtgp<float, int, operator&(float curand_uniform_cpy_noargs(float, int))>(float*, unsigned int*, unsigned long, unsigned long, int) [850]
980.28ms 77.376us - - - - - 976.56KB 12.036GB/s Tesla P100-PCIE 1 7 [CUDA memcpy DtoH]
980.87ms 1.0880us - - - - - 976.56KB 856.00GB/s Tesla P100-PCIE 2 15 [CUDA memset]
993.83ms 1.4400us - - - - - 800B 529.82MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
993.84ms 1.2800us - - - - - 800B 596.05MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
993.85ms 1.2800us - - - - - 800B 596.05MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
993.86ms 2.6560us - - - - - 12.500KB 4.4883GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
993.87ms 2.6560us - - - - - 12.500KB 4.4883GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
993.88ms 2.6560us - - - - - 12.500KB 4.4883GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
993.89ms 1.1840us - - - - - 4B 3.2219MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
994.72ms 23.424us - - - - - 257.50KB 10.484GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
994.85ms 23.327us - - - - - 257.50KB 10.527GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
994.90ms 27.552us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 2 15 void gen_mtgp<curandStateMtgp32, float, int, operator&(float curand_uniform_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, float*, unsigned long, unsigned long, int) [886]
994.93ms 3.7120us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 2 15 void gen_mtgp<curandStateMtgp32, unsigned int, int, operator&(unsigned int curand_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, unsigned int*, unsigned long, unsigned long, int) [893]
994.94ms 1.8240us (64 1 1) (256 1 1) 8 0B 0B - - Tesla P100-PCIE 2 15 void cpy_mtgp<float, int, operator&(float curand_uniform_cpy_noargs(float, int))>(float*, unsigned int*, unsigned long, unsigned long, int) [900]
996.36ms 127.49us (256 1 1) (256 1 1) 10 0B 0B - - Tesla P100-PCIE 2 15 void accumKern<float, int=0>(float*, float const *, int) [913]
997.01ms 126.37us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 22 void AllReduceKernel<int=512, int=8, FuncSum, float>(KernelArgs<FuncSum>) [919]
997.03ms 108.93us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 23 void AllReduceKernel<int=512, int=8, FuncSum, float>(KernelArgs<FuncSum>) [925]
997.17ms 115.07us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 22 void AllReduceKernel<int=512, int=8, FuncSum, float>(KernelArgs<FuncSum>) [934]
997.18ms 106.50us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 23 void AllReduceKernel<int=512, int=8, FuncSum, float>(KernelArgs<FuncSum>) [939]
998.50ms 530.52us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 1 7 void deltaKern<float, int=512>(float const *, float const *, int, double*) [955]
1.00096s 540.41us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 2 15 void deltaKern<float, int=512>(float const *, float const *, int, double*) [968]
1.00225s 117.60us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 22 void AllReduceKernel<int=512, int=8, FuncSum, float>(KernelArgs<FuncSum>) [974]
1.00226s 107.78us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 23 void AllReduceKernel<int=512, int=8, FuncSum, float>(KernelArgs<FuncSum>) [979]
1.00358s 528.79us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 1 7 void deltaKern<float, int=512>(float const *, float const *, int, double*) [995]
1.00601s 538.04us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 2 15 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1008]
1.00740s 1.3440us - - - - - 976.56KB 692.95GB/s Tesla P100-PCIE 1 7 [CUDA memset]
1.00863s 1.6640us - - - - - 800B 458.50MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.00864s 1.3120us - - - - - 800B 581.51MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.00865s 1.3120us - - - - - 800B 581.51MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.00866s 2.7200us - - - - - 12.500KB 4.3827GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.00867s 2.7200us - - - - - 12.500KB 4.3827GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.00869s 2.7200us - - - - - 12.500KB 4.3827GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.00870s 1.2160us - - - - - 4B 3.1371MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.00936s 23.776us - - - - - 257.50KB 10.329GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.00949s 23.616us - - - - - 257.50KB 10.399GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.00954s 27.039us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 1 7 void gen_mtgp<curandStateMtgp32, float, int, operator&(float curand_uniform_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, float*, unsigned long, unsigned long, int) [1043]
1.00957s 3.8080us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 1 7 void gen_mtgp<curandStateMtgp32, unsigned int, int, operator&(unsigned int curand_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, unsigned int*, unsigned long, unsigned long, int) [1050]
1.00958s 1.5040us (64 1 1) (256 1 1) 8 0B 0B - - Tesla P100-PCIE 1 7 void cpy_mtgp<float, int, operator&(float curand_uniform_cpy_noargs(float, int))>(float*, unsigned int*, unsigned long, unsigned long, int) [1057]
1.01012s 77.631us - - - - - 976.56KB 11.997GB/s Tesla P100-PCIE 1 7 [CUDA memcpy DtoH]
1.01032s 1.2800us - - - - - 976.56KB 727.60GB/s Tesla P100-PCIE 2 15 [CUDA memset]
1.01154s 1.4720us - - - - - 800B 518.30MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.01155s 1.2800us - - - - - 800B 596.05MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.01156s 1.2800us - - - - - 800B 596.05MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.01157s 2.6880us - - - - - 12.500KB 4.4349GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.01158s 2.6880us - - - - - 12.500KB 4.4349GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.01159s 2.6880us - - - - - 12.500KB 4.4349GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.01160s 1.2160us - - - - - 4B 3.1371MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.01227s 23.712us - - - - - 257.50KB 10.356GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.01240s 23.616us - - - - - 257.50KB 10.399GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.01245s 27.263us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 2 15 void gen_mtgp<curandStateMtgp32, float, int, operator&(float curand_uniform_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, float*, unsigned long, unsigned long, int) [1093]
1.01248s 3.9680us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 2 15 void gen_mtgp<curandStateMtgp32, unsigned int, int, operator&(unsigned int curand_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, unsigned int*, unsigned long, unsigned long, int) [1100]
1.01249s 1.7600us (64 1 1) (256 1 1) 8 0B 0B - - Tesla P100-PCIE 2 15 void cpy_mtgp<float, int, operator&(float curand_uniform_cpy_noargs(float, int))>(float*, unsigned int*, unsigned long, unsigned long, int) [1107]
1.01384s 123.78us (256 1 1) (256 1 1) 10 0B 0B - - Tesla P100-PCIE 2 15 void accumKern<float, int=1>(float*, float const *, int) [1120]
1.01448s 122.27us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 24 void AllReduceKernel<int=512, int=8, FuncProd, float>(KernelArgs<FuncProd>) [1126]
1.01450s 108.83us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 25 void AllReduceKernel<int=512, int=8, FuncProd, float>(KernelArgs<FuncProd>) [1132]
1.01463s 114.53us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 24 void AllReduceKernel<int=512, int=8, FuncProd, float>(KernelArgs<FuncProd>) [1141]
1.01463s 107.52us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 25 void AllReduceKernel<int=512, int=8, FuncProd, float>(KernelArgs<FuncProd>) [1146]
1.01593s 529.18us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 1 7 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1162]
1.01835s 537.66us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 2 15 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1175]
1.01965s 117.63us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 24 void AllReduceKernel<int=512, int=8, FuncProd, float>(KernelArgs<FuncProd>) [1181]
1.01966s 107.74us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 25 void AllReduceKernel<int=512, int=8, FuncProd, float>(KernelArgs<FuncProd>) [1186]
1.02096s 526.52us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 1 7 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1202]
1.02340s 537.15us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 2 15 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1215]
1.02477s 1.3120us - - - - - 976.56KB 709.85GB/s Tesla P100-PCIE 1 7 [CUDA memset]
1.02600s 1.6640us - - - - - 800B 458.50MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.02601s 1.3120us - - - - - 800B 581.51MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.02602s 1.3440us - - - - - 800B 567.66MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.02603s 2.7200us - - - - - 12.500KB 4.3827GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.02604s 2.7200us - - - - - 12.500KB 4.3827GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.02605s 2.6880us - - - - - 12.500KB 4.4349GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.02606s 1.2160us - - - - - 4B 3.1371MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.02674s 23.775us - - - - - 257.50KB 10.329GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.02688s 23.744us - - - - - 257.50KB 10.342GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.02693s 26.816us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 1 7 void gen_mtgp<curandStateMtgp32, float, int, operator&(float curand_uniform_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, float*, unsigned long, unsigned long, int) [1250]
1.02695s 3.8080us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 1 7 void gen_mtgp<curandStateMtgp32, unsigned int, int, operator&(unsigned int curand_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, unsigned int*, unsigned long, unsigned long, int) [1257]
1.02696s 1.5360us (64 1 1) (256 1 1) 8 0B 0B - - Tesla P100-PCIE 1 7 void cpy_mtgp<float, int, operator&(float curand_uniform_cpy_noargs(float, int))>(float*, unsigned int*, unsigned long, unsigned long, int) [1264]
1.02751s 77.567us - - - - - 976.56KB 12.007GB/s Tesla P100-PCIE 1 7 [CUDA memcpy DtoH]
1.02771s 1.2800us - - - - - 976.56KB 727.60GB/s Tesla P100-PCIE 2 15 [CUDA memset]
1.02891s 1.4720us - - - - - 800B 518.30MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.02892s 1.2800us - - - - - 800B 596.05MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.02893s 1.2800us - - - - - 800B 596.05MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.02894s 2.6560us - - - - - 12.500KB 4.4883GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.02895s 2.6560us - - - - - 12.500KB 4.4883GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.02896s 2.6560us - - - - - 12.500KB 4.4883GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.02897s 1.1840us - - - - - 4B 3.2219MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.02963s 23.871us - - - - - 257.50KB 10.287GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.02977s 23.648us - - - - - 257.50KB 10.384GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.02982s 27.488us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 2 15 void gen_mtgp<curandStateMtgp32, float, int, operator&(float curand_uniform_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, float*, unsigned long, unsigned long, int) [1300]
1.02985s 3.8080us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 2 15 void gen_mtgp<curandStateMtgp32, unsigned int, int, operator&(unsigned int curand_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, unsigned int*, unsigned long, unsigned long, int) [1307]
1.02985s 2.1120us (64 1 1) (256 1 1) 8 0B 0B - - Tesla P100-PCIE 2 15 void cpy_mtgp<float, int, operator&(float curand_uniform_cpy_noargs(float, int))>(float*, unsigned int*, unsigned long, unsigned long, int) [1314]
1.03120s 123.74us (256 1 1) (256 1 1) 10 0B 0B - - Tesla P100-PCIE 2 15 void accumKern<float, int=2>(float*, float const *, int) [1327]
1.03184s 124.70us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 26 void AllReduceKernel<int=512, int=8, FuncMax, float>(KernelArgs<FuncMax>) [1333]
1.03185s 109.66us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 27 void AllReduceKernel<int=512, int=8, FuncMax, float>(KernelArgs<FuncMax>) [1339]
1.03198s 114.62us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 26 void AllReduceKernel<int=512, int=8, FuncMax, float>(KernelArgs<FuncMax>) [1348]
1.03199s 107.62us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 27 void AllReduceKernel<int=512, int=8, FuncMax, float>(KernelArgs<FuncMax>) [1353]
1.03331s 527.10us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 1 7 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1369]
1.03573s 535.32us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 2 15 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1382]
1.03700s 117.50us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 26 void AllReduceKernel<int=512, int=8, FuncMax, float>(KernelArgs<FuncMax>) [1388]
1.03702s 107.39us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 27 void AllReduceKernel<int=512, int=8, FuncMax, float>(KernelArgs<FuncMax>) [1393]
1.03830s 526.43us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 1 7 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1409]
1.04076s 537.53us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 2 15 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1422]
1.04213s 1.3120us - - - - - 976.56KB 709.85GB/s Tesla P100-PCIE 1 7 [CUDA memset]
1.04341s 1.6960us - - - - - 800B 449.85MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.04342s 1.3440us - - - - - 800B 567.66MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.04343s 1.3120us - - - - - 800B 581.51MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.04344s 2.7200us - - - - - 12.500KB 4.3827GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.04345s 2.7200us - - - - - 12.500KB 4.3827GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.04346s 2.6880us - - - - - 12.500KB 4.4349GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.04348s 1.2160us - - - - - 4B 3.1371MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.04415s 23.679us - - - - - 257.50KB 10.371GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.04428s 23.871us - - - - - 257.50KB 10.287GB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
1.04433s 26.880us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 1 7 void gen_mtgp<curandStateMtgp32, float, int, operator&(float curand_uniform_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, float*, unsigned long, unsigned long, int) [1457]
1.04436s 3.9680us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 1 7 void gen_mtgp<curandStateMtgp32, unsigned int, int, operator&(unsigned int curand_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, unsigned int*, unsigned long, unsigned long, int) [1464]
1.04437s 1.5040us (64 1 1) (256 1 1) 8 0B 0B - - Tesla P100-PCIE 1 7 void cpy_mtgp<float, int, operator&(float curand_uniform_cpy_noargs(float, int))>(float*, unsigned int*, unsigned long, unsigned long, int) [1471]
1.04491s 78.080us - - - - - 976.56KB 11.928GB/s Tesla P100-PCIE 1 7 [CUDA memcpy DtoH]
1.04510s 1.2480us - - - - - 976.56KB 746.25GB/s Tesla P100-PCIE 2 15 [CUDA memset]
1.04633s 1.4720us - - - - - 800B 518.30MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.04634s 1.3120us - - - - - 800B 581.51MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.04635s 1.2800us - - - - - 800B 596.05MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.04636s 2.6240us - - - - - 12.500KB 4.5430GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.04637s 2.6550us - - - - - 12.500KB 4.4900GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.04638s 2.6560us - - - - - 12.500KB 4.4883GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.04639s 1.1840us - - - - - 4B 3.2219MB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.04705s 23.680us - - - - - 257.50KB 10.370GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.04719s 23.552us - - - - - 257.50KB 10.427GB/s Tesla P100-PCIE 2 15 [CUDA memcpy HtoD]
1.04724s 27.551us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 2 15 void gen_mtgp<curandStateMtgp32, float, int, operator&(float curand_uniform_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, float*, unsigned long, unsigned long, int) [1507]
1.04727s 3.7120us (64 1 1) (256 1 1) 26 0B 0B - - Tesla P100-PCIE 2 15 void gen_mtgp<curandStateMtgp32, unsigned int, int, operator&(unsigned int curand_noargs(curandStateMtgp32*, int))>(curandStateMtgp32*, unsigned int*, unsigned long, unsigned long, int) [1514]
1.04727s 1.6000us (64 1 1) (256 1 1) 8 0B 0B - - Tesla P100-PCIE 2 15 void cpy_mtgp<float, int, operator&(float curand_uniform_cpy_noargs(float, int))>(float*, unsigned int*, unsigned long, unsigned long, int) [1521]
1.04861s 126.88us (256 1 1) (256 1 1) 10 0B 0B - - Tesla P100-PCIE 2 15 void accumKern<float, int=3>(float*, float const *, int) [1534]
1.04925s 124.90us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 28 void AllReduceKernel<int=512, int=8, FuncMin, float>(KernelArgs<FuncMin>) [1539]
1.04926s 108.61us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 29 void AllReduceKernel<int=512, int=8, FuncMin, float>(KernelArgs<FuncMin>) [1545]
1.04939s 115.20us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 28 void AllReduceKernel<int=512, int=8, FuncMin, float>(KernelArgs<FuncMin>) [1554]
1.04940s 107.71us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 29 void AllReduceKernel<int=512, int=8, FuncMin, float>(KernelArgs<FuncMin>) [1559]
1.05070s 528.35us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 1 7 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1575]
1.05314s 536.70us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 2 15 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1588]
1.05443s 118.40us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 28 void AllReduceKernel<int=512, int=8, FuncMin, float>(KernelArgs<FuncMin>) [1594]
1.05444s 107.90us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 29 void AllReduceKernel<int=512, int=8, FuncMin, float>(KernelArgs<FuncMin>) [1599]
1.05572s 527.13us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 1 7 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1615]
1.05813s 538.27us (1 1 1) (512 1 1) 11 4.0000KB 0B - - Tesla P100-PCIE 2 15 void deltaKern<float, int=512>(float const *, float const *, int, double*) [1628]
1.05950s 969ns - - - - - 8B 7.8735MB/s Tesla P100-PCIE 1 7 [CUDA memset]
1.05951s 477ns - - - - - 208B 415.86MB/s Tesla P100-PCIE 1 7 [CUDA memset]
1.06068s 716ns - - - - - 8B 10.656MB/s Tesla P100-PCIE 2 15 [CUDA memset]
1.06068s 325ns - - - - - 208B 610.35MB/s Tesla P100-PCIE 2 15 [CUDA memset]
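
To make a trace like the one above easier to read, here is a minimal Python sketch that tallies the rows by name: how many times each memcpy/memset kind or kernel appears, the total duration, and the total bytes for the copy rows. It is only an illustration, not part of nvprof or NCCL. The helper names (`summarize`, `ROW`, `SIZE`, `SAMPLE`) are made up for this example, the regular expressions assume the exact column layout printed above, and the size units are assumed to be binary (1 KB = 1024 bytes). `SAMPLE` holds a few rows copied from the trace.

```python
import re
from collections import defaultdict

# One row of the GPU trace, assuming the column layout shown above:
# Start, Duration, <launch/size columns and device name>, Context, Stream, Name.
ROW = re.compile(
    r"^(?P<start>\S+)\s+(?P<dur>[\d.]+)(?P<unit>ns|us|ms|s)\s+"
    r"(?P<middle>.*?)\s+(?P<ctx>\d+)\s+(?P<stream>\d+)\s+"
    r"(?P<name>\[CUDA [^\]]+\]|void .+)$"
)
# A size followed by a throughput only appears on memcpy/memset rows.
SIZE = re.compile(r"([\d.]+)(GB|MB|KB|B)\s+[\d.]+(?:GB|MB|KB|B)/s")

TIME_UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}
# Assumption: sizes are printed in binary units (1 KB = 1024 bytes).
SIZE_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def summarize(lines):
    """Group trace rows by name; return {name: {count, seconds, bytes}}."""
    totals = defaultdict(lambda: {"count": 0, "seconds": 0.0, "bytes": 0.0})
    for line in lines:
        row = ROW.match(line.strip())
        if row is None:
            continue  # header, banner, or a line in an unexpected layout
        name = row["name"]
        if name.startswith("void "):
            # Keep only the bare kernel name, e.g. "AllReduceKernel".
            short = re.match(r"void (\w+)", name)
            name = short.group(1) if short else name
        entry = totals[name]
        entry["count"] += 1
        entry["seconds"] += float(row["dur"]) * TIME_UNITS[row["unit"]]
        size = SIZE.search(row["middle"])
        if size:
            entry["bytes"] += float(size.group(1)) * SIZE_UNITS[size.group(2)]
    return dict(totals)


# A few rows copied from the trace above.
SAMPLE = """\
588.92ms 1.1200us - - - - - 2.0625MB 1798.4GB/s Tesla P100-PCIE 1 7 [CUDA memset]
957.67ms 1.8240us - - - - - 8B 4.1828MB/s Tesla P100-PCIE 1 7 [CUDA memcpy HtoD]
980.28ms 77.376us - - - - - 976.56KB 12.036GB/s Tesla P100-PCIE 1 7 [CUDA memcpy DtoH]
997.01ms 126.37us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 1 22 void AllReduceKernel<int=512, int=8, FuncSum, float>(KernelArgs<FuncSum>) [919]
997.03ms 108.93us (1 1 1) (513 1 1) 96 216B 0B - - Tesla P100-PCIE 2 23 void AllReduceKernel<int=512, int=8, FuncSum, float>(KernelArgs<FuncSum>) [925]
"""

if __name__ == "__main__":
    for name, entry in sorted(summarize(SAMPLE.splitlines()).items()):
        print(f"{name:22s} n={entry['count']:<3d} "
              f"time={entry['seconds'] * 1e6:9.2f}us bytes={entry['bytes']:.0f}")
```

To run it on a full trace, save the nvprof text output to a file and pass the open file object to `summarize` in place of `SAMPLE.splitlines()`. The sketch only counts what nvprof prints as explicit rows, so it confirms whether any DtoD or P2P memcpy entries exist in the trace but says nothing about what happens inside a kernel such as AllReduceKernel.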