nccl - can we sum up all the values of an array on 1 device GPU to obtain the sum

I have a single GPU (e.g. a GeForce GTX 980 Ti). I have a single float array of length 128, cudaMalloc’ed (allocated on that single device GPU), with all values set to 1.f. I want to use NCCL to sum the elements and obtain 128, i.e. (1+1+…+1)=128.

However, I read in the NCCL developer documentation that the reduction is only across devices, NOT within a single device, if I have interpreted it correctly:

cf. http://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html#axzz4rabuBrOP

From there (quoting),
“AllReduce starts with independent arrays Vk of N values on each of K ranks and ends with identical arrays S of N values, where S[i] = V0[i] + V1[i] + … + V(k−1)[i], for each rank k.”
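If I read that correctly, on a single GPU there is only K = 1 rank, so the formula degenerates to S[i] = V0[i] for every i; the AllReduce would just copy the input array to the output instead of summing the 128 elements together.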

I want to confirm that I cannot do a reduction (summation) of the elements of an array on a single device GPU.

My full code (and how to compile) is here:

https://github.com/ernestyalumni/CompPhys/blob/master/moreCUDA/nccl/Ex01_singleprocess_b.cu

The “meat” of the code is below; the “prep” (declarations) before it should be correct:

ncclCommCount(*comm.get(), &count);

ncclAllReduce(d_in.get(), d_out.get(), size, ncclFloat, ncclSum, *comm.get(), *stream.get());

I had “wrapped” my pointers in C++11 smart pointers, but I have tried my code with raw pointers as well with the same result; I can post that version if you’d like.
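For reference, here is a minimal raw-pointer sketch of the same experiment (a simplified stand-in for the linked file, not an exact copy; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    const int size = 128;
    int dev = 0;

    // A communicator containing a single rank, on device 0.
    ncclComm_t comm;
    ncclCommInitAll(&comm, 1, &dev);

    // 128 values, all 1.f, copied onto the single GPU.
    float h_buf[size];
    for (int i = 0; i < size; ++i) { h_buf[i] = 1.f; }
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, size * sizeof(float));
    cudaMalloc(&d_out, size * sizeof(float));
    cudaMemcpy(d_in, h_buf, size * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // AllReduce sums corresponding elements across ranks; with a single rank,
    // d_out[i] simply ends up equal to d_in[i].
    ncclAllReduce(d_in, d_out, size, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);

    cudaMemcpy(h_buf, d_out, size * sizeof(float), cudaMemcpyDeviceToHost);
    printf("d_out[0] = %f\n", h_buf[0]);   // 1.000000, not the hoped-for 128

    cudaFree(d_in); cudaFree(d_out);
    cudaStreamDestroy(stream);
    ncclCommDestroy(comm);
    return 0;
}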

Please confirm that I cannot use NCCL to do a parallel reduction over a single array on a single device GPU, or show me how I can. Thanks!

Not possible. NCCL reductions combine corresponding elements across ranks (devices); they do not sum the elements within a single array on one device.
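If the goal is just to add up the 128 values sitting on one GPU, that is an ordinary single-device reduction rather than a collective, so a reduction primitive from a library such as Thrust handles it in one call. A minimal sketch, assuming Thrust is an acceptable alternative (it is not part of NCCL):

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int main() {
    // 128 values, all 1.f, resident on the single GPU.
    thrust::device_vector<float> d_in(128, 1.f);

    // Sum the elements on the device; the scalar result is returned to the host.
    float sum = thrust::reduce(d_in.begin(), d_in.end(), 0.f, thrust::plus<float>());

    printf("sum = %f\n", sum);   // 128.000000
    return 0;
}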

cross posting:

https://stackoverflow.com/questions/46028541/nccl-can-we-sum-up-all-the-values-of-an-array-on-1-device-gpu-to-obtain-the-su