I have a single GPU (e.g. a GeForce GTX 980 Ti). On that device I have one float array of length 128, allocated with cudaMalloc, with every element set to 1.f. I want to use NCCL to sum its elements and obtain 128, i.e. (1+1+…+1)=128.
However, if I read the NCCL developer documentation correctly, a reduction is performed only across devices (ranks), NOT across the elements of an array on a single device. Quoting the documentation:
“AllReduce starts with independent arrays Vk of N values on each of K ranks and ends with identical arrays S of N values, where S[i] = V0[i] + V1[i] + … + Vk-1[i], for each rank k.”
I want to confirm that, on a single GPU, I cannot use it to reduce (sum) the elements of one array that lives on that device.
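To make my reading of that definition concrete, here is a minimal host-only sketch in plain C++ (no NCCL or CUDA calls). The helper name simulate_all_reduce_sum is hypothetical, something I made up for illustration, not an NCCL function; it only mimics the formula S[i] = V0[i] + V1[i] + … as I understand it, and it assumes every rank holds an array of the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (NOT part of NCCL): mimics the AllReduce sum definition
// on the host. ranks[k] plays the role of V_k; the returned vector is S.
// Assumption: every rank holds an array of the same length N.
std::vector<float> simulate_all_reduce_sum(
    const std::vector<std::vector<float>>& ranks) {
    if (ranks.empty()) {
        return {};
    }
    const std::size_t n = ranks[0].size();
    std::vector<float> s(n, 0.0f);
    for (const auto& v : ranks) {
        assert(v.size() == n);  // all ranks must contribute N values
        for (std::size_t i = 0; i < n; ++i) {
            s[i] += v[i];  // element i is summed across ranks, never across i
        }
    }
    return s;
}
```

If this reading is right, then with K = 1 rank holding 128 ones, the result is again 128 ones (S[i] = V0[i]), not a single value of 128. With two ranks each holding 128 ones, every element of S would be 2.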
My full code (and how to compile) is here:
The “meat” of the code is below; the “prep” before it (declarations) should be correct:
ncclCommCount(*comm.get(), &count);
ncclAllReduce(d_in.get(), d_out.get(), size, ncclFloat, ncclSum, *comm.get(), *stream.get());
I “wrapped” my pointers in C++11 smart pointers, but I also tried the code with raw pointers and got the same result; I can post that version if you’d like.
Please confirm that I cannot use NCCL to do a parallel reduction over a single array on a single GPU, or show me how I can. Thanks!