NCCL 2.0 support for other Linux OS.

Hi staff,

I have the GitHub version of NCCL 1.x integrated into my application.
I am interested in moving to NCCL 2.0 (the GitHub version is no longer maintained/supported).
I see from the download area that the Ubuntu platform is supported.
Do you have any plans to support other Linux distributions?
I’m interested in CentOS.

Thanks,
Franco

Hi Franco,

You may extract deb archives using the “ar” tool.

Below is a script that extracts the NCCL debs into a directory. You can use it this way:

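$ ./script nccl-2.0.4-1+cuda8.0 libnccl2_2.0.4-1+cuda8.0_amd64.deb libnccl-dev_2.0.4-1+cuda8.0_amd64.deb

#!/bin/bash
# Extract NCCL debs into a directory
if [ "$2" == "" ]; then echo "Usage : $0 <dir> <nccl debs>" ; exit 1; fi

# Resolve the target directory and every deb to an absolute path,
# because the script changes into a temporary directory below.
DIR=$(realpath "$1"); shift
DEBS=""
# Append each deb to the list (paths containing spaces are not supported)
while [ "$1" != "" ]; do DEBS="$DEBS $(realpath "$1")" ; shift; done

# Work in a scratch directory; stop if it cannot be created
mkdir temp && cd temp || exit 1
for deb in $DEBS; do
  # A deb is an "ar" archive; the payload is in data.tar.xz
  ar x "$deb" && tar xf data.tar.xz
  rm data.tar.xz control.tar.gz debian-binary
done

# Move headers, docs and libraries into the target directory
mkdir -p "$DIR"
mv usr/include usr/share "$DIR"
mv usr/lib/x86_64-linux-gnu "$DIR/lib"

cd .. && rm -Rf temp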

Hi NCCL team,

Thanks for pointing me to the script. Let me try it!

I had to tweak the script a little, since “realpath” is not available natively on my installation.
That said, I unpacked nccl-repo-ubuntu1404-2.0.4-ga_2.0.4-1_amd64.deb, downloaded from the NVIDIA site.
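
For systems where “realpath” is missing, a small shell function can stand in for it inside the script. This is only a sketch: the name abspath is hypothetical, and unlike realpath it does not resolve a symlink in the last path component. It assumes the parent directory of the path already exists.

```shell
#!/bin/bash
# abspath: hypothetical stand-in for `realpath` on systems that lack it.
# Prints an absolute path for an existing directory, an existing file,
# or a not-yet-existing name inside an existing directory.
abspath() {
  local target="$1" parent
  if [ -d "$target" ]; then
    # Existing directory: enter it in a subshell and print its physical path.
    (cd "$target" && pwd -P)
  else
    # File or new name: resolve the parent directory, then re-append the name.
    parent=$(cd "$(dirname -- "$target")" && pwd -P) || return 1
    # ${parent%/} avoids a double slash when the parent is "/".
    printf '%s/%s\n' "${parent%/}" "$(basename -- "$target")"
  fi
}
```

In the extraction script, each call such as `realpath "$1"` could then be replaced by `abspath "$1"`.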

After extraction, I cannot find any lib/x86_64-linux-gnu directory or any include files.

What’s wrong?
How can I get generic x86_64 libraries and include files?
After commenting out the lines of the script that remove the temp folder, I see the following.

./
./usr/
./usr/share/
./usr/share/doc/
./usr/share/doc/nccl-repo-ubuntu1404-2.0.4-ga/
./usr/share/doc/nccl-repo-ubuntu1404-2.0.4-ga/changelog.Debian.gz
./var/
./var/nccl-repo-2.0.4-ga/
./var/nccl-repo-2.0.4-ga/Release.gpg
./var/nccl-repo-2.0.4-ga/7fa2af80.pub
./var/nccl-repo-2.0.4-ga/libnccl2_2.0.4-1+cuda8.0_amd64.deb
./var/nccl-repo-2.0.4-ga/libnccl-dev_2.0.4-1+cuda8.0_amd64.deb
./var/nccl-repo-2.0.4-ga/Release
./var/nccl-repo-2.0.4-ga/Packages.gz
./etc/
./etc/apt/
./etc/apt/sources.list.d/
./etc/apt/sources.list.d/nccl-2.0.4-ga.list

Sorry, that’s right: the debs you download are deb repositories. So you should first extract the repository deb:

ar x nccl-repo-ubuntu1404-2.0.4-ga_2.0.4-1_amd64.deb
tar xvf data.tar.xz
rm data.tar.xz control.tar.gz debian-binary
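
After this step the library debs should sit somewhere under var/, as in the listing posted earlier. A minimal sketch for locating them without typing the full paths is shown here; the helper name find_nccl_debs is made up for illustration, and it assumes the debs keep the libnccl* prefix seen in that listing.

```shell
#!/bin/bash
# find_nccl_debs: hypothetical helper that lists the libnccl*.deb files
# found below a directory (default: current directory), one per line, sorted.
find_nccl_debs() {
  find "${1:-.}" -type f -name 'libnccl*.deb' | sort
}

# Example usage (run from the directory where the repository deb was extracted):
#   find_nccl_debs var
```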

Then use the script to extract the two debs in var/nccl-repo-2.0.4-ga into nccl-2.0.4-1+cuda8.0:

./script nccl-2.0.4-1+cuda8.0 var/nccl-repo-2.0.4-ga/libnccl2_2.0.4-1+cuda8.0_amd64.deb var/nccl-repo-2.0.4-ga/libnccl-dev_2.0.4-1+cuda8.0_amd64.deb
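
To confirm that the extraction produced a usable tree, a quick check along these lines may help. It is a sketch, not part of the original script: check_nccl_dir is a hypothetical name, and it assumes the layout the script creates (include/nccl.h from libnccl-dev and lib/libnccl.so* from the two debs).

```shell
#!/bin/bash
# check_nccl_dir: hypothetical helper that verifies <dir> contains the
# header and at least one shared library after extraction.
check_nccl_dir() {
  local dir="$1"
  [ -f "$dir/include/nccl.h" ] || { echo "missing $dir/include/nccl.h" >&2; return 1; }
  # compgen -G succeeds only if the glob matches at least one file.
  compgen -G "$dir/lib/libnccl.so*" > /dev/null \
    || { echo "no libnccl.so* in $dir/lib" >&2; return 1; }
  echo "NCCL layout looks complete in $dir"
}

# Example usage (the directory name is the one used in the command above):
#   check_nccl_dir "$PWD/nccl-2.0.4-1+cuda8.0" &&
#     export LD_LIBRARY_PATH="$PWD/nccl-2.0.4-1+cuda8.0/lib:$LD_LIBRARY_PATH"
```

When building the application, the same directory would then be passed to the compiler and linker, for example with -I<dir>/include and -L<dir>/lib -lnccl.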

Ok, now it’s fine.

Meanwhile I started looking at the NCCL 2 documentation.
With NCCL 1 (GitHub version) I am actually hitting the concurrency problem mentioned in the documentation (I have pasted the whole paragraph “Concurrency between NCCL and CUDA calls” below).

Do you have some sample code or a more detailed post where I can look at the proposed workaround?
The problem is really annoying and I need to find THE SOLUTION to it.
Currently I’m using a CPU barrier (in the MPI sense) to protect the entry into the NCCL AllReduce.
But this solution wastes time, since the CPU threads end up tightly synchronized as well.

I look forward to your comments on that,
Franco


Concurrency between NCCL and CUDA calls
NCCL uses CUDA kernels to perform inter-GPU communication. The NCCL kernels synchronize with each other, therefore, each kernel requires other kernels on other GPUs to be also executed in order to complete. The application should therefore make sure that nothing prevents the NCCL kernels from being executed concurrently on the different devices of a NCCL communicator.

For example, let’s say you have a process managing multiple CUDA devices, and, also features a thread which calls CUDA functions asynchronously. In this case, CUDA calls could be executed between the enqueuing of two NCCL kernels. The CUDA call may wait for the first NCCL kernel to complete and prevent the second one from being launched, causing a deadlock since the first kernel will not complete until the second one is executed. To avoid this issue, one solution is to have a lock around the NCCL launch on multiple devices (around ncclGroupStart and ncclGroupEnd when using a single thread, around the NCCL launch when using multiple threads, using thread synchronization if necessary) and take this lock when calling CUDA from the asynchronous thread.

I looked at the download site, but I only see binary support for x86_64 (amd64). Is there any plan to support Power binaries (for agnostic systems)?