NCCL vs. MPI for Distributed DL

I would like to understand the similarities and differences between NCCL and MPI for inter-node communication in distributed deep learning.

Are these competing technologies, or are they used together?
Are both used by the standard frameworks (TensorFlow, PyTorch, etc.)?

I am aware of the Byte Transfer Layer (BTL) in Open-MPI, with implementations for supported hardware (e.g. tcp, openib, shared memory, etc.). Is there something similar in NCCL? How can one tell which hardware/transport it will use to communicate between nodes?
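For context, I assume (though I'm not certain, hence the question) that NCCL's transport selection can be inspected and influenced through environment variables; the names below are taken from NCCL's documentation, and the values are purely illustrative, not a working cluster config:

```python
import os

# Environment variables NCCL is documented to read at initialization.
# Values here are illustrative placeholders, not a tested cluster setup.
nccl_env = {
    "NCCL_DEBUG": "INFO",          # ask NCCL to log which transport it picks
    "NCCL_SOCKET_IFNAME": "eth0",  # restrict the socket transport to one NIC
    "NCCL_IB_DISABLE": "0",        # leave InfiniBand/RoCE enabled if present
}
os.environ.update(nccl_env)

# A job launched with these set (e.g. torch.distributed with backend="nccl")
# would then report its chosen network path in the NCCL INFO log output.
for key, value in nccl_env.items():
    print(f"{key}={value}")
```

Is reading the `NCCL_DEBUG=INFO` log output the intended way to find out which hardware is being used, or is there a BTL-like component model underneath?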

I would appreciate any explanation of how these technologies compare and how they fit into the bigger picture.