The playbook is incomplete. If you set the variables the same way for real workloads, you will take a big latency hit, because NCCL will be using plain Ethernet sockets instead of its InfiniBand (RDMA/RoCE) transport.
To get the most out of the link, set these two in addition to the ones you already set:
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
Use BOTH roceXXX devices that show as Up in ibdev2netdev. You will see NCCL test speeds increase to ~24 GB/s, but more importantly, latency will be way down, and you will see real improvements in real workloads (e.g. vLLM, torch.distributed).
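If you don't want to copy the device names by hand, something like the following might work. It is only a sketch: it assumes ibdev2netdev prints one line per port in the usual "device port N ==> netdev (Up)" form, and the helper name up_hcas is made up for this example. Check the resulting value before relying on it.

```shell
# Sketch: build NCCL_IB_HCA from the ports that ibdev2netdev reports as Up.
# Assumption: each line looks like "rocep1s0f1 port 1 ==> <netdev> (Up)".
# up_hcas is a hypothetical helper name, not part of any tool.
up_hcas() {
    # Read ibdev2netdev-style output on stdin and print the device names
    # of all (Up) ports as one comma-separated list.
    awk '/\(Up\)/ { print $1 }' | paste -sd, -
}

# Only touch the environment if ibdev2netdev is actually installed.
if command -v ibdev2netdev >/dev/null 2>&1; then
    NCCL_IB_HCA="$(ibdev2netdev | up_hcas)"
    export NCCL_IB_HCA
    export NCCL_IB_DISABLE=0
    echo "NCCL_IB_HCA=$NCCL_IB_HCA"
fi
```

Source it in the same shell (or job script) that launches the workload, on every node, so the exports are visible to the NCCL processes.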
EDIT: to make sure it uses InfiniBand, you can set export NCCL_DEBUG=INFO and check in the logs which network transport NCCL reports using (IB rather than Socket).
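A rough way to check a saved log without reading it all is sketched below. It assumes the log contains a line with "Using network IB" or "Using network Socket", which is what NCCL versions I have seen print at INFO level. The exact wording may differ in your version, so treat an "unknown" result as "go read the log". The function name nccl_transport is made up for this example.

```shell
# Sketch: report which network transport an NCCL_DEBUG=INFO log mentions.
# Assumption: NCCL prints "Using network IB" or "Using network Socket";
# the wording may vary between NCCL versions.
# nccl_transport is a hypothetical helper name.
nccl_transport() {
    # Read an NCCL log on stdin and print: ib, socket, or unknown.
    # If both lines appear, the last one seen wins.
    awk '
        /Using network IB/     { seen = "ib" }
        /Using network Socket/ { seen = "socket" }
        END { print (seen == "" ? "unknown" : seen) }'
}
```

For example, capture the output of a run to a file (say nccl.log, a placeholder name) and run nccl_transport < nccl.log. If it prints socket, the two variables above are probably not reaching the NCCL processes.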