NCCL hangs on a node with two A100s

I have a node with two A100 GPUs installed, and it hangs whenever I run NCCL across both A100s at the same time.
Below is a small test script; it gets stuck at torch.cuda.synchronize().
Could you please give me some guidance on how to debug this?

# Test PyTorch NCCL
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # default env:// init; RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT come from the launcher
print("after dist.init_process_group.")
local_rank = dist.get_rank() % torch.cuda.device_count()
print(f"local_rank={local_rank}")
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
print(f"data={data}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)  # sum across all ranks; each element should become world_size
print("before torch.cuda.synchronize")
torch.cuda.synchronize()  ##### stuck here.
print("after torch.cuda.synchronize")
value = data.mean().item()
print(f"value={value}")
world_size = dist.get_world_size()
print(f"world_size={world_size}")
assert value == world_size, f"Expected {world_size}, got {value}"

print("#####PyTorch NCCL is successful!")