Slow parallel cuFileDriverOpen() with 8 A100 GPUs

I am developing a file loader with GPU Direct Storage. Its performance with a single GPU was quite good (cuFileDriverOpen() took ~0.5 s). However, when we ran 8 torch.distributed processes on 8 different local GPUs, they took ~13 s just to call cuFileDriverOpen(), which is too much latency for my situation (transferring a 320 GB file takes only ~12 s).

I noticed that cuFileDriverOpen() finishes quickly if I set CUDA_VISIBLE_DEVICES to a limited number of devices. Unfortunately, I want to run the file loader with NCCL, so I need CUDA_VISIBLE_DEVICES to contain the full set of GPUs even though the loader uses GDS with only one of them. Setting CUDA_VISIBLE_DEVICES just before cuFileDriverOpen() did not help.

So, my question is: is there any way to limit the target devices for cuFileDriverOpen()? If that's not possible, how can I minimize cuFileDriverOpen() latency in this situation?

The cost is mostly in initializing the CUDA devices, and it should be cached after the first call.
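To see whether the CUDA device initialization is what dominates, one way is to force context creation for just this rank's GPU before timing cuFileDriverOpen(). A minimal sketch using the plain cufile.h C API (the `LOCAL_RANK` environment variable is the torch.distributed convention and an assumption here; the Python loader would go through equivalent bindings):

```c
/* Sketch: pre-initialize only this rank's CUDA device, then time
 * cuFileDriverOpen(). Assumes cufile.h / libcufile and a GDS-capable node. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime.h>
#include <cufile.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* One GPU per process, as with torch.distributed: pick it from LOCAL_RANK. */
    const char *lr = getenv("LOCAL_RANK");
    int dev = lr ? atoi(lr) : 0;
    cudaSetDevice(dev);
    cudaFree(0);  /* force CUDA context creation for this device now */

    double t0 = now_sec();
    CUfileError_t st = cuFileDriverOpen();
    double t1 = now_sec();
    if (st.err != CU_FILE_SUCCESS) {
        fprintf(stderr, "cuFileDriverOpen failed: %d\n", st.err);
        return 1;
    }
    printf("cuFileDriverOpen took %.3f s on device %d\n", t1 - t0, dev);

    cuFileDriverClose();
    return 0;
}
```

If the cost really is per-device CUDA initialization, the time reported here should shrink as the visible device set shrinks, matching the CUDA_VISIBLE_DEVICES observation above.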

Does this happen if NCCL is initialized first and cuFileDriverOpen is called afterwards?

My code initialized NCCL after cuFileDriverOpen. If I change it to initialize NCCL first, parallel cuFileDriverOpen finishes slightly faster (~10 seconds).

The TRACE log had no messages during those 10 seconds. It looks like threads started at cufio_core:84 and did something for a while.
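For reference, the TRACE output above comes from the cuFile log, which is controlled by cufile.json. A minimal fragment that enables it might look like the following (the log directory path is just an example; the file layout assumes the standard GDS configuration, and a non-default location can be pointed to via the CUFILE_ENV_PATH_JSON environment variable):

```json
{
    "logging": {
        "dir": "/tmp/cufile_logs",
        "level": "TRACE"
    }
}
```

With this in place, any remaining gap in the TRACE timeline is a sign the time is being spent outside cuFile's own instrumented code paths (e.g. in CUDA device initialization).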