I am developing a file loader with GPU Direct Storage (GDS). Its performance with a single GPU was quite good (cuFileDriverOpen() took ~0.5 sec). However, when we ran 8 torch distributed processes on 8 different local GPUs, they took ~13 sec just for the cuFileDriverOpen() calls, which is too high a latency for my use case (transferring a 320GB file took only ~12 sec).
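For reference, a stripped-down sketch of the per-process open path (this is not the actual loader code; LOCAL_RANK and the timing are just stand-ins to show where the ~13 sec is spent):

```cpp
// Simplified sketch of what each distributed process does.
// LOCAL_RANK is a placeholder for however the process learns its GPU index.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufile.h>

int main() {
    const char *rank_env = std::getenv("LOCAL_RANK");
    int local_rank = rank_env ? std::atoi(rank_env) : 0;
    cudaSetDevice(local_rank);                 // each process binds one local GPU

    auto t0 = std::chrono::steady_clock::now();
    CUfileError_t status = cuFileDriverOpen(); // this is the slow call
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("rank %d: cuFileDriverOpen() err=%d, took %.2f s\n",
                local_rank, status.err, sec);

    cuFileDriverClose();
    return 0;
}
```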
I noticed that cuFileDriverOpen() finishes quickly if I set CUDA_VISIBLE_DEVICES to a limited number of devices. Unfortunately, I want to run the file loader together with NCCL, so I need CUDA_VISIBLE_DEVICES to contain the full set of devices even though the loader uses GDS with only one of the GPUs. Setting CUDA_VISIBLE_DEVICES just before calling cuFileDriverOpen() did not help.
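Roughly what I tried (the device index "3" is just an example; my guess is that CUDA has already been initialized by torch/NCCL in the same process at that point, so the changed environment variable is never picked up):

```cpp
// Attempted workaround: restrict visibility right before opening the driver.
// This did not reduce the latency in my runs.
#include <cstdlib>
#include <cufile.h>

void open_gds_driver_for_one_gpu() {
    // Probably too late: CUDA_VISIBLE_DEVICES is only read at CUDA init time.
    setenv("CUDA_VISIBLE_DEVICES", "3", /*overwrite=*/1);
    CUfileError_t status = cuFileDriverOpen();  // still slow
    (void)status;
}
```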
So, my question is: is there any way to limit the target devices for cuFileDriverOpen()? If that is not possible, how can I minimize the cuFileDriverOpen() latency in this situation?