Performance Inquiry: Near-Equal Time Spent in cuFileRead() and cuFileHandleNVFS() with GDS over NFS/RDMA

Hello GDS team,
I’m reaching out about a performance observation made while using DALI with GDS to load data from NFS over RDMA.
When profiling with Nsight Systems, I noticed that the time spent in cuFileRead() calls is approximately equal to the time spent in cuFileHandleNVFS(). This near 1:1 ratio in execution time suggests that cuFileHandleNVFS() may itself be performing network I/O.
However, I have been unable to find any public documentation or discussion of cuFileHandleNVFS() online. This makes it difficult to tell whether the behavior is expected or whether it points to a configuration issue or a performance bottleneck in our stack.
Could you provide some insight into what cuFileHandleNVFS() does, and whether the observed timing profile is typical? Alternatively, should this investigation be directed to our storage provider?
Thank you for your time and assistance.

Hi there @liuyuqi2001 and welcome to the NVIDIA developer forums.

In general, DALI support is handled through GitHub Issues. But since this also touches on CUDA performance, I moved your post to the relevant CUDA category.

I hope you will find some answers in either place.

Thanks!

Thanks for your work.
Actually, the DALI team suggested I reach out to the GDS team, as this involves a cuFile API within a GDS-related library. Since libcufile isn’t open source, I’m hoping to find some support or direction on the NVIDIA developer forums.

Makes sense, although I am not sure whether anyone from the GDS team is present on the forums.

GDS questions are generally posted in the GPU-Accelerated Libraries forum, and GDS team members respond to questions there from time to time.

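It seems the near-equal time is simply because cuFileRead() calls cuFileHandleNVFS(), so the time reported for cuFileRead() already includes the time spent in cuFileHandleNVFS().
The stack trace shown in nsys-ui confirmed that.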
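
For anyone who finds this thread later, here is a minimal Python sketch of why a nested call produces a near 1:1 ratio. It only illustrates inclusive versus exclusive (self) time. The Call class and all timestamps are hypothetical and were not taken from my trace; the only thing that comes from the profile is that cuFileHandleNVFS() runs inside cuFileRead().

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Call:
    """One profiled call with start/end timestamps (hypothetical helper)."""
    name: str
    start_us: float
    end_us: float
    children: list[Call] = field(default_factory=list)

    @property
    def inclusive_us(self) -> float:
        # Wall time of the call, including everything it calls.
        return self.end_us - self.start_us

    @property
    def exclusive_us(self) -> float:
        # Self time: inclusive time minus time spent in direct children.
        return self.inclusive_us - sum(c.inclusive_us for c in self.children)


# Made-up numbers: the inner call accounts for almost all of the outer call.
inner = Call("cuFileHandleNVFS", start_us=5.0, end_us=995.0)
outer = Call("cuFileRead", start_us=0.0, end_us=1000.0, children=[inner])

ratio = inner.inclusive_us / outer.inclusive_us

print(f"{outer.name}: inclusive={outer.inclusive_us} us, self={outer.exclusive_us} us")
print(f"{inner.name}: inclusive={inner.inclusive_us} us")
print(f"inner/outer inclusive ratio: {ratio:.2f}")
```

With these made-up numbers the two inclusive times are almost identical while the self time of the outer call is small. That is the pattern to check for before concluding that two functions each do the same amount of work.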