Questions About Using NVIDIA DALI and GDS

Hello,
I am comparing performance with and without GPUDirect Storage (GDS), using a slightly modified version of NVIDIA DALI’s ResNet50 example code.
The test environment is a Kubernetes cluster in which I create a Job on a DGX-H100 node, with Vast Data mounted as storage.

For reference, when mounting the storage over RDMA, the only mount option I added was proto=rdma.
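To make the mount setup easier to verify from inside the container, here is a minimal sketch that checks whether a mount point carries the proto=rdma option. It parses text in the /proc/mounts layout; the server name, export path, and mount point in the sample line are hypothetical placeholders, not my actual values.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return None


def uses_rdma(mounts_text, mount_point):
    """True if the mount point is present and was mounted with proto=rdma."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "proto=rdma" in options


# Hypothetical sample entry; on the node, read the text from /proc/mounts instead.
SAMPLE = "vast.example:/export /mnt/vast nfs rw,vers=3,proto=rdma 0 0\n"
print(uses_rdma(SAMPLE, "/mnt/vast"))
```

On a real node the first argument would be the contents of /proc/mounts, for example `open("/proc/mounts").read()`.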

During testing, I observed the following issues and would like to ask for clarification:

  1. When storage is mounted using the TCP protocol:

    • In a container with privileged mode disabled, the Python script runs successfully even when the DALI numpy reader’s device is set to GPU. The cufile.log written to the script’s directory indicates that GDS is running in compatibility mode.
    • In a container with privileged mode enabled, however, the script fails with the error: Assertion failure, file index :0 line :1388
    • What causes this difference between privileged mode being enabled and disabled?
    • I previously raised this issue in the DALI GitHub repository and was advised that this forum might be a better place to ask.
  2. When GDS operates in compatibility mode:

    • What are the performance and functional differences when the DALI numpy reader’s device is set to CPU versus GPU?
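For context on question 2, the comparison I have in mind looks roughly like the sketch below. It is not my exact script: the batch size, thread count, device_id, file filter, and iteration count are illustrative assumptions, and the helper names build_numpy_pipeline and time_reader are made up for this post. The timing is only a rough wall-clock figure, since GPU work may still be in flight when run() returns.

```python
import time


def build_numpy_pipeline(device, file_root, batch_size=8):
    """Build a DALI pipeline whose numpy reader runs on 'cpu' or 'gpu'."""
    if device not in ("cpu", "gpu"):
        raise ValueError("device must be 'cpu' or 'gpu'")
    # Imported lazily so the helper can be defined on machines without DALI.
    from nvidia.dali import fn, pipeline_def

    @pipeline_def(batch_size=batch_size, num_threads=4, device_id=0)
    def numpy_pipeline():
        # device="gpu" is the variant that goes through cuFile (GDS or its
        # compatibility mode); device="cpu" reads through the host.
        return fn.readers.numpy(
            device=device, file_root=file_root, file_filter="*.npy"
        )

    pipe = numpy_pipeline()
    pipe.build()
    return pipe


def time_reader(device, file_root, iterations=50):
    """Return the wall-clock seconds taken by `iterations` pipeline runs."""
    pipe = build_numpy_pipeline(device, file_root)
    start = time.perf_counter()
    for _ in range(iterations):
        pipe.run()
    return time.perf_counter() - start
```

Calling time_reader("cpu", root) and time_reader("gpu", root) on the same directory of .npy files gives the two numbers I would like to understand.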

Thank you for your assistance.
