Practical DGX management for a multi-user system

Hi There,

Our new DGX-2 system will soon arrive at our faculty, and staff and students are eager to work on it. We are currently discussing how to manage the system so that all users have a good experience. We all plan to run our software in Docker containers and to use our existing LDAP user management.

Here are some things we would like ideas about:

  • Is there a simple way to give one Docker container the highest hardware priority for benchmarking while others already have AI training jobs running in their containers? We would like to run benchmarks with 100% of the hardware available to the container.
  • How is DIGITS actually used on a multi-user system? Right now we run it on local machines behind reverse proxies so we can access everything safely from anywhere. Do you usually run just one DIGITS instance on a DGX-2, with everybody trusting each other not to move jobs around?
  • What does the deviceQuery sample output look like on a DGX-2 system?
  • When running a TensorRT Inference Server, is there any login mechanism, or should we consider reverse proxies here as well?

I know these questions are not purely DGX-2 related, but I hope you can answer them anyway. So far I couldn’t find a DGX-2 support channel where such questions could be answered before the DGX is in house.

Wow, lots of good questions!

Is there a simple way to give one Docker container the highest hardware priority for benchmarking while others already have AI training jobs running in their containers? We would like to run benchmarks with 100% of the hardware available to the container.

I’d strongly encourage you to run just one compute task per GPU, especially for a benchmark run. In practice, that means giving each container its own set of GPUs that doesn’t overlap with those of any other running container. See https://github.com/NVIDIA/nvidia-docker/wiki/Usage for details; you’d do something like:

docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=0,1 my1stcontainer:latest

and for the next do

docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=2,3 my2ndcontainer:latest

and so on, so that each container’s NVIDIA_VISIBLE_DEVICES doesn’t overlap with any other’s. Which leads to your 2nd question…
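As a rough sketch of the non-overlapping assignment above, a small shell helper can compute each container's device list from a slot number instead of typing it by hand. The function names gpu_slice and gpu_run_cmd are made up for illustration, and the script only prints the docker commands rather than running them:

```shell
#!/bin/sh
# gpu_slice SLOT SIZE: print the comma-separated GPU indices of the
# SLOT-th (0-based) group of SIZE consecutive GPUs.
# Note: POSIX sh has no local variables, so these names are global.
gpu_slice() {
    slot=$1; size=$2
    first=$((slot * size))
    last=$((first + size - 1))
    list=$first
    i=$((first + 1))
    while [ "$i" -le "$last" ]; do
        list="$list,$i"
        i=$((i + 1))
    done
    echo "$list"
}

# gpu_run_cmd SLOT SIZE IMAGE: print (not run) the docker command for that slice.
gpu_run_cmd() {
    echo "docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=$(gpu_slice "$1" "$2") $3"
}

gpu_run_cmd 0 2 my1stcontainer:latest
gpu_run_cmd 1 2 my2ndcontainer:latest
```

Because slot N of a given size never shares an index with slot N+1, two users who agree on a slice size and pick different slots cannot collide. Mixing slice sizes breaks that guarantee, so this only helps if everyone uses the same size.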

How is DIGITS actually used on a multi-user system? Right now we run it on local machines behind reverse proxies so we can access everything safely from anywhere. Do you usually run just one DIGITS instance on a DGX-2, with everybody trusting each other not to move jobs around?

While in theory DIGITS can be multi-user, most usage I’ve seen runs multiple instances, each with its own data mapped into the container (and/or shared data mapped in if you’ve got a central data pool). For example, user 1 would run something like:

docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=0,1 -v /home/userone/private:/private -v /mnt/datastore:/public nvcr.io/nvidia/digits:19.07-tensorflow

User 2 would do the same, but with different GPUs and a different private data store:

docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=2,3 -v /home/usertwo/private:/private -v /mnt/datastore:/public nvcr.io/nvidia/digits:19.07-tensorflow
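To reach each user's DIGITS instance from a browser (or from a reverse proxy), you would typically also publish its web port on a distinct host port. A minimal sketch, assuming the DIGITS container serves its web UI on port 5000 (please verify against the container's documentation); the digits_cmd helper and the host ports 5001/5002 are made up for illustration, and the script prints the commands instead of running them:

```shell
#!/bin/sh
# digits_cmd USER GPUS HOST_PORT: print a per-user DIGITS launch command.
# Assumption: DIGITS listens on port 5000 inside the container.
digits_cmd() {
    user=$1; gpus=$2; port=$3
    echo "docker run --runtime=nvidia --rm -ti" \
         "-e NVIDIA_VISIBLE_DEVICES=$gpus" \
         "-p $port:5000" \
         "-v /home/$user/private:/private" \
         "-v /mnt/datastore:/public" \
         "nvcr.io/nvidia/digits:19.07-tensorflow"
}

digits_cmd userone 0,1 5001
digits_cmd usertwo 2,3 5002
```

Each user then gets their own URL (host port), their own GPUs, and their own private volume, while the shared datastore is mounted into every instance.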

This general scheme of allocating and tracking GPUs manually works fine for small teams; many customers just use a Google Doc, a whiteboard, or whatever to track who is using which GPUs. It obviously doesn’t scale as the team and the number of machines grow, at which point you’d want to look at a job scheduler such as Slurm or Kubernetes. NVIDIA has put together some scripts that make configuring and using those easier: https://github.com/NVIDIA/deepops

As a DGX customer, you also get NVIDIA Enterprise Support for issues you might run into with DeepOps. (If you’re not a DGX customer, then please just open a GitHub issue and the team will try to figure out what’s broken as best they can.)
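For a small team, the whiteboard can also be replaced by a few lines of shell. This is only a sketch of the manual tracking idea, not anything that DeepOps or NVIDIA ships: the claim_gpus and release_gpus functions and the /tmp/gpu-locks directory are hypothetical, and it relies on mkdir failing atomically when the directory already exists.

```shell
#!/bin/sh
# Hypothetical shared directory; every user needs write access to it.
GPU_LOCK_DIR=${GPU_LOCK_DIR:-/tmp/gpu-locks}

# claim_gpus OWNER GPU...: reserve every listed GPU, or none of them.
claim_gpus() {
    owner=$1; shift
    mkdir -p "$GPU_LOCK_DIR" || return 1
    claimed=""
    for gpu in "$@"; do
        # mkdir is atomic: it fails if someone already holds this GPU.
        if mkdir "$GPU_LOCK_DIR/gpu$gpu" 2>/dev/null; then
            echo "$owner" > "$GPU_LOCK_DIR/gpu$gpu/owner"
            claimed="$claimed $gpu"
        else
            echo "GPU $gpu is held by $(cat "$GPU_LOCK_DIR/gpu$gpu/owner" 2>/dev/null)" >&2
            # Roll back the GPUs claimed so far (unquoted on purpose: word splitting).
            release_gpus $claimed
            return 1
        fi
    done
}

# release_gpus GPU...: give the listed GPUs back.
release_gpus() {
    for gpu in "$@"; do
        rm -rf "$GPU_LOCK_DIR/gpu$gpu"
    done
}
```

A user would run something like `claim_gpus userone 0 1` before starting a container with NVIDIA_VISIBLE_DEVICES=0,1 and `release_gpus 0 1` afterwards. The scheme is purely advisory: nothing stops a container from using GPUs it never claimed, which is one reason a real scheduler becomes attractive as the team grows.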

What does the deviceQuery sample output look like on a DGX-2 system?

(It’s obviously very long, so will list it in the next response to keep this one semi-readable!)

When running a TensorRT Inference Server, is there any login mechanism, or should we consider reverse proxies here as well?

You should put whatever authentication, IDS, etc. is appropriate for your environment in front of TRTIS (the TensorRT Inference Server).
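One common way to make sure clients can only reach the server through such a front end is to publish the container's ports on the loopback interface only, so that a reverse proxy running on the same host is the sole entry point. A minimal sketch, assuming TRTIS uses its default ports 8000 (HTTP) and 8001 (gRPC); the publish_local helper is hypothetical, and the image name and server command are left as placeholders to fill in from the TRTIS documentation:

```shell
#!/bin/sh
# publish_local PORT...: print docker -p flags that bind each port to
# 127.0.0.1 only, so it is not reachable from other machines directly.
publish_local() {
    flags=""
    for port in "$@"; do
        flags="$flags -p 127.0.0.1:$port:$port"
    done
    # Strip the leading space added by the first iteration.
    echo "${flags# }"
}

# Print (not run) the resulting command; <trtis-image> and <server-command>
# are placeholders, not real names.
echo "docker run --runtime=nvidia --rm $(publish_local 8000 8001) <trtis-image> <server-command>"
```

The reverse proxy then listens on the public interface, handles login and TLS, and forwards to 127.0.0.1:8000 or 127.0.0.1:8001.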

Here’s the cut-down deviceQuery sample. It’s not very interesting, since the DGX-2 architecture with NVSwitch means that every GPU can communicate with every other! (This is from a DGX-2H, so some of the GPU-specific details may differ slightly from your DGX-2.)

root@196d745bcfa1:/usr/local/cuda/samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 16 CUDA Capable device(s)

Device 0: "Tesla V100-SXM3-32GB-H"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 32480 MBytes (34058272768 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1702 MHz (1.70 GHz)
  Memory Clock rate:                             1107 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 4 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 52 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

[snip]

Device 15: "Tesla V100-SXM3-32GB-H"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 32480 MBytes (34058272768 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1702 MHz (1.70 GHz)
  Memory Clock rate:                             1107 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 4 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 231 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU1) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU2) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU3) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU4) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU5) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU6) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU7) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU8) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU9) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU10) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU11) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU12) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU13) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU14) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU15) : Yes
[snip]
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU0) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU1) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU2) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU3) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU4) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU5) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU6) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU7) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU8) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU9) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU10) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU11) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU12) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU13) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU14) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 16
Result = PASS

What is more interesting is the topology showing 6x NVLink connections between all of the GPUs:

# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    GPU8    GPU9    GPU10   GPU11   GPU12   GPU13   GPU14   GPU15   mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  mlx5_6  mlx5_7  mlx5_8  mlx5_9  CPU Affinity
GPU0     X      NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     PIX     PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     0-23,48-71
GPU1    NV6      X      NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     PIX     PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     0-23,48-71
GPU2    NV6     NV6      X      NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     PXB     PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     0-23,48-71
GPU3    NV6     NV6     NV6      X      NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     PXB     PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     0-23,48-71
GPU4    NV6     NV6     NV6     NV6      X      NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NODE    NODE    PIX     PXB     SYS     SYS     SYS     SYS     SYS     SYS     0-23,48-71
GPU5    NV6     NV6     NV6     NV6     NV6      X      NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NODE    NODE    PIX     PXB     SYS     SYS     SYS     SYS     SYS     SYS     0-23,48-71
GPU6    NV6     NV6     NV6     NV6     NV6     NV6      X      NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NODE    NODE    PXB     PIX     SYS     SYS     SYS     SYS     SYS     SYS     0-23,48-71
GPU7    NV6     NV6     NV6     NV6     NV6     NV6     NV6      X      NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NODE    NODE    PXB     PIX     SYS     SYS     SYS     SYS     SYS     SYS     0-23,48-71
GPU8    NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6      X      NV6     NV6     NV6     NV6     NV6     NV6     NV6     SYS     SYS     SYS     SYS     NODE    NODE    PIX     PXB     NODE    NODE    24-47,72-95
GPU9    NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6      X      NV6     NV6     NV6     NV6     NV6     NV6     SYS     SYS     SYS     SYS     NODE    NODE    PIX     PXB     NODE    NODE    24-47,72-95
GPU10   NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6      X      NV6     NV6     NV6     NV6     NV6     SYS     SYS     SYS     SYS     NODE    NODE    PXB     PIX     NODE    NODE    24-47,72-95
GPU11   NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6      X      NV6     NV6     NV6     NV6     SYS     SYS     SYS     SYS     NODE    NODE    PXB     PIX     NODE    NODE    24-47,72-95
GPU12   NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6      X      NV6     NV6     NV6     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PXB     24-47,72-95
GPU13   NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6      X      NV6     NV6     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PXB     24-47,72-95
GPU14   NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6      X      NV6     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB     PIX     24-47,72-95
GPU15   NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6     NV6      X      SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB     PIX     24-47,72-95
mlx5_0  PIX     PIX     PXB     PXB     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS
mlx5_1  PXB     PXB     PIX     PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS
mlx5_2  NODE    NODE    NODE    NODE    PIX     PIX     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      PXB     SYS     SYS     SYS     SYS     SYS     SYS
mlx5_3  NODE    NODE    NODE    NODE    PXB     PXB     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PXB      X      SYS     SYS     SYS     SYS     SYS     SYS
mlx5_4  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      PIX     NODE    NODE    NODE    NODE
mlx5_5  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX      X      NODE    NODE    NODE    NODE
mlx5_6  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PXB     PXB     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      PXB     NODE    NODE
mlx5_7  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     PIX     PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    PXB      X      NODE    NODE
mlx5_8  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PXB     PXB     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X      PXB
mlx5_9  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB     PXB     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PXB      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
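The CPU Affinity column is also useful when launching containers: keeping a container's CPU threads on the NUMA node closest to its GPUs avoids crossing the inter-socket link for host-to-device traffic. A minimal sketch, with the core ranges hard-coded from the topology output above (they come from this particular DGX-2H, so check `nvidia-smi topo -m` on your own system); gpu_cpuset is a hypothetical helper, `--cpuset-cpus` is the standard docker option for restricting a container's CPUs, and the script only prints the command:

```shell
#!/bin/sh
# gpu_cpuset GPU_INDEX: print the CPU list that is local to that GPU,
# using the ranges from the "CPU Affinity" column of the topology table.
gpu_cpuset() {
    if [ "$1" -ge 0 ] && [ "$1" -le 7 ]; then
        echo "0-23,48-71"
    elif [ "$1" -ge 8 ] && [ "$1" -le 15 ]; then
        echo "24-47,72-95"
    else
        echo "unknown GPU index: $1" >&2
        return 1
    fi
}

# Print (not run) a launch command that pins the container to the cores
# local to GPUs 8 and 9.
echo "docker run --runtime=nvidia --rm -ti --cpuset-cpus=$(gpu_cpuset 8) -e NVIDIA_VISIBLE_DEVICES=8,9 my1stcontainer:latest"
```

A container whose GPUs span both halves of the system (for example 7 and 8) has no single local CPU set, so it is simplest to hand out GPUs in groups that stay within 0-7 or 8-15.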

With every GPU connected to every other GPU over NVLink, tests such as p2pBandwidthLatencyTest report nearly uniform bandwidth between any pair of GPUs:

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
     0 916.96 263.93 265.66 265.20 265.38 268.34 265.47 266.01 263.52 262.59 265.99 265.99 265.64 265.64 267.72 268.18
     1 263.85 922.37 267.57 266.68 264.57 267.45 266.55 264.92 263.30 263.84 267.20 266.91 266.34 264.91 266.93 266.91
     2 265.11 265.20 922.37 267.80 265.29 269.30 266.19 269.50 264.87 264.63 268.37 268.09 265.64 267.33 269.66 268.93
     3 265.47 264.39 267.47 920.20 265.92 268.20 266.37 266.73 263.84 263.66 266.72 266.72 266.54 266.72 267.63 267.64
     4 265.28 264.48 267.60 265.65 916.96 269.04 265.47 267.30 264.63 264.70 267.98 265.63 265.45 264.91 268.13 268.21
     5 266.55 265.46 269.87 268.75 267.09 924.56 266.55 268.94 263.47 265.89 269.48 268.82 265.81 266.35 269.11 269.11
     6 266.01 264.39 267.28 266.65 264.39 267.10 924.56 265.65 264.01 265.14 267.09 266.90 265.82 266.72 266.90 266.72
     7 265.28 264.57 269.50 265.47 265.29 268.57 265.57 915.89 264.19 264.73 268.56 268.74 266.54 267.16 267.45 268.19
     8 263.63 263.53 266.84 266.34 263.56 266.90 266.18 266.30 922.37 264.03 266.78 266.97 266.04 264.75 266.56 267.11
     9 264.18 264.00 266.87 264.52 264.72 267.80 266.34 266.91 264.03 923.46 266.54 264.03 264.75 264.57 265.86 267.28
    10 266.59 264.89 268.16 265.41 265.44 269.47 267.46 268.90 265.29 264.67 925.65 267.56 266.37 267.10 269.12 268.94
    11 266.04 264.54 267.62 266.96 265.08 268.36 267.57 268.53 265.28 263.85 268.02 924.56 268.00 265.29 268.94 268.94
    12 266.10 263.82 267.07 266.71 265.86 267.25 265.44 266.16 264.03 264.38 267.83 267.75 924.56 266.01 266.55 266.73
    13 265.80 264.18 267.62 266.71 265.03 267.44 266.89 267.07 264.21 263.85 267.10 266.55 266.09 922.37 268.64 268.97
    14 265.61 265.02 268.92 266.52 265.98 269.09 265.62 267.75 264.92 263.32 269.12 267.10 265.47 266.28 923.46 268.76
    15 265.98 265.04 269.27 266.71 266.16 268.80 265.44 267.82 264.75 265.29 269.05 268.94 265.83 266.92 268.74 913.74
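To check how uniform a matrix like this is without reading all the entries, the GPU-to-GPU (off-diagonal) values can be summarized with a short awk filter. This is just a sketch: offdiag_stats is a made-up name, and it assumes the input contains a single matrix in the layout shown above (a header line, then rows that start with the device index). The demo input is the top-left 3x3 corner of the matrix above.

```shell
#!/bin/sh
# offdiag_stats: read one bandwidth matrix on stdin and print the minimum,
# maximum and mean of the off-diagonal entries.
offdiag_stats() {
    awk '
        # Data rows start with the device index; header lines are skipped.
        $1 ~ /^[0-9]+$/ {
            row = $1
            for (col = 0; col < NF - 1; col++) {
                if (col == row) continue      # skip the same-device entry
                v = $(col + 2) + 0
                if (n == 0 || v < min) min = v
                if (n == 0 || v > max) max = v
                sum += v
                n++
            }
        }
        END { if (n) printf "min=%.2f max=%.2f mean=%.2f\n", min, max, sum / n }
    '
}

offdiag_stats <<'EOF'
   D\D     0      1      2
     0 916.96 263.93 265.66
     1 263.85 922.37 267.57
     2 265.11 265.20 922.37
EOF
```

The full output of p2pBandwidthLatencyTest contains several matrices (unidirectional, bidirectional, latency), so cut out the one you care about before piping it in; otherwise their rows would be mixed together.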