Docker-instance-based usage analytics on a DGX machine

Hello Team,

I have set up Docker on my DGX machine and have been using multiple Docker sessions for different tasks.
I want to evaluate the GPU usage per Docker session for each user. How can I do that?
nvidia-smi only gives overall usage statistics. The login is the same for every user, so I need to measure how much each Docker session consumes. Could you please suggest a solution?

Hi @Rudra ,

Are the users sharing the GPUs? Meaning, do you have two concurrent Docker sessions running, both using GPU0?

Hello @ScottEllis,

Yes. The users are sharing the GPUs.

Using the following command, each team member creates a Docker session and runs their AI pipelines in it.

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=4,5 -v /path/users/name:/work --memory="32g" -p 5000:8888 -dit --rm --name mydocker nvcr.io/nvidia/tensorflow:20.03-tf2-py3

The only bottleneck is that I am not able to track their individual GPU usage (and gather analytics on it).

It sounds like you and your team are at the point where manually doing a docker run ... is going to mean increasing pain, e.g., once you want to monitor usage, jobs, etc. There are easier ways.
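
As a stopgap with what you already have, something like the sketch below might get you a per-container snapshot. It is only a rough, untested sketch: it takes the compute processes nvidia-smi reports and maps each host PID back to the owning container through /proc/<pid>/cgroup, so the grep pattern may need adjusting for your Docker/cgroup version.

#!/usr/bin/env bash
# Rough sketch: snapshot GPU compute processes and map each host PID back to
# the Docker container that owns it. Assumes the host's nvidia-smi and docker
# CLI; the cgroup path layout varies with Docker/cgroup versions.
nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader,nounits |
while IFS=',' read -r gpu_uuid pid mem; do
  pid=$(echo "$pid" | xargs)    # trim whitespace around the PID
  mem=$(echo "$mem" | xargs)
  cname=""
  # Pull a 64-character container ID out of the process's cgroup path, if present
  cid=$(grep -oE '[0-9a-f]{64}' "/proc/${pid}/cgroup" 2>/dev/null | head -n1)
  if [ -n "$cid" ]; then
    cname=$(docker inspect --format '{{.Name}}' "$cid" 2>/dev/null | sed 's|^/||')
  fi
  echo "gpu=${gpu_uuid} pid=${pid} mem=${mem}MiB container=${cname:-host/unknown}"
done

Note that this only gives you a point-in-time view of GPU memory per process, not utilization accumulated over a session, which is exactly why a scheduler makes this easier.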

What about moving the jobs to something like Slurm? Then you can use all of the Slurm accounting and analytics to capture and control usage. Deploying Rich Cluster API on DGX for Multi-User Sharing | NVIDIA Developer Blog describes a quick-and-dirty, "does not need anything else" way to deploy it.
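
Once jobs go through Slurm with accounting enabled, a query along these lines gives you per-user, per-job records (a sketch, assuming GPUs are tracked as a TRES in your Slurm accounting configuration; adjust the start time and fields to taste):

sacct --allusers -X --starttime today --format=JobID,User,Elapsed,AllocTRES%50,State

AllocTRES will then show entries like gres/gpu=2 alongside CPU and memory, so you can attribute GPU allocations per user and per job.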

ScottE


Hello @ScottEllis,

Thank you for sharing this. Let me check it out and see if it helps :)
Best,