Training Multiple Models on One GPU in Linux

mariano13 · November 3, 2022, 12:49am

Hi,
My organization has a computer cluster running Linux. Each node has a single A100 GPU.

I would like to know what issues can arise from training multiple models on the same node. Basically, I follow these steps:

Log in to the node and use screen or tmux twice to open two Linux shells on it.
In each shell, I run a Python script that uses PyTorch with GPU support to train a model.
The models are independent and the processes don’t talk to each other at all. Each model uses 20% of the GPU memory and 27% GPU-Util, as reported by running nvidia-smi on the node.
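
For reference, the steps above amount to roughly the following minimal sketch, which starts the processes from one launcher instead of two screen/tmux shells. The script names `train_a.py` and `train_b.py` and the helpers `launch_all` and `wait_all` are placeholders I made up for illustration, and pinning the processes with `CUDA_VISIBLE_DEVICES` is an assumption on my side, not something the cluster requires.

```python
import os
import subprocess
import sys


def launch_all(commands, gpu_index="0"):
    """Start each command as an independent OS process that sees one GPU.

    Every process creates its own CUDA context, exactly as when the
    scripts are started by hand from separate shells.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_index)
    return [subprocess.Popen(cmd, env=env) for cmd in commands]


def wait_all(procs):
    """Block until every process has exited and return the exit codes."""
    return [p.wait() for p in procs]


# Hypothetical usage (train_a.py and train_b.py are placeholder names):
#   procs = launch_all([[sys.executable, "train_a.py"],
#                       [sys.executable, "train_b.py"]])
#   print(wait_all(procs))
```

The processes stay fully independent, so this should behave the same as my two-shell setup; it only makes the runs easier to start and to wait on.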

Questions:

Is there a more efficient way of doing this, or is this the best approach?
What can I read to understand how the GPU handles this kind of concurrent processing?
How does the GPU schedule the work submitted by each shell: fully in parallel or sequentially?

Thanks!
