My organization has a computer cluster running Linux. Each node has a single A100 GPU.
I want to know what issues arise from training multiple models on the same node. Basically, I follow these steps:
- Log in to the node, then use screen or tmux to open two Linux shells on it.
- In each shell, run a Python script that uses PyTorch with GPU support to train a model.
The models are independent and the processes don't talk to each other at all. Each model uses 20% of the GPU memory and 27% of the GPU utilization reported by running nvidia-smi on the node.
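For reference, the two-shell workflow above can be collapsed into a single launch script using background jobs. This is only a sketch: `train_a.py` and `train_b.py` are hypothetical script names, and `echo` stands in for the real `python` invocations so the sketch runs without a GPU.

```shell
#!/usr/bin/env bash
# Launch two independent training runs on the same node (and thus the
# same A100). Each run is a separate OS process with its own CUDA context.
run_training() {
    # Placeholder for: python "$1"
    echo "training $1 on GPU 0"
}

run_training train_a.py &   # first model, background job
pid_a=$!
run_training train_b.py &   # second model, background job
pid_b=$!

wait "$pid_a" "$pid_b"      # block until both runs finish
echo "both runs finished"
```

This avoids keeping two interactive shells open, though screen/tmux has the advantage of letting you reattach and watch each run's output live.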
- Is there a more efficient way of doing this, or is this the best approach?
- What can I read to understand how the GPU handles this concurrent processing?
- How does the GPU organize the tasks submitted by each shell? Fully in parallel or sequentially?