Hello everyone,
I have recently been training and fine-tuning large language models with PyTorch. As part of this, I would like to better understand and monitor inter-GPU communication, the transfer of parameters and operators between devices, and the usage of both GPU and CPU memory.
Specifically, I am facing the following challenges:
- How can I monitor the communication and data transfer in a multi-GPU environment?
- What are the best practices for accurately measuring and optimizing my model's GPU memory usage?
- Are there tools or techniques that can help me better monitor the usage of CPU memory during training?
I have tried using nvidia-smi to observe GPU usage, but the information it provides is fairly limited; I am looking for a more detailed analysis, especially in a distributed training context. For reference, here is roughly what I have been experimenting with so far for each of the three points above.
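For the first point, what I have tried so far is wrapping a few DDP training steps in torch.profiler, since the NCCL communication kernels (e.g. the gradient all-reduce during backward) then show up in the trace next to the compute kernels. Below is a minimal sketch of that setup; it assumes a launch via torchrun with a working NCCL backend, and the model, batch size, and step count are just placeholders for my real training loop.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import ProfilerActivity, profile


def main() -> None:
    # Assumes launch via: torchrun --nproc_per_node=<num_gpus> profile_comm.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Placeholder model standing in for the real LLM.
    model = DDP(nn.Linear(4096, 4096).to(device), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        profile_memory=True,  # also tracks tensor allocations on CPU and GPU
    ) as prof:
        for _ in range(3):
            x = torch.randn(64, 4096, device=device)
            loss = model(x).sum()
            loss.backward()  # DDP overlaps the NCCL all-reduce with backward here
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)

    # Communication kernels appear as nccl* rows in the table and in the trace.
    if dist.get_rank() == 0:
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
    prof.export_chrome_trace(f"trace_rank{dist.get_rank()}.json")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The resulting trace_rank*.json files open in chrome://tracing (or ui.perfetto.dev), which makes it reasonably easy to see whether the all-reduce overlaps with the backward compute, but I am not sure this is the right way to quantify inter-GPU data transfer.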
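For the second point, the allocator-level counters in torch.cuda already go further than nvidia-smi (which reports the whole process footprint, including the CUDA context), but I am not sure I am using them the right way. A rough sketch of what I currently log around a training step:

```python
import torch


def report_gpu_memory(tag: str, device: int = 0) -> None:
    # allocated = memory held by live tensors; reserved = memory held by
    # PyTorch's caching allocator (what the process keeps from the driver).
    allocated = torch.cuda.memory_allocated(device) / 2**20
    reserved = torch.cuda.memory_reserved(device) / 2**20
    peak = torch.cuda.max_memory_allocated(device) / 2**20
    print(f"[{tag}] allocated={allocated:.1f} MiB  "
          f"reserved={reserved:.1f} MiB  peak_allocated={peak:.1f} MiB")


torch.cuda.reset_peak_memory_stats()
report_gpu_memory("before step")
# ... forward / backward / optimizer.step() of the real training loop ...
report_gpu_memory("after step")

# Detailed per-pool breakdown, which I look at when I suspect fragmentation:
print(torch.cuda.memory_summary())
```

I have also seen torch.cuda.memory._record_memory_history() together with the memory_viz tool mentioned for getting an allocation timeline, but since it is a private API I have not relied on it yet.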
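For the third point, I currently just poll the resident set size of the training process with psutil (an extra dependency) every few steps, along these lines:

```python
import os

import psutil  # external dependency: pip install psutil

_process = psutil.Process(os.getpid())


def report_cpu_memory(tag: str) -> None:
    # RSS of this process only; DataLoader workers (num_workers > 0) are
    # separate processes, so their memory has to be collected separately,
    # e.g. by iterating over _process.children(recursive=True).
    rss_mib = _process.memory_info().rss / 2**20
    print(f"[{tag}] host RSS = {rss_mib:.1f} MiB")


# e.g. called every N steps inside the training loop:
report_cpu_memory("step 100")
```

This works, but it feels ad hoc. Are there more systematic tools or workflows you would recommend for any of these three points, especially in the distributed setting?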