Hi all,
I have run into a problem when using DPDK and doing deep learning model training on the same machine. I share the machine with other users, and they need to run DPDK on it. However, when I train a deep learning model such as ResNet50 across 2 nodes with 4 GPUs (2 GPUs per node), the utilization of the second GPU stays at 0 for the whole training process, so the training speed drops. I also see a decrease in training quality, i.e. worse loss on the training dataset and worse accuracy on the test dataset.

After checking the kernel information, I traced the problem to intel-iommu: the multi-node multi-GPU training problem disappears after turning intel-iommu off, but DPDK no longer works properly after that. Is there a way to run DPDK and multi-node multi-GPU model training on the same machine, or have you met a similar situation?
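For reference, this is roughly the kind of check I mean by "checking the kernel information" (a minimal sketch, assuming only the standard Linux /proc/cmdline and /sys/kernel/iommu_groups interfaces; the script itself is just for illustration):

# Minimal sketch: report whether the Intel IOMMU is active on this node,
# using only the standard Linux /proc and /sys interfaces.
from pathlib import Path

def iommu_status():
    cmdline = Path("/proc/cmdline").read_text().split()
    # Boot parameters such as intel_iommu=on/off or iommu=pt show up here.
    iommu_params = [p for p in cmdline if "iommu" in p]
    # Each active IOMMU group is a directory under /sys/kernel/iommu_groups;
    # with intel-iommu disabled this directory is empty (or absent).
    groups = list(Path("/sys/kernel/iommu_groups").glob("*"))
    print("IOMMU-related boot parameters:", iommu_params or "none")
    print("Active IOMMU groups:", len(groups))

if __name__ == "__main__":
    iommu_status()

Running this before and after changing the kernel boot parameters confirms whether the intel-iommu setting actually took effect on each node.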