Hi all,
I have run into a problem when using DPDK and doing deep learning model training on the same machine. I share the machine with other users, and they need to run DPDK on it. However, when I train a deep learning model such as ResNet50 across 2 nodes with 4 GPUs (2 GPUs per node), the utilization of the second GPU stays at 0 for the whole training process, so the training speed drops. I also see a decrease in training quality, i.e. worse loss on the training dataset and worse accuracy on the test dataset.

After checking the kernel information, I traced the problem to intel-iommu: the multi-node multi-GPU training problem disappears after turning intel-iommu off, but DPDK no longer works properly after that. Is there a way to run DPDK and multi-node multi-GPU model training on the same machine, or have you met a similar situation?
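For reference, this is roughly the kind of check I mean by "checking the kernel information" (a minimal sketch, assuming only the standard Linux /proc/cmdline and /sys/kernel/iommu_groups interfaces; the script itself is just for illustration):

# Minimal sketch: report whether the Intel IOMMU is active on this node,
# using only the standard Linux /proc and /sys interfaces.
from pathlib import Path

def iommu_status():
    cmdline = Path("/proc/cmdline").read_text().split()
    # Boot parameters such as intel_iommu=on/off or iommu=pt show up here.
    iommu_params = [p for p in cmdline if "iommu" in p]
    # Each active IOMMU group is a directory under /sys/kernel/iommu_groups;
    # with intel-iommu disabled this directory is empty (or absent).
    groups = list(Path("/sys/kernel/iommu_groups").glob("*"))
    print("IOMMU-related boot parameters:", iommu_params or "none")
    print("Active IOMMU groups:", len(groups))

if __name__ == "__main__":
    iommu_status()

Running this before and after changing the kernel boot parameters confirms whether the intel-iommu setting actually took effect on each node.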