I do not have physical access to the hardware. Is there a way to disable all peer-to-peer communication (PCIe, NVLink) at the OS level so that no software will use it? (Here “software” mostly means PyTorch, TensorFlow, etc.)
This is for a test, not for actual production.
EDIT: Alternatively, if that’s not doable, can I disable just NVLink from within the OS, so all P2P goes over PCIe?
I don't think so. NVBit is an instrumentation library (.so) that you would just need to LD_PRELOAD before you kick off PyTorch. There are some starter projects on the NVlabs/NVBit Releases page on GitHub that serve as examples. They have a good paper, and a header file that explains all the hooks. With this, I don't think you need to know anything about the underlying application.
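To make the LD_PRELOAD step concrete, here is a minimal sketch of launching an unmodified program with a tool library preloaded. The helper names `build_preload_env` and `run_with_tool`, and the paths in the usage comment, are hypothetical and only for illustration; nothing here is part of NVBit itself, and it assumes you have already built a tool `.so` from one of the starter projects.

```python
import os
import subprocess


def build_preload_env(tool_so, base_env=None):
    """Return a copy of the environment with tool_so prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep any libraries that were already being preloaded.
    env["LD_PRELOAD"] = f"{tool_so}:{existing}" if existing else tool_so
    return env


def run_with_tool(tool_so, command):
    """Run command (a list of arguments) with the tool preloaded; return its exit code."""
    return subprocess.run(command, env=build_preload_env(tool_so)).returncode


# Hypothetical usage (both paths are placeholders, not real files):
# run_with_tool("/path/to/my_nvbit_tool.so", ["python", "train.py"])
```

This is equivalent to running `LD_PRELOAD=/path/to/my_nvbit_tool.so python train.py` from a shell; the wrapper is only convenient if the test is already driven from Python.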