Hi, the inference speed on my pod (16x H100) dropped from 30 tokens/s to 10 tokens/s. At the same time, dmesg is flooded with messages like:
[235715.381986] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Input mask contains a GPU on which NVLink is disabled.
[235715.382022] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Input mask contains a GPU on which NVLink is disabled.
[235715.382053] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Input mask contains a GPU on which NVLink is disabled.
[235715.382085] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Input mask contains a GPU on which NVLink is disabled.
[235716.650295] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Input mask contains a GPU on which NVLink is disabled.
[235716.650348] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Input mask contains a GPU on which NVLink is disabled.
[235716.650380] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Input mask contains a GPU on which NVLink is disabled.
[235716.650411] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Input mask contains a GPU on which NVLink is disabled.
It seems about five more of these messages appear every second. I can't find any information about this on Google. Please help me.
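In case it helps with diagnosis, this is a small sketch of the checks I plan to run on the pod to gather per-GPU NVLink state. It assumes nvidia-smi is on the PATH; the "inactive" string match is just my guess at what to look for, not an official output format.

#!/usr/bin/env python3
# Sketch of NVLink diagnostics to gather more info, assuming nvidia-smi is available.
import subprocess

def run(cmd):
    # Run a command and return its stdout as text (raises if the tool is missing).
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Per-GPU NVLink link state.
nvlink_status = run(["nvidia-smi", "nvlink", "--status"])
print(nvlink_status)

# GPU topology matrix: shows whether GPUs still reach each other over NVLink (NV#)
# or have fallen back to PCIe/SYS paths.
print(run(["nvidia-smi", "topo", "-m"]))

# Flag any link lines reported as inactive, which might correspond to the
# "NVLink is disabled" messages in dmesg.
for line in nvlink_status.splitlines():
    if "inactive" in line.lower():
        print("Possible disabled link:", line.strip())

I can post the output of these commands if that would help narrow it down.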