CUDA initialization error on 8x A100 GPU HGX server

I am new to this and am getting a CUDA initialization error while setting up my first 8x A100 HGX server (running RHEL 7.9). DCGM also can't find any NVSwitches. Could you please advise how I can troubleshoot and fix the problem? Thank you so much!

./deviceQuery

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL

nvidia-smi

Sat Apr 22 10:18:26 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           On | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0               61W / 400W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB           On | 00000000:0A:00.0 Off |                    0 |
| N/A   29C    P0               61W / 400W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB           On | 00000000:44:00.0 Off |                    0 |
| N/A   29C    P0               59W / 400W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB           On | 00000000:4A:00.0 Off |                    0 |
| N/A   32C    P0               60W / 400W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB           On | 00000000:84:00.0 Off |                    0 |
| N/A   31C    P0               51W / 400W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB           On | 00000000:8A:00.0 Off |                    0 |
| N/A   29C    P0               61W / 400W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB           On | 00000000:C0:00.0 Off |                    0 |
| N/A   29C    P0               61W / 400W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB           On | 00000000:C3:00.0 Off |                    0 |
| N/A   32C    P0               62W / 400W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

dcgmi diag -r 3

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.7                                          |
| Driver Version Detected   | 530.30.02                                      |
| GPU Device IDs Detected   | 20b2,20b2,20b2,20b2,20b2,20b2,20b2,20b2        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe | Fail - All |
| Warning | GPU 0 Error using CUDA API cudaDeviceGetByPCI |
| | BusId ‘initialization error’ for GPU 0, bus I |
| | D = 00000000:07:00.0 |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory | Fail - All |
| Warning | GPU 0 Error using CUDA API cuInit Unable to i |
| | nitialize CUDA library: 'initialization error |
| | ‘.; verify that the fabric-manager has been s |
| | tarted if applicable, GPU 0 Error using CUDA |
| | API cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fab |
| | ric-manager has been started if applicable, G |
| | PU 0 Error using CUDA API cuInit Unable to in |
| | itialize CUDA library: ‘initialization error’ |
| | .; verify that the fabric-manager has been st |
| | arted if applicable, GPU 0 Error using CUDA A |
| | PI cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fabr |
| | ic-manager has been started if applicable, GP |
| | U 0 Error using CUDA API cuInit Unable to ini |
| | tialize CUDA library: ‘initialization error’. |
| | ; verify that the fabric-manager has been sta |
| | rted if applicable, GPU 0 Error using CUDA AP |
| | I cuInit Unable to initialize CUDA library: ’ |
| | initialization error’.; verify that the fabri |
| | c-manager has been started if applicable, GPU |
| | 0 Error using CUDA API cuInit Unable to init |
| | ialize CUDA library: 'initializat |
| Warning | GPU 1 Error using CUDA API cuInit Unable to i |
| | nitialize CUDA library: 'initialization error |
| | ‘.; verify that the fabric-manager has been s |
| | tarted if applicable, GPU 1 Error using CUDA |
| | API cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fab |
| | ric-manager has been started if applicable, G |
| | PU 1 Error using CUDA API cuInit Unable to in |
| | itialize CUDA library: ‘initialization error’ |
| | .; verify that the fabric-manager has been st |
| | arted if applicable, GPU 1 Error using CUDA A |
| | PI cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fabr |
| | ic-manager has been started if applicable, GP |
| | U 1 Error using CUDA API cuInit Unable to ini |
| | tialize CUDA library: ‘initialization error’. |
| | ; verify that the fabric-manager has been sta |
| | rted if applicable, GPU 1 Error using CUDA AP |
| | I cuInit Unable to initialize CUDA library: ’ |
| | initialization error’.; verify that the fabri |
| | c-manager has been started if applicable, GPU |
| | 1 Error using CUDA API cuInit Unable to init |
| | ialize CUDA library: 'initializat |
| Warning | GPU 2 Error using CUDA API cuInit Unable to i |
| | nitialize CUDA library: 'initialization error |
| | ‘.; verify that the fabric-manager has been s |
| | tarted if applicable, GPU 2 Error using CUDA |
| | API cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fab |
| | ric-manager has been started if applicable, G |
| | PU 2 Error using CUDA API cuInit Unable to in |
| | itialize CUDA library: ‘initialization error’ |
| | .; verify that the fabric-manager has been st |
| | arted if applicable, GPU 2 Error using CUDA A |
| | PI cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fabr |
| | ic-manager has been started if applicable, GP |
| | U 2 Error using CUDA API cuInit Unable to ini |
| | tialize CUDA library: ‘initialization error’. |
| | ; verify that the fabric-manager has been sta |
| | rted if applicable, GPU 2 Error using CUDA AP |
| | I cuInit Unable to initialize CUDA library: ’ |
| | initialization error’.; verify that the fabri |
| | c-manager has been started if applicable, GPU |
| | 2 Error using CUDA API cuInit Unable to init |
| | ialize CUDA library: 'initializat |
| Warning | GPU 3 Error using CUDA API cuInit Unable to i |
| | nitialize CUDA library: 'initialization error |
| | ‘.; verify that the fabric-manager has been s |
| | tarted if applicable, GPU 3 Error using CUDA |
| | API cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fab |
| | ric-manager has been started if applicable, G |
| | PU 3 Error using CUDA API cuInit Unable to in |
| | itialize CUDA library: ‘initialization error’ |
| | .; verify that the fabric-manager has been st |
| | arted if applicable, GPU 3 Error using CUDA A |
| | PI cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fabr |
| | ic-manager has been started if applicable, GP |
| | U 3 Error using CUDA API cuInit Unable to ini |
| | tialize CUDA library: ‘initialization error’. |
| | ; verify that the fabric-manager has been sta |
| | rted if applicable, GPU 3 Error using CUDA AP |
| | I cuInit Unable to initialize CUDA library: ’ |
| | initialization error’.; verify that the fabri |
| | c-manager has been started if applicable, GPU |
| | 3 Error using CUDA API cuInit Unable to init |
| | ialize CUDA library: 'initializat |
| Warning | GPU 4 Error using CUDA API cuInit Unable to i |
| | nitialize CUDA library: 'initialization error |
| | ‘.; verify that the fabric-manager has been s |
| | tarted if applicable, GPU 4 Error using CUDA |
| | API cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fab |
| | ric-manager has been started if applicable, G |
| | PU 4 Error using CUDA API cuInit Unable to in |
| | itialize CUDA library: ‘initialization error’ |
| | .; verify that the fabric-manager has been st |
| | arted if applicable, GPU 4 Error using CUDA A |
| | PI cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fabr |
| | ic-manager has been started if applicable, GP |
| | U 4 Error using CUDA API cuInit Unable to ini |
| | tialize CUDA library: ‘initialization error’. |
| | ; verify that the fabric-manager has been sta |
| | rted if applicable, GPU 4 Error using CUDA AP |
| | I cuInit Unable to initialize CUDA library: ’ |
| | initialization error’.; verify that the fabri |
| | c-manager has been started if applicable, GPU |
| | 4 Error using CUDA API cuInit Unable to init |
| | ialize CUDA library: 'initializat |
| Warning | GPU 5 Error using CUDA API cuInit Unable to i |
| | nitialize CUDA library: 'initialization error |
| | ‘.; verify that the fabric-manager has been s |
| | tarted if applicable, GPU 5 Error using CUDA |
| | API cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fab |
| | ric-manager has been started if applicable, G |
| | PU 5 Error using CUDA API cuInit Unable to in |
| | itialize CUDA library: ‘initialization error’ |
| | .; verify that the fabric-manager has been st |
| | arted if applicable, GPU 5 Error using CUDA A |
| | PI cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fabr |
| | ic-manager has been started if applicable, GP |
| | U 5 Error using CUDA API cuInit Unable to ini |
| | tialize CUDA library: ‘initialization error’. |
| | ; verify that the fabric-manager has been sta |
| | rted if applicable, GPU 5 Error using CUDA AP |
| | I cuInit Unable to initialize CUDA library: ’ |
| | initialization error’.; verify that the fabri |
| | c-manager has been started if applicable, GPU |
| | 5 Error using CUDA API cuInit Unable to init |
| | ialize CUDA library: 'initializat |
| Warning | GPU 6 Error using CUDA API cuInit Unable to i |
| | nitialize CUDA library: 'initialization error |
| | ‘.; verify that the fabric-manager has been s |
| | tarted if applicable, GPU 6 Error using CUDA |
| | API cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fab |
| | ric-manager has been started if applicable, G |
| | PU 6 Error using CUDA API cuInit Unable to in |
| | itialize CUDA library: ‘initialization error’ |
| | .; verify that the fabric-manager has been st |
| | arted if applicable, GPU 6 Error using CUDA A |
| | PI cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fabr |
| | ic-manager has been started if applicable, GP |
| | U 6 Error using CUDA API cuInit Unable to ini |
| | tialize CUDA library: ‘initialization error’. |
| | ; verify that the fabric-manager has been sta |
| | rted if applicable, GPU 6 Error using CUDA AP |
| | I cuInit Unable to initialize CUDA library: ’ |
| | initialization error’.; verify that the fabri |
| | c-manager has been started if applicable, GPU |
| | 6 Error using CUDA API cuInit Unable to init |
| | ialize CUDA library: 'initializat |
| Warning | GPU 7 Error using CUDA API cuInit Unable to i |
| | nitialize CUDA library: 'initialization error |
| | ‘.; verify that the fabric-manager has been s |
| | tarted if applicable, GPU 7 Error using CUDA |
| | API cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fab |
| | ric-manager has been started if applicable, G |
| | PU 7 Error using CUDA API cuInit Unable to in |
| | itialize CUDA library: ‘initialization error’ |
| | .; verify that the fabric-manager has been st |
| | arted if applicable, GPU 7 Error using CUDA A |
| | PI cuInit Unable to initialize CUDA library: |
| | ‘initialization error’.; verify that the fabr |
| | ic-manager has been started if applicable, GP |
| | U 7 Error using CUDA API cuInit Unable to ini |
| | tialize CUDA library: ‘initialization error’. |
| | ; verify that the fabric-manager has been sta |
| | rted if applicable, GPU 7 Error using CUDA AP |
| | I cuInit Unable to initialize CUDA library: ’ |
| | initialization error’.; verify that the fabri |
| | c-manager has been started if applicable, GPU |
| | 7 Error using CUDA API cuInit Unable to init |
| | ialize CUDA library: 'initializat |
| Diagnostic | Fail - All |
| Warning | GPU 0 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 0: ‘initialization error’, GPU 0 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 1: ‘initialization error’, GPU 0 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 2: ‘initi |
| | alization error’, GPU 0 API call cudaDeviceGe |
| | tByPCIBusId failed for GPU 3: ‘initialization |
| | error’, GPU 0 API call cudaDeviceGetByPCIBus |
| | Id failed for GPU 4: ‘initialization error’, |
| | GPU 0 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 5: ‘initialization error’, GPU 0 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 6: ‘initialization error’, GPU 0 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 7: ‘initi |
| | alization error’, GPU 0 There was an internal |
| | error during the test: ‘Failed to initialize |
| | the plugin.’, GPU 0 Error using CUDA API cud |
| | aDeviceGetByPCIBusId ‘initialization error’ f |
| | or GPU 0, bus ID = 00000000:07:00.0 |
| Warning | GPU 1 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 0: ‘initialization error’, GPU 1 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 1: ‘initialization error’, GPU 1 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 2: ‘initi |
| | alization error’, GPU 1 API call cudaDeviceGe |
| | tByPCIBusId failed for GPU 3: ‘initialization |
| | error’, GPU 1 API call cudaDeviceGetByPCIBus |
| | Id failed for GPU 4: ‘initialization error’, |
| | GPU 1 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 5: ‘initialization error’, GPU 1 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 6: ‘initialization error’, GPU 1 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 7: ‘initi |
| | alization error’, GPU 1 There was an internal |
| | error during the test: ‘Failed to initialize |
| | the plugin.’ |
| Warning | GPU 2 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 0: ‘initialization error’, GPU 2 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 1: ‘initialization error’, GPU 2 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 2: ‘initi |
| | alization error’, GPU 2 API call cudaDeviceGe |
| | tByPCIBusId failed for GPU 3: ‘initialization |
| | error’, GPU 2 API call cudaDeviceGetByPCIBus |
| | Id failed for GPU 4: ‘initialization error’, |
| | GPU 2 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 5: ‘initialization error’, GPU 2 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 6: ‘initialization error’, GPU 2 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 7: ‘initi |
| | alization error’, GPU 2 There was an internal |
| | error during the test: ‘Failed to initialize |
| | the plugin.’ |
| Warning | GPU 3 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 0: ‘initialization error’, GPU 3 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 1: ‘initialization error’, GPU 3 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 2: ‘initi |
| | alization error’, GPU 3 API call cudaDeviceGe |
| | tByPCIBusId failed for GPU 3: ‘initialization |
| | error’, GPU 3 API call cudaDeviceGetByPCIBus |
| | Id failed for GPU 4: ‘initialization error’, |
| | GPU 3 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 5: ‘initialization error’, GPU 3 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 6: ‘initialization error’, GPU 3 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 7: ‘initi |
| | alization error’, GPU 3 There was an internal |
| | error during the test: ‘Failed to initialize |
| | the plugin.’ |
| Warning | GPU 4 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 0: ‘initialization error’, GPU 4 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 1: ‘initialization error’, GPU 4 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 2: ‘initi |
| | alization error’, GPU 4 API call cudaDeviceGe |
| | tByPCIBusId failed for GPU 3: ‘initialization |
| | error’, GPU 4 API call cudaDeviceGetByPCIBus |
| | Id failed for GPU 4: ‘initialization error’, |
| | GPU 4 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 5: ‘initialization error’, GPU 4 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 6: ‘initialization error’, GPU 4 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 7: ‘initi |
| | alization error’, GPU 4 There was an internal |
| | error during the test: ‘Failed to initialize |
| | the plugin.’, GPU 4 Clocks are being throttl |
| | ed for GPU 4 because of clock throttling star |
| | ting 8.2 seconds into the test. clocks_thrott |
| | le_reason_hw_slowdown: either the temperature |
| | is too high or there is a power supply probl |
| | em (the power brake assertion has been trippe |
| | d). |
| Warning | GPU 5 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 0: ‘initialization error’, GPU 5 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 1: ‘initialization error’, GPU 5 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 2: ‘initi |
| | alization error’, GPU 5 API call cudaDeviceGe |
| | tByPCIBusId failed for GPU 3: ‘initialization |
| | error’, GPU 5 API call cudaDeviceGetByPCIBus |
| | Id failed for GPU 4: ‘initialization error’, |
| | GPU 5 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 5: ‘initialization error’, GPU 5 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 6: ‘initialization error’, GPU 5 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 7: ‘initi |
| | alization error’, GPU 5 There was an internal |
| | error during the test: ‘Failed to initialize |
| | the plugin.’ |
| Warning | GPU 6 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 0: ‘initialization error’, GPU 6 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 1: ‘initialization error’, GPU 6 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 2: ‘initi |
| | alization error’, GPU 6 API call cudaDeviceGe |
| | tByPCIBusId failed for GPU 3: ‘initialization |
| | error’, GPU 6 API call cudaDeviceGetByPCIBus |
| | Id failed for GPU 4: ‘initialization error’, |
| | GPU 6 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 5: ‘initialization error’, GPU 6 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 6: ‘initialization error’, GPU 6 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 7: ‘initi |
| | alization error’, GPU 6 There was an internal |
| | error during the test: ‘Failed to initialize |
| | the plugin.’ |
| Warning | GPU 7 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 0: ‘initialization error’, GPU 7 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 1: ‘initialization error’, GPU 7 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 2: ‘initi |
| | alization error’, GPU 7 API call cudaDeviceGe |
| | tByPCIBusId failed for GPU 3: ‘initialization |
| | error’, GPU 7 API call cudaDeviceGetByPCIBus |
| | Id failed for GPU 4: ‘initialization error’, |
| | GPU 7 API call cudaDeviceGetByPCIBusId failed |
| | for GPU 5: ‘initialization error’, GPU 7 API |
| | call cudaDeviceGetByPCIBusId failed for GPU |
| | 6: ‘initialization error’, GPU 7 API call cud |
| | aDeviceGetByPCIBusId failed for GPU 7: ‘initi |
| | alization error’, GPU 7 There was an internal |
| | error during the test: ‘Failed to initialize |
| | the plugin.’ |
+-----  Stress  ------------+------------------------------------------------+
| Memory Bandwidth | Fail - All |
| Warning | GPU 0 API call cuInit failed for GPU 0: ‘init |
| | ialization error; verify that the fabric-mana |
| | ger has been started if applicable’ |
| Warning | GPU 1 API call cuInit failed for GPU 0: ‘init |
| | ialization error; verify that the fabric-mana |
| | ger has been started if applicable’ |
| Warning | GPU 2 API call cuInit failed for GPU 0: ‘init |
| | ialization error; verify that the fabric-mana |
| | ger has been started if applicable’ |
| Warning | GPU 3 API call cuInit failed for GPU 0: ‘init |
| | ialization error; verify that the fabric-mana |
| | ger has been started if applicable’ |
| Warning | GPU 4 API call cuInit failed for GPU 0: ‘init |
| | ialization error; verify that the fabric-mana |
| | ger has been started if applicable’, GPU 4 Cl |
| | ocks are being throttled for GPU 4 because of |
| | clock throttling starting 8.2 seconds into t |
| | he test. clocks_throttle_reason_hw_slowdown: |
| | either the temperature is too high or there i |
| | s a power supply problem (the power brake ass |
| | ertion has been tripped). |
| Warning | GPU 5 API call cuInit failed for GPU 0: ‘init |
| | ialization error; verify that the fabric-mana |
| | ger has been started if applicable’ |
| Warning | GPU 6 API call cuInit failed for GPU 0: ‘init |
| | ialization error; verify that the fabric-mana |
| | ger has been started if applicable’ |
| Warning | GPU 7 API call cuInit failed for GPU 0: ‘init |
| | ialization error; verify that the fabric-mana |
| | ger has been started if applicable’ |
| EUD Test | Skip - All |
+---------------------------+------------------------------------------------+

dcgmi discovery -l

8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-80GB                                          |
|        | PCI Bus ID: 00000000:07:00.0                                         |
|        | Device UUID: GPU-9c245d1a-2c6f-a7d6-b91e-e18f6ba6476e                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA A100-SXM4-80GB                                          |
|        | PCI Bus ID: 00000000:0A:00.0                                         |
|        | Device UUID: GPU-e33addb3-e24d-e616-cbd4-309f29023f5e                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA A100-SXM4-80GB                                          |
|        | PCI Bus ID: 00000000:44:00.0                                         |
|        | Device UUID: GPU-0eb39c07-6f34-99f2-d9b8-a45ff0d18205                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA A100-SXM4-80GB                                          |
|        | PCI Bus ID: 00000000:4A:00.0                                         |
|        | Device UUID: GPU-96afb7d3-7126-4335-2142-dc31b3c6c300                |
+--------+----------------------------------------------------------------------+
| 4      | Name: NVIDIA A100-SXM4-80GB                                          |
|        | PCI Bus ID: 00000000:84:00.0                                         |
|        | Device UUID: GPU-61f669d9-b2ca-6bb4-b89e-b705e7f697a9                |
+--------+----------------------------------------------------------------------+
| 5      | Name: NVIDIA A100-SXM4-80GB                                          |
|        | PCI Bus ID: 00000000:8A:00.0                                         |
|        | Device UUID: GPU-9887433e-1b65-69bc-7cfa-ffa18a6a614b                |
+--------+----------------------------------------------------------------------+
| 6      | Name: NVIDIA A100-SXM4-80GB                                          |
|        | PCI Bus ID: 00000000:C0:00.0                                         |
|        | Device UUID: GPU-ff993969-54b7-7ebd-aaf3-648657faab95                |
+--------+----------------------------------------------------------------------+
| 7      | Name: NVIDIA A100-SXM4-80GB                                          |
|        | PCI Bus ID: 00000000:C3:00.0                                         |
|        | Device UUID: GPU-c6f5c562-fd22-3ccd-333e-4d5f1e4d8828                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

dcgmi nvlink -s

+----------------------+
|  NvLink Link Status  |
+----------------------+
GPUs:
gpuId 0:
U U U U U U U U U U U U _ _ _ _ _ _
gpuId 1:
U U U U U U U U U U U U _ _ _ _ _ _
gpuId 2:
U U U U U U U U U U U U _ _ _ _ _ _
gpuId 3:
U U U U U U U U U U U U _ _ _ _ _ _
gpuId 4:
D D D D D D D D D D D D _ _ _ _ _ _
gpuId 5:
U U U U U U U U U U U U _ _ _ _ _ _
gpuId 6:
U U U U U U U U U U U U _ _ _ _ _ _
gpuId 7:
U U U U U U U U U U U U _ _ _ _ _ _
NvSwitches:
No NvSwitches found.

Key: Up=U, Down=D, Disabled=X, Not Supported=_
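For anyone who wants to scan this kind of output automatically, here is a minimal Python sketch that picks out GPUs reporting any link in the Down state. It assumes the text layout shown above (a `gpuId N:` line followed by one line of per-link states); the function name `gpus_with_down_links` and the shortened sample are made up for illustration and are not part of DCGM.

```python
def gpus_with_down_links(status_text):
    """Return {gpu_id: number_of_down_links} for GPUs with at least one 'D' link.

    Assumes the `dcgmi nvlink -s` text layout pasted above: a "gpuId N:" line
    followed by a single line of link states (U, D, X or _).
    """
    down = {}
    current_gpu = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("gpuId"):
            # Header such as "gpuId 4:" -> remember which GPU the next line describes.
            current_gpu = int(line.split()[1].rstrip(":"))
        elif current_gpu is not None and line:
            # The state line for the GPU named on the previous line.
            count = line.split().count("D")
            if count:
                down[current_gpu] = count
            current_gpu = None
    return down


# Shortened sample copied from the output above (GPUs 3 to 5 only).
SAMPLE = """\
GPUs:
gpuId 3:
U U U U U U U U U U U U _ _ _ _ _ _
gpuId 4:
D D D D D D D D D D D D _ _ _ _ _ _
gpuId 5:
U U U U U U U U U U U U _ _ _ _ _ _
NvSwitches:
No NvSwitches found.
"""

print(gpus_with_down_links(SAMPLE))  # {4: 12}
```

On the output above this flags only GPU 4, whose twelve supported links are all down.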

Log messages from fabricmanager.log:

[Apr 21 2023 18:04:24] [ERROR] [tid 20916] NVLink initialization failed for NVSwitch PCI bus id: 00000000:D6:00.0DeviceName:nvswitch4 PhysicalId:12 NVLinkIndex:31
[Apr 21 2023 18:04:24] [ERROR] [tid 20916] NVLink initialization failed for fid:0 GPU PCI bus id:00000000:84:00.0 enumIndex:6 NVLinkIndex 10
[Apr 21 2023 18:04:24] [INFO] [tid 20916] All Links NVLink trunk (NVSwitch to NVSwitch) connections trained to high speed
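As a rough aid for reading these messages, the Python sketch below extracts the PCI bus IDs named in the "NVLink initialization failed" lines. The regular expression is written against the two message shapes pasted above only; it is an assumption that other Fabric Manager versions word them the same way, and `failed_endpoints` is a hypothetical helper name.

```python
import re

# Matches the two failure lines quoted above; other log formats are not covered.
FAILURE = re.compile(
    r"NVLink initialization failed for .*?"
    r"(?P<kind>NVSwitch|GPU) PCI bus id:\s*"
    r"(?P<bus>[0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
)


def failed_endpoints(log_text):
    """Return a list of (device kind, PCI bus id) pairs for failed NVLink inits."""
    return [(m.group("kind"), m.group("bus")) for m in FAILURE.finditer(log_text)]


# The three log lines from the post, copied verbatim.
SAMPLE_LOG = """\
[Apr 21 2023 18:04:24] [ERROR] [tid 20916] NVLink initialization failed for NVSwitch PCI bus id: 00000000:D6:00.0DeviceName:nvswitch4 PhysicalId:12 NVLinkIndex:31
[Apr 21 2023 18:04:24] [ERROR] [tid 20916] NVLink initialization failed for fid:0 GPU PCI bus id:00000000:84:00.0 enumIndex:6 NVLinkIndex 10
[Apr 21 2023 18:04:24] [INFO] [tid 20916] All Links NVLink trunk (NVSwitch to NVSwitch) connections trained to high speed
"""

for kind, bus in failed_endpoints(SAMPLE_LOG):
    print(kind, bus)
# NVSwitch 00000000:D6:00.0
# GPU 00000000:84:00.0
```

The GPU bus ID 00000000:84:00.0 is GPU 4 in the nvidia-smi listing above, the same GPU whose links show as Down in the NvLink status.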

Your fabric manager is not installed or not installed correctly. If you do a proper OS load of your system, including fabric manager, and you still have errors in the fabric manager log, you should address it with your system vendor. Alternatively, if you have purchased support via e.g. NVIDIA AI Enterprise, you can contact NVIDIA Enterprise Support to request assistance.

Thanks Robert. We got the server from Supermicro, but they don’t offer support on weekends. This is a brand-new server, so I think I just missed something. Can you give me a hint? This is urgent, which is why I am asking for help here. I tried to start the fabric manager, but it failed:

nv-fabricmanager[20916]: NVLink initialization failed for NVSwitch PCI bus id: 00000000:D6:00.0DeviceName:nvswitch4 PhysicalId:12 NVLinkIndex:31
nv-fabricmanager[20916]: NVLink initialization failed for fid:0 GPU PCI bus id:00000000:84:00.0 enumIndex:6 NVLinkIndex 10
nv-fabricmanager[20916]: Successfully configured all the available GPUs and NVSwitches to route NVLink traffic.

I don’t know the history of your machine up to this point. My suggestion would be to reload the OS, then load the NVIDIA GPU driver using a package manager method (for example, install CUDA), then load the fabric manager using the instructions in that guide I linked, then start the fabric manager using the instructions in that guide, then check things again.

If you don’t like the idea of reloading the OS (it’s difficult to give precise steps if you don’t know the history of the machine), you can try removing all NVIDIA packages from your machine, then follow the steps after loading the OS.

If you go through those steps and end up in the same place, then I think it’s possible there is a hardware issue with your server, in which case SMC is the right entity to get involved.
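After reinstalling, one way to "check things again" is to confirm that the relevant services are actually running. The Python sketch below does this through `systemctl is-active`. It assumes a systemd-based system and the unit names `nvidia-fabricmanager` and `nvidia-dcgm`; adjust them if your packages install different units. The helper names are hypothetical.

```python
import subprocess

# Assumed unit names; verify them with `systemctl list-unit-files` on your system.
SERVICES = ("nvidia-fabricmanager", "nvidia-dcgm")


def systemctl_is_active(unit):
    """Return systemd's state string for a unit, e.g. 'active', 'inactive', 'failed'."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        # systemctl is not available on this machine.
        return "unknown"
    return result.stdout.strip() or "unknown"


def service_report(units=SERVICES, query=systemctl_is_active):
    """Map each unit name to its reported state; `query` can be swapped for testing."""
    return {unit: query(unit) for unit in units}


def all_active(report):
    """True only if every unit in the report is in the 'active' state."""
    return all(state == "active" for state in report.values())


if __name__ == "__main__":
    report = service_report()
    for unit, state in report.items():
        print(f"{unit}: {state}")
    print("all services active" if all_active(report) else "some services are not active")
```

An "active" state only shows that the service started; as the log above shows, the fabric manager can still report per-link NVLink failures while running, so the log is worth reading either way.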

Thanks for the suggestions! I removed all NVIDIA packages and reinstalled them. After I enabled the nvidia-dcgm service, I could see the six NVSwitches:

6 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 12        |
| 11        |
| 10        |
| 9         |
| 8         |
| 13        |
+-----------+

However, it looks to me like one of the six (ID 12) has a problem:

dcgmi policy -g 1 --get -v
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information                                                           |
| Switch ID: 12                                                                |
+=============================+================================================+
| Violation conditions        | Double-bit ECC errors                          |
|                             | PCI errors and replays                         |
|                             | Max retired pages threshold - 0                |
|                             | Max temperature threshold - 0                  |
|                             | XID error detected.                            |
| Isolation mode              |                                                |
| Action on violation         | None                                           |
| Validation after action     |                                                |
| Validation failure action   |                                                |
+-----------------------------+------------------------------------------------+
+-----------------------------+------------------------------------------------+
| Policy Information                                                           |
| Switch ID: 11                                                                |
+=============================+================================================+
| Violation conditions        | None                                           |
| Isolation mode              | Automatic                                      |
| Action on violation         | None                                           |
| Validation after action     | None                                           |
| Validation failure action   | None                                           |
+-----------------------------+------------------------------------------------+
+-----------------------------+------------------------------------------------+
| Policy Information                                                           |
| Switch ID: 10                                                                |
+=============================+================================================+
| Violation conditions        | None                                           |
| Isolation mode              | Automatic                                      |
| Action on violation         | None                                           |
| Validation after action     | None                                           |
| Validation failure action   | None                                           |
+-----------------------------+------------------------------------------------+
+-----------------------------+------------------------------------------------+

<skip other nvswitch, because they looked good>

Is it possible to disable only the NVSwitch with ID 12 and leave the other five NVSwitches enabled? I just want to find out whether we really have a hardware problem on that NVSwitch, and whether disabling it would allow CUDA applications to run.

Thanks!!

I had a problem with a 4x A100 box where both the host CUDA samples and self-contained PyTorch binaries in Docker returned cudaGetDeviceCount errors. I just figured out the root cause: MIG mode was enabled on the A100s (apparently by default on my system), which made cudaGetDeviceCount() error out.

I solved it by running sudo nvidia-smi -mig 0 to disable MIG mode, rebooting because the change was pending (perhaps due to loginctl linger settings), and then running sudo nvidia-smi -mig 0 again.
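To see whether MIG is still enabled, or whether a change is only pending, the current and pending modes can be queried with `nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv,noheader`. Here is a small Python sketch that parses that CSV output. The sample text is invented for illustration, not captured from a real machine, and the helper names are hypothetical.

```python
import csv
import io

# Command whose output the parser below expects (run it yourself and pass in the text).
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,mig.mode.current,mig.mode.pending",
    "--format=csv,noheader",
]


def parse_mig_modes(csv_text):
    """Return {gpu_index: (current_mode, pending_mode)} from the CSV text."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return {int(r[0]): (r[1].strip(), r[2].strip()) for r in rows if r}


def gpus_needing_attention(modes):
    """GPU indices where MIG is currently enabled or a mode change is still pending."""
    return sorted(
        idx for idx, (current, pending) in modes.items()
        if current == "Enabled" or pending != current
    )


# Invented sample: GPU 0 still has MIG enabled with a pending disable, GPU 1 is clean.
SAMPLE = "0, Enabled, Disabled\n1, Disabled, Disabled\n"

modes = parse_mig_modes(SAMPLE)
print(gpus_needing_attention(modes))  # [0]
```

A GPU whose current mode is Enabled while its pending mode is Disabled matches the situation described above, where the change did not take effect until after a reboot.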


I created the account just to say thank you, @ZhanwenChen. Your note saved the day (actually multiple days so far)!