BlueField-2 in BlueField-X mode does not see A100 GPU

How does one get the A100 GPU to show up on the DPU?

### [6.1.1. Configuring Operation Mode](https://docs.nvidia.com/doca/sdk/installation-guide-for-linux/index.html#configuring-operation-mode)

There are two modes that the NVIDIA Converged Accelerator may operate in:

* Standard mode (default) – the BlueField DPU and the GPU operate separately
* BlueField-X mode – the GPU is exposed to the DPU and is no longer visible on the host
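
A minimal sketch of switching between these modes with mlxconfig, run against the DPU's config device (the mst path below matches the one shown later in this thread but may differ on your system); a full host power cycle is required afterwards:

```
# Start the Mellanox Software Tools service so the config device is available
sudo mst start

# Query the current downstream port owner
sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q PCI_DOWNSTREAM_PORT_OWNER[4]

# Set BlueField-X mode (EMBEDDED_CPU = 0xF); DEVICE_DEFAULT (0x0) returns to Standard mode
sudo mlxconfig -d /dev/mst/mt41686_pciconf0 s PCI_DOWNSTREAM_PORT_OWNER[4]=0xF

# Power cycle the host for the change to take effect
```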

In our case, the DPU's downstream port owner is set to "Embedded CPU", which according to NVIDIA means it is in BlueField-X mode:

```
PCI_DOWNSTREAM_PORT_OWNER[4]        EMBEDDED_CPU(15)
```

But the GPU does not show up.

I installed the CUDA drivers on a BlueField-2 DPU. However, it does not see the A100 GPU on its PCIe bus.
The BlueField-2 shows:

```
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
```

The A100 DOES show up on the CPU host:

```
admin@GPUInit:~> nvidia-smi
Thu Feb  9 17:43:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   37C    P0    38W / 250W |      0MiB / 40960MiB |      4%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

```
Device #1:
----------

Device type:    BlueField2
Name:           MBF2H332A-AECO_Ax_Bx
Description:    BlueField-2 P-Series DPU 25GbE Dual-Port SFP56; PCIe Gen4 x8; Crypto and Secure Boot Enabled; 16GB on-board DDR; 1GbE OOB management; HHHL
Device:         /dev/mst/mt41686_pciconf0

Configurations:                              Next Boot
         PCI_DOWNSTREAM_PORT_OWNER[4]        EMBEDDED_CPU(15)
ubuntu@localhost:~$
ubuntu@localhost:~$ whereis nvidia
nvidia: /usr/lib/nvidia /usr/share/nvidia /usr/src/nvidia-525.85.12/nvidia
```

The driver was installed from the runfile.
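
A few generic Linux checks that may help show whether the runfile-built nvidia module actually loaded on the Arm cores (assuming a standard Ubuntu BFB image):

```
# Is the nvidia kernel module loaded?
lsmod | grep nvidia

# Any driver messages (e.g. no device found, or a secure-boot/signature rejection)?
sudo dmesg | grep -i -e nvidia -e nvrm

# Does the DPU see a GPU on its PCIe bus at all?
lspci | grep -i nvidia
```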

What BFB image version have you installed on the DPU?

(We recommend our latest 1.5.1).

Have you power cycled the host after the BFB image installation/FW upgrade and after using mlxconfig to change the mode from Standard to BlueField-X?

Did you verify the GPU ownership from lspci?

i.e.:

```
root@host:~# lspci | grep -i nv
None

ubuntu@dpu:~$ lspci | grep -i nv
06:00.0 3D controller: NVIDIA Corporation GA20B8 (rev a1)
```

Check whether UEFI secure boot is enabled by running "mokutil --sb-state" from the Arm side. If it is enabled, disable it via these instructions: https://docs.nvidia.com/networking/display/BlueFieldDPUOSLatest/UEFI+Secure+Boot#UEFISecureBoot-DisablingUEFISecureBoot. Hopefully, the NVIDIA driver will load after disabling UEFI secure boot.
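
For illustration, the secure boot check looks like this (the exact output wording may vary with the mokutil version):

```
# On the DPU's Arm OS: report whether UEFI secure boot is enforced
ubuntu@dpu:~$ mokutil --sb-state
SecureBoot enabled
```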

Lastly, if our deployment documentation has been followed and the issue persists, I would suggest opening an NVIDIA support case to troubleshoot further.

Thank you for your note. After running the suggested lspci commands, I realized that GPU ownership was remaining with the host. I then realized that UEFI secure boot would need to be disabled to proceed. At the same time, I began to question whether a BlueField-2 needs to be in a converged accelerator to take ownership of the GPU, so I paused this experiment.

Does a discrete BlueField-2 (as opposed to one in a converged accelerator) have the ability to go into BlueField-X mode and take control of the GPU? If so, I may go back and disable UEFI secure boot as suggested and proceed with this experiment. (The BlueField-2 I am using does report that it is in BlueField-X mode following the use of mlxconfig.) Thank you, Brandt

I believe I spoke too fast and mixed up the NVIDIA BlueField-2 DPU and the Converged Accelerator, i.e.:

MBF2H332A-AECOT
NVIDIA BlueField-2 P-Series DPU 25GbE Dual-Port SFP56, PCIe Gen4 x8, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management

&

Product Name: "ROY BlueField-2 + GA100 PCIe Gen4 x8; two 100GbE/EDR QSFP28 ports, FHFL"; Part number: 699-21004-0230-300.
The A100X combines an A100 Tensor Core GPU with an NVIDIA BlueField-2 data processing unit on a single module.

These are two different products (the converged accelerator vs. the standalone BF-2 adapter).

Does a discrete BlueField-2 (as opposed to one in a converged accelerator) have the ability to go into BlueField-X mode and take control of the GPU?

I would suggest opening a support case so we can validate internally whether there are limitations of the functionality/features, or no support at all, for GPU control from a non-converged BlueField-2.