Set up H100 with KVM to use vGPU

Hello everyone,

I am stuck on the installation and configuration of an H100 with KVM (OpenNebula).

Goal: a server with one H100 (PCIe) should have the GPU divided into time-sliced vGPUs that are passed through to multiple virtual machines (KVM, OpenNebula).

I have already tried Debian 12 and Ubuntu 24.04 and hit the same problem on both operating systems, so I assume I am doing something wrong.

I worked through NVIDIA’s instructions from top to bottom; the first error appears when I run sriov-manage (details further down).
Instructions I followed: Virtual GPU Software User Guide - NVIDIA Docs

What I have done so far:
Installed: nvidia-vgpu-ubuntu-aie-550_550.144.02_amd64.deb

root@hgpu02:/# nvidia-smi
Wed Feb  5 13:12:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.02             Driver Version: 550.144.02     CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:20:00.0 Off |                    0 |
| N/A   32C    P0             50W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I disabled nouveau, rebuilt the initramfs and rebooted:

root@hgpu02:/# cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

update-initramfs -u 
reboot

root@hgpu02:/# lsmod | grep nouveau
root@hgpu02:/# lsmod | grep vfio
nvidia_vgpu_vfio      114688  10
nvidia               8839168  4 nvidia_vgpu_vfio
vfio_pci_core          86016  1 nvidia_vgpu_vfio
mdev                   24576  1 nvidia_vgpu_vfio
vfio_iommu_type1       49152  0
vfio                   69632  3 vfio_pci_core,nvidia_vgpu_vfio,vfio_iommu_type1
iommufd                98304  1 vfio
kvm                  1404928  2 nvidia_vgpu_vfio,kvm_intel
irqbypass              12288  3 vfio_pci_core,nvidia_vgpu_vfio,kvm

The nvidia-vgpu-mgr.service is installed and running, and its log shows no error messages:

root@hgpu02:/# systemctl status nvidia-vgpu-mgr.service
● nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon
     Loaded: loaded (/usr/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; preset: enabled)
     Active: active (running) since Wed 2025-02-05 13:09:47 UTC; 7min ago
   Main PID: 2168 (nvidia-vgpu-mgr)
      Tasks: 1 (limit: 618536)
     Memory: 444.0K (peak: 696.0K)
        CPU: 3.792s
     CGroup: /system.slice/nvidia-vgpu-mgr.service
             └─2168 /usr/bin/nvidia-vgpu-mgr

Feb 05 13:09:47 hgpu02 systemd[1]: Starting nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon...
Feb 05 13:09:47 hgpu02 systemd[1]: Started nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon.
Feb 05 13:09:51 hgpu02 nvidia-vgpu-mgr[2168]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started

The GPU as seen by lspci and libvirt:

root@hgpu02:/# lspci | grep NVIDIA
20:00.0 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)

root@hgpu02:/# virsh nodedev-list --cap pci| grep 20_00_0
pci_0000_20_00_0

root@hgpu02:/# virsh nodedev-dumpxml pci_0000_20_00_0| egrep 'domain|bus|slot|function'
    <domain>0</domain>
    <bus>32</bus>
    <slot>0</slot>
    <function>0</function>
    <capability type='virt_functions' maxCount='32'/>
      <address domain='0x0000' bus='0x20' slot='0x00' function='0x0'/>
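
Note that the libvirt XML prints the bus as decimal 32, while lspci and sysfs use hexadecimal 0x20. A minimal sketch of that conversion, where `pci_addr` is a hypothetical helper name that is not part of any NVIDIA or libvirt tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a sysfs-style PCI address (domain:bus:slot.function)
# from the decimal values that `virsh nodedev-dumpxml` prints.
pci_addr() {
    local domain=$1 bus=$2 slot=$3 func=$4
    # Domain is 4 hex digits, bus and slot 2 hex digits, function 1 hex digit.
    printf '%04x:%02x:%02x.%x\n' "$domain" "$bus" "$slot" "$func"
}
```

With the values from the XML above, `pci_addr 0 32 0 0` prints `0000:20:00.0`, which matches the address reported by lspci.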

And now the first error I receive when running sriov-manage:

root@hgpu02:/# /usr/lib/nvidia/sriov-manage -e 00:20:0000.0
Enabling VFs on 0000:20:00.0
/usr/lib/nvidia/sriov-manage: line 90: echo: write error: Operation not permitted

Lines 87-92 of the sriov-manage script:

bind_to_nvidia_driver()
{
    local gpu=$1
    echo "$gpu" > /sys/bus/pci/drivers/nvidia/bind
    resume_services
}

I cannot understand why this error appears. To rule out a file-permission problem, I temporarily set the permissions of the “nvidia” folder and the “bind” file to 777. That did not help and the error message stayed the same, so I reverted the change. This is what the contents of the “nvidia” folder look like now:

root@hgpu02:/sys/bus/pci/drivers/nvidia# ls -la
total 0
drwxr-xr-x  2 root root    0 Feb  5 13:09 .
drwxr-xr-x 39 root root    0 Feb  5 13:09 ..
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:00.2 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:00.2
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:00.3 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:00.3
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:00.4 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:00.4
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:00.5 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:00.5
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:00.6 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:00.6
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:00.7 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:00.7
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:01.0 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:01.0
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:01.1 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:01.1
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:01.2 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:01.2
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:01.3 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:01.3
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:01.4 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:01.4
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:01.5 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:01.5
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:01.6 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:01.6
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:01.7 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:01.7
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:02.0 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:02.0
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:02.1 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:02.1
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:02.2 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:02.2
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:02.3 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:02.3
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:02.4 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:02.4
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:02.5 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:02.5
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:02.6 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:02.6
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:02.7 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:02.7
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:03.0 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:03.0
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:03.1 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:03.1
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:03.2 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:03.2
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:03.3 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:03.3
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:03.4 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:03.4
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:03.5 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:03.5
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:03.6 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:03.6
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:03.7 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:03.7
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:04.0 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:04.0
lrwxrwxrwx  1 root root    0 Feb  5 13:20 0000:20:04.1 -> ../../../../devices/pci0000:1f/0000:1f:01.0/0000:20:04.1
--w-------  1 root root 4096 Feb  5 13:20 bind
lrwxrwxrwx  1 root root    0 Feb  5 13:21 module -> ../../../../module/nvidia
--w-------  1 root root 4096 Feb  5 13:21 new_id
--w-------  1 root root 4096 Feb  5 13:21 remove_id
--w-------  1 root root 4096 Feb  5 13:21 uevent
--w-------  1 root root 4096 Feb  5 13:20 unbind
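
To cross-check the listing above, the driver binding and the number of enabled VFs can also be read per device from sysfs. This is only a diagnostic sketch: `pci_driver`, `vf_count` and the `SYSFS_ROOT` override are hypothetical names introduced here, while the `driver` link and the `sriov_numvfs` attribute are the standard Linux PCI sysfs entries.

```shell
#!/usr/bin/env bash
# SYSFS_ROOT is a hypothetical override so the functions can be pointed at a
# test directory; on a real host it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the name of the kernel driver bound to a PCI function, or "none".
pci_driver() {
    local dev="$SYSFS_ROOT/bus/pci/devices/$1"
    if [ -L "$dev/driver" ]; then
        basename "$(readlink "$dev/driver")"
    else
        echo "none"
    fi
}

# Print how many VFs are currently enabled on a physical function,
# or "n/a" if the device has no sriov_numvfs attribute.
vf_count() {
    local dev="$SYSFS_ROOT/bus/pci/devices/$1"
    if [ -r "$dev/sriov_numvfs" ]; then
        cat "$dev/sriov_numvfs"
    else
        echo "n/a"
    fi
}
```

Running `pci_driver 0000:20:00.0` and `vf_count 0000:20:00.0` before and after sriov-manage would show whether the physical function was already bound to the nvidia driver at the moment the script tried to bind it again.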

The virtual functions (virtfn links) of the GPU and the matching lspci entries:

root@hgpu02:/sys/bus/pci/drivers/nvidia# ls -l /sys/bus/pci/devices/0000:20:00.0/ | grep virtfn
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn0 -> ../0000:20:00.2
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn1 -> ../0000:20:00.3
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn10 -> ../0000:20:01.4
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn11 -> ../0000:20:01.5
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn12 -> ../0000:20:01.6
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn13 -> ../0000:20:01.7
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn14 -> ../0000:20:02.0
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn15 -> ../0000:20:02.1
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn16 -> ../0000:20:02.2
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn17 -> ../0000:20:02.3
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn18 -> ../0000:20:02.4
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn19 -> ../0000:20:02.5
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn2 -> ../0000:20:00.4
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn20 -> ../0000:20:02.6
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn21 -> ../0000:20:02.7
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn22 -> ../0000:20:03.0
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn23 -> ../0000:20:03.1
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn24 -> ../0000:20:03.2
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn25 -> ../0000:20:03.3
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn26 -> ../0000:20:03.4
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn27 -> ../0000:20:03.5
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn28 -> ../0000:20:03.6
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn29 -> ../0000:20:03.7
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn3 -> ../0000:20:00.5
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn30 -> ../0000:20:04.0
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn31 -> ../0000:20:04.1
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn4 -> ../0000:20:00.6
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn5 -> ../0000:20:00.7
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn6 -> ../0000:20:01.0
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn7 -> ../0000:20:01.1
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn8 -> ../0000:20:01.2
lrwxrwxrwx 1 root root            0 Feb  5 13:20 virtfn9 -> ../0000:20:01.3
root@hgpu02:~# lspci -v | grep -i 20\:0
20:00.0 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:00.2 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:00.3 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:00.4 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:00.5 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:00.6 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:00.7 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:01.0 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:01.1 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:01.2 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:01.3 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:01.4 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:01.5 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:01.6 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:01.7 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:02.0 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:02.1 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:02.2 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:02.3 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:02.4 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:02.5 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:02.6 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:02.7 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:03.0 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:03.1 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:03.2 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:03.3 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:03.4 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:03.5 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:03.6 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:03.7 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:04.0 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)
20:04.1 3D controller: NVIDIA Corporation GH100 [H100 PCIe] (rev a1)

I am not sure whether the error message “/usr/lib/nvidia/sriov-manage: line 90: echo: write error: Operation not permitted” is critical, so I continued.

In the instructions I am now at step 2.10.3, “Creating an NVIDIA vGPU on a Linux with KVM Hypervisor”.
That section has a small table describing the next steps. I continued with the variant for GPUs that support SR-IOV and use mdev as the VFIO framework.
The problem is that this folder is empty:

root@hgpu02:/sys/class/mdev_bus# ls -al
total 0
drwxr-xr-x  2 root root 0 Feb  5 13:09 .
drwxr-xr-x 80 root root 0 Feb  5 13:09 .. 
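
Since the table in the guide distinguishes between an mdev-based and a vendor-specific VFIO framework, one way to narrow this down might be to look at each VF directly. The sketch below is an assumption-laden diagnostic, not a confirmed fix: `vgpu_interfaces` and `SYSFS_ROOT` are hypothetical names, and the `nvidia/creatable_vgpu_types` path is assumed from the vendor-specific VFIO part of the guide and has not been verified on this host.

```shell
#!/usr/bin/env bash
# SYSFS_ROOT is a hypothetical override for testing; on a real host it is /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# For every VF of the given physical function, print which vGPU interface
# (if any) is visible in sysfs.
vgpu_interfaces() {
    local pf="$SYSFS_ROOT/bus/pci/devices/$1" vf name
    for vf in "$pf"/virtfn*; do
        [ -e "$vf" ] || continue            # no VFs enabled: glob did not match
        name=$(basename "$(readlink -f "$vf")")
        if [ -d "$vf/mdev_supported_types" ]; then
            echo "$name mdev"               # mdev-based framework
        elif [ -e "$vf/nvidia/creatable_vgpu_types" ]; then
            echo "$name vendor-vfio"        # assumed vendor-specific layout
        else
            echo "$name none"
        fi
    done
}
```

Calling `vgpu_interfaces 0000:20:00.0` prints one line per VF. If every line ends in `none`, neither interface is exposed and the sriov-manage error is probably worth chasing first.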

Now I don’t know how to continue.

Some more logs:

root@hgpu02:~# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Wed Feb  5 13:38:06 2025
Driver Version                            : 550.144.02
CUDA Version                              : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU           : Supported

Attached GPUs                             : 1
GPU 00000000:20:00.0
    Product Name                          : NVIDIA H100 PCIe
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : N/A
    vGPU Device Capability
        Fractional Multi-vGPU             : Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Supported
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : censored
    GPU UUID                              : censored
    Minor Number                          : 0
    VBIOS Version                         : 96.00.74.00.1C
    MultiGPU Board                        : No
    Board ID                              : 0x2000
    Board Part Number                     : censored
    GPU Part Number                       : censored
    FRU Part Number                       : N/A
    Module ID                             : 4
    Inforom Version
        Image Version                     : 1010.0200.00.02
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : Disabled
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
        vGPU Heterogeneous Mode           : Disabled
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    GSP Firmware Version                  : 550.144.02
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x20
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x2
        Device Id                         : 0x233110DE
        Bus Id                            : 00000000:20:00.0
        Sub System Id                     : 0x162610DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 717 KB/s
        Rx Throughput                     : 605 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : Disabled
    FB Memory Usage
        Total                             : 81559 MiB
        Reserved                          : 993 MiB
        Used                              : 0 MiB
        Free                              : 80567 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
            SRAM Threshold Exceeded       : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : 0
            SRAM SM                       : 0
            SRAM Microcontroller          : 0
            SRAM PCIE                     : 0
            SRAM Other                    : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 1280 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 33 C
        GPU T.Limit Temp                  : 44 C
        GPU Shutdown T.Limit Temp         : -8 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 52 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Power Draw                        : 50.79 W
        Current Power Limit               : 350.00 W
        Requested Power Limit             : 350.00 W
        Default Power Limit               : 310.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 350.00 W
    GPU Memory Power Readings
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 345 MHz
        SM                                : 345 MHz
        Memory                            : 1593 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 1755 MHz
        Memory                            : 1593 MHz
    Default Applications Clocks
        Graphics                          : 1755 MHz
        Memory                            : 1593 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1755 MHz
        SM                                : 1755 MHz
        Memory                            : 1593 MHz
        Video                             : 1470 MHz
    Max Customer Boost Clocks
        Graphics                          : 1755 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 675.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes                             : None

I’ve been working on this problem for a few days now and haven’t really made any progress :(
I hope there are smart people out there who can help me.

Many greetings!