It seems to fail when calling the function doca_mmap_start. My problem doesn’t seem to be related to the resizable BAR (I already have that enabled). Has anyone experienced this and has any idea how to solve it?
@emelao would you please run the commands below in the container and share the outputs with us?
cd /opt/nvidia/cuBB/cuPHY/util/cuBB_system_checks
sudo -E python3 ./cuBB_system_checks.py
-----General--------------------------------------
Hostname : supermicro
IP address : 10.30.1.18
Linux distro : "Ubuntu 22.04.4 LTS"
Linux kernel version : 6.2.0-1012-nvidia
-----Kernel Command Line--------------------------
Audit subsystem : audit=0
Clock source : N/A
HugePage count : hugepages=32
HugePage size : hugepagesz=512M
CPU idle time management : idle=poll
Max Intel C-state : N/A
Intel IOMMU : N/A
IOMMU : N/A
Isolated CPUs : N/A
Corrected errors : N/A
Adaptive-tick CPUs : nohz_full=4-47
Soft-lockup detector disable : nosoftlockup
Max processor C-state : processor.max_cstate=0
RCU callback polling : rcu_nocb_poll
No-RCU-callback CPUs : rcu_nocbs=4-47
TSC stability checks : tsc=reliable
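The kernel command-line parameters above can be sanity-checked with a small shell loop. This is only a sketch: the sample string below is assembled from the values reported above, and on a live system you would read /proc/cmdline instead.

```shell
# Sample command line copied from the report above; on a live system,
# use: cmdline=$(cat /proc/cmdline)
cmdline="audit=0 hugepages=32 hugepagesz=512M idle=poll nohz_full=4-47 nosoftlockup processor.max_cstate=0 rcu_nocb_poll rcu_nocbs=4-47 tsc=reliable"
for p in hugepagesz=512M nohz_full=4-47 rcu_nocbs=4-47 tsc=reliable; do
  case " $cmdline " in
    *" $p "*) echo "OK: $p" ;;
    *)        echo "MISSING: $p" ;;
  esac
done
```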
-----CPU------------------------------------------
CPU cores : 80
Thread(s) per CPU core : 1
CPU MHz : N/A
CPU sockets : 1
-----Environment variables------------------------
CUDA_DEVICE_MAX_CONNECTIONS : 8
cuBB_SDK : /opt/nvidia/cuBB
-----Memory---------------------------------------
HugePage count : 32
Free HugePages : 26
HugePage size : 524288 kB
Shared memory size : 128G
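For reference, the hugepage figures in this section multiply out to the total reserved memory (a small sketch; both values are copied from the report above):

```shell
# Total hugepage reservation: 32 pages of 524288 kB each.
count=32
size_kb=524288
echo "$(( count * size_kb / 1024 / 1024 )) GiB reserved"
```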
-----Nvidia GPUs----------------------------------
GPU driver version : 535.129.03
CUDA version : 12.2
GPU0
GPU product name : NVIDIA A100 80GB PCIe
GPU persistence mode : Disabled
Current GPU temperature : 44 C
GPU clock frequency : 1410 MHz
Max GPU clock frequency : 1410 MHz
GPU PCIe bus id : 00000000:01:00.0
-----GPUDirect topology---------------------------
GPU0 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS SYS 0-79 0 N/A
NIC0 SYS X PIX SYS SYS
NIC1 SYS PIX X SYS SYS
NIC2 SYS SYS SYS X PIX
NIC3 SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
-----Mellanox NICs--------------------------------
-----Mellanox NIC Interfaces----------------------
-----Linux PTP------------------------------------
-----Software Packages----------------------------
cmake /usr/local/bin : 3.25.1
docker : N/A
gcc /usr/bin : 11.4.0
git-lfs /usr/bin : 3.0.2
MOFED : N/A
meson /usr/bin : 0.61.2
ninja /usr/bin : 1.10.2
ptp4l : N/A
-----Loaded Kernel Modules------------------------
GDRCopy : gdrdrv
GPUDirect RDMA : nvidia_peermem
Nvidia : nvidia
-----Non-persistent settings----------------------
VM swappiness : vm.swappiness = 0
VM zone reclaim mode : vm.zone_reclaim_mode = 0
-----Docker images--------------------------------
And just to be clear: I’m not using all of the suggested hardware. I’m using a ConnectX6-LX, an ARM processor, and an A100 GPU. I was following the Grace Hopper procedure (since it also uses an ARM processor), but trying to adapt it to my hardware (such as the ConnectX6-LX).
Thanks for sharing the detailed info. Aerial 24-1 hasn’t been tested on Arm-core servers with an A100 other than the Grace Hopper server, so we’d like to hear more about how your setup goes. Can you share the lscpu info for your server?
Regarding CX6-LX vs. CX6DX, please refer to the advanced timing and synchronization features listed for the two cards in the following documents. Again, we recommend using the CX6DX.
I tested it with the 23-4 and had the same problem, unfortunately. Here is my lscpu info:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 80
Socket(s): 1
Stepping: r3p1
Frequency boost: disabled
CPU max MHz: 3000.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
L1d: 5 MiB (80 instances)
L1i: 5 MiB (80 instances)
L2: 80 MiB (80 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-79
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Mitigation; CSV2, BHB
Srbds: Not affected
Tsx async abort: Not affected
Regarding the NIC, I hadn’t really noticed those differences. But do you think this problem could be caused by that alone? I will try to get a CX6DX (I only have the CX6LX and CX4LX).
Thank you for the CPU info capture and the notes. This SMC server has Ampere Arm cores, right? If so, you can use the kernel recommended for x86.
Yes, the problem you are seeing could be related to the NIC card. We can look into this issue to confirm once you have the CX6DX.
@emelao sorry that you could not find the kernel linux-image-5.15.0-1042-nvidia-lowlatency for ARM. You can try kernel “5.15.0-58-lowlatency” on your setup. For the problem of cuphycontroller exiting with “err=DOCA_ERROR_DRIVER” in your first message, please run “sudo modprobe nvidia-peermem” once after launching the Aerial cuBB container and before running any cuphycontroller_scf command. Please let us know the outcome.
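A quick way to act on that advice is to check whether the module is already loaded before running anything. This sketch only inspects /proc/modules and prints the suggested modprobe command rather than invoking it (modprobe needs root and NVIDIA hardware); the module name nvidia_peermem comes from the loaded-modules list in the system check output above.

```shell
# Check for nvidia_peermem in the running kernel's module list.
if grep -qw '^nvidia_peermem' /proc/modules 2>/dev/null; then
  echo "nvidia_peermem: loaded"
else
  echo "nvidia_peermem: not loaded; run: sudo modprobe nvidia-peermem"
fi
```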
Thanks!