Problem with memory allocation inside NVIDIA Aerial

I’ve been trying to run NVidia Aerial and have encountered the following problem:

[14:14:40:690128][48][DOCA][ERR][linux_mapped_user_memory.cpp:75][linux_mapped_user_memory] Failed to register user memory. Got errno: Cannot allocate memory
[14:14:40:690178][48][DOCA][ERR][doca_mmap.cpp:167][priv_doca_mmap_dev_to_mkey_init_mkey] Failed to initialize mkey: failed to create memory region with exception:
[14:14:40:690194][48][DOCA][ERR][doca_mmap.cpp:167][priv_doca_mmap_dev_to_mkey_init_mkey] DOCA exception [DOCA_ERROR_DRIVER] with message Failed to register user memory
[14:14:40:690201][48][DOCA][ERR][doca_mmap.cpp:313][priv_doca_mmap_init_dev_to_mkey] Mmap 0x248c0c03180: Failed to initialize device=0x248a74f9c40. err=DOCA_ERROR_DRIVER
[14:14:40:690206][48][DOCA][ERR][doca_mmap.cpp:350][priv_doca_mmap_init_dev_to_mkeys] Mmap 0x248c0c03180: Failed to initialize memory range. Failed to register MR for device with id: 1. err=DOCA_ERROR_DRIVER
14:14:40.690206 ERR phy_init 0 [AERIAL_INVALID_PARAM_EVENT] [FH.DOCA] doca_mmap_start error
14:14:40.690206 ERR phy_init 0 [AERIAL_INVALID_PARAM_EVENT] [FH.DOCA] doca_mmap_start error
14:14:40.690208 ERR phy_init 0 [AERIAL_DPDK_API_EVENT] [FH.NIC] Could not alloc flow DOCA tx buffer

I’m running this on an ARM machine, so the commands I’m running are:

export CUDA_DEVICE_MAX_CONNECTIONS=8
export CUDA_MPS_PIPE_DIRECTORY=/var
export CUDA_MPS_LOG_DIRECTORY=/var
sudo -E echo quit | sudo -E nvidia-cuda-mps-control
sudo -E nvidia-cuda-mps-control -d
sudo -E echo start_server -uid 0 | sudo -E nvidia-cuda-mps-control
sudo -E LD_LIBRARY_PATH=/opt/mellanox/dpdk/lib/aarch64-linux-gnu:/opt/mellanox/doca/lib/aarch64-linux-gnu $cuBB_SDK/build/cuPHY-CP/cuphycontroller/examples/cuphycontroller_scf P5G_FXN

It seems to be failing when calling the function doca_mmap_start. My problem doesn’t seem to be related to resizable BAR (which I already have enabled). Has anyone experienced this and have any idea how to solve it?
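In case it’s relevant, the only other causes I can think of for this kind of registration ENOMEM are the locked-memory limit and free hugepages. A quick sketch of those checks is below; the `free_hugepages` helper and the threshold of 8 are my own illustration, not values from the Aerial documentation:

```shell
#!/bin/sh
# Pre-flight checks for "Cannot allocate memory" from memory registration.
# The helper and the threshold below are illustrative, not from Aerial docs.

# 1. Locked-memory limit: registering NIC/GPU buffers needs a large
#    (ideally unlimited) RLIMIT_MEMLOCK inside the container.
echo "memlock limit: $(ulimit -l)"

# 2. Free hugepages: DPDK/DOCA buffers come from hugepages; parse
#    HugePages_Free out of a meminfo-style file.
free_hugepages() {
    awk '/^HugePages_Free:/ {print $2}' "$1"
}

free=$(free_hugepages /proc/meminfo)
if [ "${free:-0}" -lt 8 ]; then
    echo "warning: only ${free:-0} free hugepages; consider raising hugepages= on the kernel command line"
fi
```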

@emelao would you please run the commands below in the container and share the outputs with us?
cd /opt/nvidia/cuBB/cuPHY/util/cuBB_system_checks
sudo -E python3 ./cuBB_system_checks.py

Here it is:

-----General--------------------------------------
Hostname                           : supermicro
IP address                         : 10.30.1.18
Linux distro                       : "Ubuntu 22.04.4 LTS"
Linux kernel version               : 6.2.0-1012-nvidia
-----Kernel Command Line--------------------------
Audit subsystem                    : audit=0
Clock source                       : N/A
HugePage count                     : hugepages=32
HugePage size                      : hugepagesz=512M
CPU idle time management           : idle=poll
Max Intel C-state                  : N/A
Intel IOMMU                        : N/A
IOMMU                              : N/A
Isolated CPUs                      : N/A
Corrected errors                   : N/A
Adaptive-tick CPUs                 : nohz_full=4-47
Soft-lockup detector disable       : nosoftlockup
Max processor C-state              : processor.max_cstate=0
RCU callback polling               : rcu_nocb_poll
No-RCU-callback CPUs               : rcu_nocbs=4-47
TSC stability checks               : tsc=reliable
-----CPU------------------------------------------
CPU cores                          : 80
Thread(s) per CPU core             : 1
CPU MHz:                           : N/A
CPU sockets                        : 1
-----Environment variables------------------------
CUDA_DEVICE_MAX_CONNECTIONS        : 8
cuBB_SDK                           : /opt/nvidia/cuBB
-----Memory---------------------------------------
HugePage count                     : 32
Free HugePages                     : 26
HugePage size                      : 524288 kB
Shared memory size                 : 128G
-----Nvidia GPUs----------------------------------
GPU driver version                 : 535.129.03
CUDA version                       : 12.2
GPU0
  GPU product name                 : NVIDIA A100 80GB PCIe
  GPU persistence mode             : Disabled
  Current GPU temperature          : 44 C
  GPU clock frequency              : 1410 MHz
  Max GPU clock frequency          : 1410 MHz
  GPU PCIe bus id                  : 00000000:01:00.0
-----GPUDirect topology---------------------------
	GPU0	NIC0	NIC1	NIC2	NIC3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	SYS	SYS	SYS	0-79	0		N/A
NIC0	SYS	 X 	PIX	SYS	SYS				
NIC1	SYS	PIX	 X 	SYS	SYS				
NIC2	SYS	SYS	SYS	 X 	PIX				
NIC3	SYS	SYS	SYS	PIX	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3


-----Mellanox NICs--------------------------------
-----Mellanox NIC Interfaces----------------------
-----Linux PTP------------------------------------


-----Software Packages----------------------------
cmake       /usr/local/bin         : 3.25.1
docker                             : N/A
gcc         /usr/bin               : 11.4.0
git-lfs     /usr/bin               : 3.0.2
MOFED                              : N/A
meson       /usr/bin               : 0.61.2
ninja       /usr/bin               : 1.10.2
ptp4l                              : N/A
-----Loaded Kernel Modules------------------------
GDRCopy                            : gdrdrv
GPUDirect RDMA                     : nvidia_peermem
Nvidia                             : nvidia
-----Non-persistent settings----------------------
VM swappiness                      : vm.swappiness = 0
VM zone reclaim mode               : vm.zone_reclaim_mode = 0
-----Docker images--------------------------------

And just to be clear: I’m not using all of the recommended hardware. I’m using a ConnectX6-LX, an ARM processor, and an A100 GPU. I was following the Grace Hopper procedure (since it also uses an ARM processor), adjusting it to my hardware (such as the ConnectX6-LX).

Thanks for sharing the detailed info. Aerial 24-1 hasn’t been tested on any ARM-core server + A100 combination other than the Grace Hopper server, so we’d like to hear further updates from your setup. Can you share the lscpu output for your server?
Regarding CX6-LX vs. CX6DX, please refer to the advanced timing and synchronization features listed for the two cards in the following documents. Again, we recommend using the CX6DX.

I tested it with the 23-4 and had the same problem, unfortunately. Here is my lscpu info:

Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 80
  On-line CPU(s) list:  0-79
Vendor ID:              ARM
  Model name:           Neoverse-N1
    Model:              1
    Thread(s) per core: 1
    Core(s) per socket: 80
    Socket(s):          1
    Stepping:           r3p1
    Frequency boost:    disabled
    CPU max MHz:        3000.0000
    CPU min MHz:        1000.0000
    BogoMIPS:           50.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):    
  L1d:                  5 MiB (80 instances)
  L1i:                  5 MiB (80 instances)
  L2:                   80 MiB (80 instances)
NUMA:                   
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-79
Vulnerabilities:        
  Gather data sampling: Not affected
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Mitigation; CSV2, BHB
  Srbds:                Not affected
  Tsx async abort:      Not affected

Regarding the NIC, I hadn’t really noticed those differences. But do you think this problem could be caused by that alone? I will try to get a CX6DX (I only have the CX6-LX and CX4-LX).

Thank you for the CPU info capture and the notes. This SMC server has Ampere ARM cores, right? If so, you can use the kernel recommended for x86.
Yes, the problem you are seeing could be related to the NIC card. We can look into this issue to confirm once you have the CX6DX.

There is no linux-image-5.15.0-1042-nvidia-lowlatency package for ARM; at least I couldn’t find it.

$ sudo apt-get install -y linux-image-5.15.0-1042-nvidia-lowlatency
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package linux-image-5.15.0-1042-nvidia-lowlatency
E: Couldn't find any package by glob 'linux-image-5.15.0-1042-nvidia-lowlatency'
E: Couldn't find any package by regex 'linux-image-5.15.0-1042-nvidia-lowlatency'
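To see which lowlatency kernel images the archive actually offers here, something like the sketch below should work; the filter regex is just my guess at Ubuntu’s kernel package naming (e.g. linux-image-5.15.0-58-lowlatency):

```shell
#!/bin/sh
# Filter an apt package listing down to lowlatency kernel images.
# The regex is an assumption about Ubuntu's kernel package naming
# (it also admits the -nvidia-lowlatency variants); adjust as needed.
lowlatency_kernels() {
    grep -E '^linux-image-[0-9][0-9.]*-[0-9]+-(nvidia-)?lowlatency$'
}

# Usage on the server (requires apt):
#   apt-cache pkgnames linux-image | lowlatency_kernels
```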

@emelao sorry that you could not find the kernel linux-image-5.15.0-1042-nvidia-lowlatency for ARM. You can try kernel “5.15.0-58-lowlatency” on your setup. For the problem of cuphycontroller exiting with “err=DOCA_ERROR_DRIVER” in your first message, please run “sudo modprobe nvidia-peermem” inside the Aerial cuBB container, after launching the container and before running any cuphycontroller_scf command. Please let us know the outcome.
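Putting that together, a minimal sketch of the suggested sequence is below. The `module_loaded` helper is only an illustrative check; the module and binary names come from this thread:

```shell
#!/bin/sh
# Inside the Aerial cuBB container: load nvidia-peermem before any
# cuphycontroller_scf run, then verify it actually registered.
# The module_loaded helper is illustrative, not part of Aerial.

module_loaded() {
    # $1 = module name; stdin = lsmod-style listing
    awk -v m="$1" '$1 == m {found=1} END {exit !found}'
}

# Best-effort load; harmless if the module is already loaded.
sudo -n modprobe nvidia-peermem 2>/dev/null || true

# lsmod reports the module with an underscore: nvidia_peermem.
if lsmod 2>/dev/null | module_loaded nvidia_peermem; then
    echo "nvidia_peermem loaded; OK to run cuphycontroller_scf"
else
    echo "nvidia_peermem NOT loaded; run 'sudo modprobe nvidia-peermem' first"
fi
```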
Thanks!