Kernel parameters for a different server

I would like to ask about the cuBB installation guide.

The following kernel parameters are for Dell R750 with Xeon Gold 6336Y CPU and 512GB memory.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="[^"]*/& pci=realloc=off default_hugepagesz=1G hugepagesz=1G hugepages=16 tsc=reliable clocksource=tsc intel_idle.max_cstate=0 mce=ignore_ce processor.max_cstate=0 intel_pstate=disable audit=0 idle=poll rcu_nocb_poll nosoftlockup iommu=off irqaffinity=0-3 isolcpus=managed_irq,domain,4-47 nohz_full=4-47 rcu_nocbs=4-47 noht numa_balancing=disable/' /etc/default/grub

I am using an HPE ProLiant DL380 Gen10 with an Intel Xeon Gold 6248 CPU, 40 CPUs, and 512GB memory.
Can you provide guidance on setting the kernel parameters for this server?

Hi @twoheons ,

We don't have that server model, but from the kernel's perspective the difference is generally only the number of CPU cores and how the core numbers are assigned to each CPU socket. On the R750, even-numbered cores are on socket 0 and odd-numbered cores are on socket 1. So you may need to update irqaffinity, isolcpus, nohz_full, and rcu_nocbs based on the core assignment on the DL380.
You should reserve at least 2-4 cores for the kernel and OS (i.e., you may need to update irqaffinity) and can isolate the remaining cores for Aerial workloads (i.e., you may need to update isolcpus, nohz_full, and rcu_nocbs).
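
As a rough illustration of the split described above, here is a minimal sketch that derives the four CPU-related parameters from a total CPU count and a housekeeping core count. The helper name build_isolation_params is hypothetical, and it assumes a simple contiguous split (the first cores for the OS, the rest isolated). That may not be the best choice if the DL380 interleaves core numbers across sockets the way the R750 does.

```shell
#!/bin/sh
# Hypothetical helper (not part of cuBB): build the CPU-related kernel
# parameters for a machine with TOTAL logical CPUs, keeping the first
# HOUSEKEEPING CPUs for the kernel and OS.
# Assumption: contiguous split, 0..H-1 for the OS and H..TOTAL-1 isolated.
build_isolation_params() {
    total=$1
    housekeeping=$2
    last_housekeeping=$((housekeeping - 1))
    isolated="${housekeeping}-$((total - 1))"
    printf 'irqaffinity=0-%s isolcpus=managed_irq,domain,%s nohz_full=%s rcu_nocbs=%s\n' \
        "$last_housekeeping" "$isolated" "$isolated" "$isolated"
}

# Example: 40 logical CPUs, 4 kept for the kernel and OS.
build_isolation_params 40 4
```

Running `lscpu --extended` on the target server lists each logical CPU with its NUMA node and socket, which shows whether a contiguous range is appropriate or whether the lists need to be adjusted by hand.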

Thank you.

sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="[^"]*/& pci=realloc=off default_hugepagesz=1G hugepagesz=1G hugepages=16 tsc=reliable clocksource=tsc intel_idle.max_cstate=0 mce=ignore_ce processor.max_cstate=0 intel_pstate=disable audit=0 idle=poll rcu_nocb_poll nosoftlockup iommu=off irqaffinity=0-3 isolcpus=managed_irq,domain,4-39 nohz_full=4-39 rcu_nocbs=4-39 noht numa_balancing=disable/' /etc/default/grub

If I change only the CPU ranges as above, the following error occurs on reboot:
"end Kernel panic - not syncing: Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer"

If I remove "iommu=off" from the kernel parameters, the system boots normally.
I don't understand why this happens, but I am reporting it in case the information is useful.
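
When comparing boots with and without a parameter, it can help to confirm what the running kernel actually received. The following is a small sketch; the helper name has_param is hypothetical, and it only checks for an exact whole-word match on the command line.

```shell
#!/bin/sh
# Hypothetical helper: succeed if PARAM ($2) appears as a whole,
# space-separated word in the command line string CMDLINE ($1).
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the running kernel's command line (Linux only; empty elsewhere).
cmdline=$(cat /proc/cmdline 2>/dev/null || echo "")
if has_param "$cmdline" "iommu=off"; then
    echo "iommu=off is on the kernel command line"
else
    echo "iommu=off is not on the kernel command line"
fi
```

On a boot that succeeds, `sudo dmesg | grep -i -e swiotlb -e iommu` may also show how the kernel set up the bounce buffer and the IOMMU.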

You may need to check some BIOS settings related to MMIO. The error is obviously associated with the MMIO regions for SWIOTLB and DMA. We use large-memory GPUs and high-speed NICs, and they require significant MMIO space. For the Dell R750, we recommend the following settings to enable this hardware. Please check the equivalent BIOS settings on your server.

I can't find a Memory Mapped I/O BIOS setting on the HPE server.
Will removing iommu=off and leaving the MMIO-related BIOS settings unchanged cause problems later on?

Without iommu=off, Aerial performance may be affected. We don't recommend running Aerial with the IOMMU enabled.

What Linux kernel version are you using?

Thank you for your reply.
Here it is.

yogurt@134servwe:~$ hostnamectl
Static hostname: 134servwe
Icon name: computer-server
Chassis: server
Machine ID: cc6e6e34367c4119b818f192dcde1143
Boot ID: 676030c48f644598b09e980bd7b50a5b
Operating System: Ubuntu 22.04.5 LTS
Kernel: Linux 5.15.0-1042-nvidia-lowlatency
Architecture: x86-64
Hardware Vendor: HPE
Hardware Model: ProLiant DL380 Gen10

@twoheons
As @nhashimoto confirmed, we don't have this server model, so we cannot check its BIOS settings or confirm how to fix the issue you are facing. You might need to ask HPE how to enable large-memory GPUs without the IOMMU.

Thanks,

Thanks for your reply.
If I do not set "iommu=off", does Aerial fail outright, or is there a performance degradation that delays Aerial operation?

Yes, there is a significant performance issue. Aerial won't be able to run properly without iommu=off configured.
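
To see whether the IOMMU is actually active on a given boot, one option is to count the entries under /sys/kernel/iommu_groups. This is a sketch under the assumption that an empty or missing directory means no IOMMU is in use; the helper name count_iommu_groups is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the entries in an IOMMU groups directory.
# Defaults to the Linux sysfs location. A count of 0 suggests that no
# IOMMU is active (assumption: an empty or missing directory means off).
count_iommu_groups() {
    dir=${1:-/sys/kernel/iommu_groups}
    if [ -d "$dir" ]; then
        ls -A "$dir" | wc -l | tr -d ' '
    else
        echo 0
    fi
}

echo "IOMMU groups: $(count_iommu_groups)"
```

A nonzero count on a boot without iommu=off would be consistent with the IOMMU being enabled, which is the configuration the replies above advise against for Aerial.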