InfiniBand on RedHat 8 and memorylocked limit

Hey all,

I am working on a project that involves implementing the InfiniBand architecture in hardware and then building a corresponding library that wraps the InfiniBand verbs in developer-friendly functions. Our previous releases ran on RedHat 7.9, where everything was fully functional and working as expected. With our next release, however, we will be upgrading to RedHat 8.6, and in preparation we have begun creating the system drives and testing for the major release. There is one major bug that I have put a large amount of time and effort into but can’t seem to pin down, and it keeps getting weirder and weirder.

To preface everything, I have set the following in /etc/security/limits.conf so that the amount of memory a process can lock is unlimited (along with a few other settings):

* hard memlock unlimited
* soft memlock unlimited

For reference, there are two main ways a developer can start our software: Executing a script via command line or using a dropdown menu on the GUI.

When we are starting the software via the dropdown menu, the dropdown button is able to execute a script that will start the software with no problems. The InfiniBand verbs can be started, but the caveat to this is we are unable to start our sims via the dropdown. Technically, we could for a developer environment, but due to the option of several arguments on the simulation scripts, we would not want to start them this way.

What we want to be able to do is use the command line to start our software and our simulations in one script, which is what we have implemented currently. However, this is where the problem arises: when we start our processes in this manner, they are able to allocate the memory block (simply a char array) via malloc, but run into an issue when attempting to associate this char array with the ibv_mr that would be created by the InfiniBand verb “ibv_reg_mr”. The function returns NULL, indicating that the verb call failed and the memory region does not exist. This same function call worked correctly on RedHat 7.9. Given that this is a Linux library call, the failure reason is reported in the “errno” variable. When printing this variable, I see an error code of 12 (ENOMEM), which corresponds to the error string “Cannot allocate memory”.
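One way to check whether this failure is a locked-memory limit rather than something InfiniBand-specific is a small probe like the one below. It is only a sketch, written in Python so it can be run without the verbs library: mlock() is also checked against RLIMIT_MEMLOCK, although the kernel does not necessarily account mlock'ed pages and RDMA-pinned pages in the same counter, so the result is a hint and not proof. The helper name try_mlock is made up for illustration.

```python
import ctypes
import errno
import os


def try_mlock(nbytes):
    """Try to mlock() a fresh buffer of nbytes.

    Returns (ok, errno_name, message). This is a stand-in probe,
    not a call into libibverbs.
    """
    libc = ctypes.CDLL(None, use_errno=True)  # the C library of this process
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.munlock.restype = ctypes.c_int

    buf = ctypes.create_string_buffer(nbytes)  # comparable to a malloc'ed char array
    addr = ctypes.addressof(buf)
    if libc.mlock(addr, nbytes) == 0:
        libc.munlock(addr, nbytes)
        return True, None, None

    code = ctypes.get_errno()
    return False, errno.errorcode.get(code, str(code)), os.strerror(code)


if __name__ == "__main__":
    for size in (16 * 1024, 1024 * 1024):
        print(size, try_mlock(size))
```

If the larger size fails with ENOMEM in the terminal where ibv_reg_mr fails, and succeeds in a terminal where ibv_reg_mr works, that would point at the limit of the shell rather than at the verbs code.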

This next part gets a little bit confusing and makes things a little weirder. If I open a new terminal, then execute an ‘su ’ command, essentially “switching users” to myself, and run the exact same command and steps as above, the function calls work. The same goes if I become the root user (‘su’) and then switch back to myself (‘su ’): everything works, and the ibv_reg_mr calls are able to associate the allocated char array with the ibv_mr. Additionally, if I reduce the size of the char buffer to less than ~20000 bytes without executing the “su” command, the allocation and the ibv_reg_mr association are successful.
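The size dependence can be reasoned about with a rough budget calculation. The sketch below assumes a 4096-byte page size, a page-aligned buffer, and that registration pins whole pages; the helper names and the already_pinned parameter are hypothetical, and the kernel's real accounting may differ.

```python
import resource


def pinned_bytes(nbytes, page_size=4096):
    """Round a buffer size up to whole pages (assumes a page-aligned start)."""
    pages = -(-nbytes // page_size)  # ceiling division
    return pages * page_size


def fits_memlock(nbytes, soft_limit, already_pinned=0, page_size=4096):
    """Rough check: would pinning nbytes more stay within the soft limit?"""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return already_pinned + pinned_bytes(nbytes, page_size) <= soft_limit


if __name__ == "__main__":
    limit = 64 * 1024  # the "memorylocked 64 kbytes" case
    print(pinned_bytes(20000))         # 5 pages of 4096 bytes
    print(fits_memlock(20000, limit))
    print(fits_memlock(100000, limit))
```

A working threshold well below 64 KB would be consistent with other resources in the same process also counting against that budget, but that is a guess and not something verified here.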

One thing to note, when I open a new terminal and run the command “limit”, I see the following:

cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024
memorylocked 64 kbytes
maxproc      1055744
maxlocks     unlimited
maxsignal    1539797
maxmessage   819200
maxnice      0
maxrtprio    0
maxrttime    unlimited

This is mostly what I would expect, except for the “memorylocked 64 kbytes” line. As I understand it, this means a process is unable to lock more than 64 KB for its own use, which should already have been taken care of by the /etc/security/limits.conf settings. However, if I either execute a “su ” command or ssh in from another machine/server, I see the following when running the “limit” command:

cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024
memorylocked unlimited
maxproc      1055744
maxlocks     unlimited
maxsignal    1539797
maxmessage   819200
maxnice      0
maxrtprio    0
maxrttime    unlimited

Notice that “memorylocked” is now “unlimited”. This leads me to believe that the issue I am seeing has more to do with the initial terminal than with RedHat 8 itself.
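The same comparison can be made from inside a process instead of from the shell. The sketch below uses Python's standard resource module to read RLIMIT_MEMLOCK for whatever process runs it; format_limit and memlock_report are illustrative names, and the output format only loosely mimics the “limit” listing.

```python
import resource


def format_limit(value):
    """Render a byte limit similarly to csh's `limit` output."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kbytes"


def memlock_report():
    """Return the soft and hard locked-memory limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return f"memorylocked soft={format_limit(soft)} hard={format_limit(hard)}"


if __name__ == "__main__":
    print(memlock_report())
```

Running it once from the initial terminal and once after “su” shows what a child process launched from each shell actually inherits.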

Because the dropdown menu works, and because the same terminal works after an “su” command, my team and I believe we have narrowed it down to something being set up incorrectly in the initial login shell. This could be ~/.cshrc, the /etc/profile* files, or some other place that I am unsure of. Given all of this, I am looking for a reason why a terminal opened in the default manner would behave differently from one in which I have executed an “su” command to the same user I was already operating as.
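To see where the 64 KB value enters, it may help to walk up the process tree and print each ancestor's locked-memory limit from /proc/<pid>/limits. The sketch below does that; it assumes a Linux /proc filesystem, the function names are made up, and ancestors owned by other users may not be readable. As far as I understand, limits.conf is applied by pam_limits when a PAM session is opened, so an ancestor that was not started through such a session would keep whatever limit it inherited.

```python
import os


def parse_locked_limit(limits_text):
    """Return (soft, hard) strings from the 'Max locked memory' row, or None."""
    for line in limits_text.splitlines():
        if line.startswith("Max locked memory"):
            fields = line[len("Max locked memory"):].split()
            return fields[0], fields[1]
    return None


def parse_status(status_text):
    """Return (name, ppid) from the text of /proc/<pid>/status."""
    name, ppid = "?", 0
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("PPid:"):
            ppid = int(line.split(":", 1)[1])
    return name, ppid


def ancestor_limits(pid=None):
    """Walk from pid towards PID 1, collecting (pid, name, limit) rows."""
    pid = os.getpid() if pid is None else pid
    rows = []
    while pid > 0:
        try:
            with open(f"/proc/{pid}/status") as f:
                name, ppid = parse_status(f.read())
        except OSError:
            break  # no /proc entry, or the process has gone away
        try:
            with open(f"/proc/{pid}/limits") as f:
                limit = parse_locked_limit(f.read())
        except OSError:
            limit = None  # another user's process may not be readable
        rows.append((pid, name, limit))
        pid = ppid
    return rows


if __name__ == "__main__":
    for pid, name, limit in ancestor_limits():
        print(pid, name, limit)
```

The first ancestor whose row shows 65536 instead of unlimited is a good candidate for the place where the limits.conf settings are not being applied.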

I’m really looking for some general technical support on this, so any input you have would be greatly appreciated! I will be happy to answer any questions you may have to clarify anything above.

Thanks in advance!

Hello andrew.ward,

Welcome, and thank you for posting your inquiry to the NVIDIA Developer Forums!

We recommend engaging our Enterprise support team for the PRM (Programmers Reference Manual) for InfiniBand verbs documentation.

Regarding login profiles and limit settings, this is not something that NVIDIA controls - we recommend engaging your OS vendor for support here.

Please do reach out to our Enterprise Support team via the following link to request access to the PRM: https://enterprise-support.nvidia.com/s/create-case

Thanks, and best regards,
NVIDIA Enterprise Experience
