Hey all,
I am working on a project that involves implementing the InfiniBand architecture in hardware, along with a corresponding library that wraps the InfiniBand verbs in developer-friendly functions. In our previous releases we were running RedHat 7.9, which was fully functional and working as expected. With our next release, however, we will be upgrading to RedHat 8.6. In preparation for that, we have begun creating the system drives and testing for the major release. There is one major bug that I have put a large amount of time and effort into but can’t seem to pin down, and it keeps getting weirder and weirder.
To preface everything, I have added the following entries to /etc/security/limits.conf so that the amount of memory a process can lock is unlimited (along with a few other settings):
* hard memlock unlimited
* soft memlock unlimited
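One way to confirm which limit a process actually ends up with, independent of what the configuration file says, is to query RLIMIT_MEMLOCK from inside the process. The following is only a minimal sketch; the function name memlock_limit_kb is made up for illustration and is not part of any existing library.

```c
#include <sys/resource.h>

/* Hypothetical helper: reports the soft and hard RLIMIT_MEMLOCK of the
 * calling process in kilobytes; -1 stands for "unlimited".
 * Returns 0 on success, -1 if getrlimit() fails. */
int memlock_limit_kb(long long *soft_kb, long long *hard_kb)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;

    *soft_kb = (rl.rlim_cur == RLIM_INFINITY) ? -1 : (long long)(rl.rlim_cur / 1024);
    *hard_kb = (rl.rlim_max == RLIM_INFINITY) ? -1 : (long long)(rl.rlim_max / 1024);
    return 0;
}
```

Calling this at startup and logging both values would show whether the limits.conf entries reached that particular process. If the soft value is lower than the hard value, the process could in principle raise it with setrlimit; if both are low, the limit was already restricted before the process started.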
For reference, there are two main ways a developer can start our software: Executing a script via command line or using a dropdown menu on the GUI.
When we are starting the software via the dropdown menu, the dropdown button is able to execute a script that will start the software with no problems. The InfiniBand verbs can be started, but the caveat to this is we are unable to start our sims via the dropdown. Technically, we could for a developer environment, but due to the option of several arguments on the simulation scripts, we would not want to start them this way.
What we want to be able to do is use the command line to start our software and our simulations in one script, which is how we have it implemented currently. However, this is where the problem arises: when we start our processes this way, they are able to allocate the memory block (simply a char array) via malloc, but fail when attempting to register that array as an ibv_mr with the InfiniBand verb “ibv_reg_mr”. The function returns NULL, indicating that the verb call failed and the memory region was not created. This same call worked correctly on RedHat 7.9. Because ibv_reg_mr sets “errno” on failure, I printed that variable and see error code 12 (ENOMEM), which corresponds to the error string “Cannot allocate memory”.
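To separate the InfiniBand stack from the locked-memory limit itself, a small probe that does not use verbs at all can help. The sketch below uses plain mlock() as a stand-in for ibv_reg_mr, on the assumption that both are subject to RLIMIT_MEMLOCK for an unprivileged process; the function name try_lock_buffer is hypothetical, and the exact size at which each call starts failing may differ because the kernel accounts for them separately.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical probe: allocates len bytes, tries to pin them with mlock(),
 * then releases them. Returns 0 on success or the errno value of the
 * failing call. Note that malloc(0) may return NULL and is reported as ENOMEM. */
int try_lock_buffer(size_t len)
{
    char *buf = malloc(len);
    if (buf == NULL)
        return ENOMEM;          /* the allocation itself failed */

    memset(buf, 0, len);        /* touch the pages so they are actually backed */

    int err = 0;
    if (mlock(buf, len) != 0)
        err = errno;            /* ENOMEM is what an exceeded RLIMIT_MEMLOCK reports */
    else
        munlock(buf, len);

    free(buf);
    return err;
}
```

Running this with a few sizes (for example 16 KB, 64 KB and 1 MB) from each kind of terminal, and printing strerror() of the result, would show whether plain memory locking fails with the same “Cannot allocate memory” message in the same situations as the verb call.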
This next part gets a little confusing and makes things a little weirder. If I open a new terminal, then execute an ‘su ’ command, essentially “switching users” to myself, and run the exact same command and steps as above, the function calls work. The same goes if I become the root user (‘su’) and then switch back to myself (‘su ’): everything works and the ibv_reg_mr calls are able to associate the allocated char array with the ibv_mr. Additionally, if I reduce the size of the char buffer to less than ~20000 bytes without executing the “su” command, the allocation and the ibv_reg_mr association both succeed.
One thing to note: when I open a new terminal and run the command “limit”, I see the following:
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 64 kbytes
maxproc 1055744
maxlocks unlimited
maxsignal 1539797
maxmessage 819200
maxnice 0
maxrtprio 0
maxrttime unlimited
This is mostly what I would expect except for the “memorylocked 64 kbytes” part of it. As I understand, this means a process is unable to lock more than 64KB for its own use, which should’ve been taken care of in the /etc/security/limits.conf file already. However, if I either execute a “su ” command or ssh in from another machine/server, I see the following when running the “limit” command:
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked unlimited
maxproc 1055744
maxlocks unlimited
maxsignal 1539797
maxmessage 819200
maxnice 0
maxrtprio 0
maxrttime unlimited
Notice that the “memorylocked” is “unlimited”. This leads me to believe that the issue I am seeing has to do more with the initial terminal and not necessarily a RedHat 8 issue.
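Since resource limits are inherited from the parent process across fork and exec, it can help to look at the limits of processes further up the chain rather than only the current shell. On Linux these are exposed in /proc/<pid>/limits. The following is a rough sketch that pulls out the “Max locked memory” line for a given PID; the function name locked_memory_line is made up for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copies the "Max locked memory" line of
 * /proc/<pid>/limits into out. Returns 0 on success, -1 if the file
 * cannot be opened (e.g. no permission) or the line is not found. */
int locked_memory_line(pid_t pid, char *out, size_t n)
{
    char path[64], line[256];
    static const char key[] = "Max locked memory";
    int found = -1;

    snprintf(path, sizeof path, "/proc/%ld/limits", (long)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, key, sizeof key - 1) == 0) {
            line[strcspn(line, "\n")] = '\0';   /* drop the trailing newline */
            snprintf(out, n, "%s", line);
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling it with getppid() from a program started in each terminal shows the limits of the shell that launched it. Passing the PIDs of the terminal emulator and of whatever started it (taken from ps) could show at which point in the chain the 64 KB value first appears.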
Because the dropdown menu works, and the same terminal works after running an “su” command, my team and I believe we have narrowed it down to something being set up incorrectly for the initial login shell. This could be in ~/.cshrc, the /etc/profile* files, or some other place that I am not aware of. Given all of this, I am looking for a reason why a terminal opened in the default manner would end up with different limits than a shell started by executing “su” to the same user I am already logged in as.
I’m really looking for some general technical support on this, so any input you have would be greatly appreciated! I will be happy to answer any questions you may have or clarify anything above.
Thanks in advance!