I have a problem: my tasks crash when using several computing nodes. The problem is the same as described here: VASP support site: Forums / Bugreports / [SOLVED] VASP 5 crashes when using several computing nodes (large memory) http://cms.mpi.univie.ac.at/vasp-forum/forum_viewtopic.php?3.12037 . The offered solution is to change the memory limits, but the mlx4 driver on our cluster doesn’t have the “log_mtts_per_seg” parameter. Can I change the maximum amount of registerable memory using this driver? Or is the only way to update OFED to version 1.5?
Yes, it is possible to change log_mtts_per_seg. It is a parameter of the mlx4_core kernel module.
You will need to set it in modprobe.conf and reload the driver.
The two parameters you can change are:
options mlx4_core log_num_mtt=
options mlx4_core log_mtts_per_seg=
You can check the default values using modinfo mlx4_core and then change them to higher values.
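To pick values for the two options, it helps to see how they combine. The relation usually quoted for mlx4 (for example in the Open MPI FAQ) is: maximum registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page size. Below is a minimal sketch of that arithmetic. The function name is made up for illustration, the 4096-byte page size is an assumption (check it with getconf PAGE_SIZE), and the sample values are not the defaults of any particular OFED release.

```python
def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Return the registerable-memory ceiling implied by the two mlx4_core options.

    Both options are base-2 logarithms, so each one is turned back into a
    count with a left shift before multiplying by the page size.
    page_size=4096 is an assumption; confirm it on the target machine.
    """
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


if __name__ == "__main__":
    # Illustrative combinations only, not recommended or default settings.
    for log_num_mtt, log_mtts_per_seg in [(20, 0), (20, 3), (24, 3)]:
        limit = max_registerable_bytes(log_num_mtt, log_mtts_per_seg)
        print(f"log_num_mtt={log_num_mtt} log_mtts_per_seg={log_mtts_per_seg} "
              f"-> {limit / 2**30:.0f} GiB")
```

Raising either option by 1 doubles the limit, so a driver that only exposes one of the two can still be tuned through that one.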
Unfortunately, the mlx4_core kernel module of OFED-1.3 doesn’t have this parameter. As far as I know, it appeared only in OFED-1.5. So modifying modprobe.conf with the log_mtts_per_seg option causes an error and the driver doesn’t load.
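One way to avoid a failed driver load like this is to check which parameters the installed module exposes before editing modprobe.conf. The sketch below assumes that modinfo -p prints one parameter per line in the form name:description; the function names are hypothetical, and the output format may differ on older module-init-tools versions.

```python
import subprocess


def parse_param_names(modinfo_output):
    """Extract parameter names from text assumed to look like 'name:description'."""
    names = set()
    for line in modinfo_output.splitlines():
        name, sep, _description = line.partition(":")
        # Keep only lines that actually contain a colon and a non-empty name.
        if sep and name.strip():
            names.add(name.strip())
    return names


def module_has_param(module, param):
    """Run 'modinfo -p <module>' and report whether <param> is listed."""
    result = subprocess.run(
        ["modinfo", "-p", module],
        capture_output=True, text=True, check=True,
    )
    return param in parse_param_names(result.stdout)
```

Calling module_has_param("mlx4_core", "log_mtts_per_seg") on a node would then tell you whether the option is safe to add; only options that show up in the list should go into modprobe.conf.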
Is there a way to change the maximum amount of registerable memory in the old mlx4 driver?
I see. I am not sure if those were exposed at all in OFED 1.3. I can’t really give you a good answer.
Any chance you can upgrade the driver?
Unfortunately, OFED 1.5 doesn’t support our OS (SLES 10 SP1) and we can’t upgrade it. It seems that we need to install another Linux system on our cluster…