I did a NIST sanitize to start fresh. Now it seems it doesn't have the proper keys loaded for all of the DAC RoCE full-speed memory sharing with its cluster partner. I upgraded, and now it keeps rebooting with 0000000000000
0000000000000
when it reboots, and it just reboots at random.
At which stage of the boot process do you see this message? Can you share a picture or screenshot?
Thanks,
Jim
Missing BaseOS key?
Does your Spark forcefully reboot or do you only see these messages when you look at the logs? I’m not seeing any other issue.
In the last image you sent, I see a script output. Is this a custom script you wrote? Neither nvidia-peermem nor a BaseOS key are required on Spark.
It's a custom script I wrote to do checks on modules, or am I checking the modules … I was trying to get higher speeds than 13gb/s between nodes for large-scale compression testing, and the peer mem thing pops up. I'm just trying to set it up as head node / worker node with NFS shares across them, so they have the same files to run the Python scripts against, with llama loaded in the background, for building out some of the model tuning for my app.
Yes, it just randomly reboots now with a bunch of 0000000000000000000
000000000000000000
0000000000000000000
It keeps going until Ubuntu loads.
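For anyone following along, a module check like the custom script mentioned above could look roughly like the sketch below. This is not the original script, which was not posted; the function names and the list of modules to look for are assumptions made for illustration. It reads /proc/modules, where the kernel lists loaded modules with underscores in their names. As noted earlier in the thread, nvidia-peermem is not required on Spark, so a "not loaded" result for it is not an error by itself.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Return the names of loaded kernel modules from /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first column of each line is the module name.
            names.add(fields[0])
    return names


def check_modules(wanted, proc_modules_text):
    """Map each wanted module name to True if it is loaded, else False."""
    loaded = loaded_modules(proc_modules_text)
    # The kernel reports names with underscores, so normalise hyphens first.
    return {name: name.replace("-", "_") in loaded for name in wanted}


if __name__ == "__main__":
    # Hypothetical list of modules to look for; adjust to your own setup.
    wanted = ["nvidia", "mlx5_core", "ib_core", "nvidia-peermem"]
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        results = check_modules(wanted, proc_modules.read_text())
        for name, present in results.items():
            print(f"{name}: {'loaded' if present else 'not loaded'}")
    else:
        print("/proc/modules not found (not a Linux system?)")
```

Running it on both the head node and the worker node and comparing the output is one simple way to confirm the two systems have the same modules loaded.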
Thanks for the logs. Can you run this command and send me the logs so we can debug this better?
sudo dmidecode -t 45





