How to verify that notifications from nvsm actually work?

Hi,

A quick question: I’ve configured e-mail notifications from nvsm.
I’d like to test whether they really work. Is there any way to send a “test email”?

Or can I induce some safe, controlled event that will raise a “warning” and send a notification? What could it be?

An additional question: how much time passes between an event occurring and the e-mail notification being sent?

Hello M.Kadlof,

What version of DGX OS and NVSM are you using?

In DGX OS 5.1 / NVSM 21.07.x, we have a command to raise a test alert and clear it.

Generating Test Alert for Email

From within an NVSM CLI interactive session, a user may generate a test alert in order to trigger an SMTP instance and receive an email notification.

Create Testalert

NVSM CLI provides a “create testalert” command to generate a dummy alert that will trigger any SMTP or Call Home defined notification
~$ sudo nvsm create testalert

Within an NVSM CLI interactive session, this basic command will generate a dummy alert with default component_id = Test0 and severity = Warning

To set the component and severity of a test alert, the NVSM CLI also provides the following advanced command,
~$ sudo nvsm create testalert <component_id> <severity>
which lets the test alert match the severity criteria configured for a notification. For example,
~$ sudo nvsm create testalert Email1 Critical
will generate a dummy alert with component_id = Email1 and severity = Critical

Clear Testalert

NVSM CLI also provides a “clear testalert” command to dismiss a generated dummy alert
~$ sudo nvsm clear testalert

Within an NVSM CLI interactive session, this basic command will clear every test alert with component_id = Test0, even if there are several.

To specify which test alert to dismiss, the NVSM CLI also provides the following command,
~$ sudo nvsm clear testalert <component_id>

Show Testalert

To display all generated test alerts, the NVSM CLI provides a “show testalerts” command
~$ sudo nvsm show testalerts

After running “create testalert”, the following output is expected from “show testalerts”
~$ sudo nvsm show testalerts

/systems/localhost/testalerts/alert0
Properties:
system_name = system-name5
message_details = Dummy Test
component_id = Test0
description = No component is reporting an error. This is a test.
event_time = 2021-08-04T15:55:46.926710484-07:00
recommended_action = Please run ‘sudo nvsm clear testalert’ to dismiss this alert.
alert_id = NV-TEST-01
system_serial = To be filled by O.E.M.
message = Test Alert.
severity = Warning
clear_time = -
hidden = false
type = TestAlerts
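If you want to check this output from a script rather than by eye, a small filter can pull out a single property. This is only a sketch: alert_field is a hypothetical helper name, and it assumes the properties are printed as "name = value" lines exactly as in the sample above.

```shell
# alert_field NAME
# Reads "nvsm show testalerts" output on stdin and prints the value of every
# "NAME = value" property line. Assumes the layout shown in the sample output.
alert_field() {
    awk -v key="$1" '
        {
            pos = index($0, " = ")          # first " = " separates name and value
            if (pos == 0) next              # skip lines that are not properties
            name = substr($0, 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # tolerate indented output
            if (name == key) print substr($0, pos + 3)
        }'
}

# Example (not run here):
#   sudo nvsm show testalerts | alert_field severity
```

With more than one test alert present, it prints one value per alert, in the order they are listed.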

It will take less than a minute to generate an alert after an event has occurred.
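Putting the three commands together, a full test of the e-mail path could look roughly like the sketch below. The function name test_alert_roundtrip and the NVSM_CMD override are hypothetical (the override just allows a dry run with NVSM_CMD=echo); the 60-second wait is taken from the "less than a minute" figure above and may need adjusting for your mail setup.

```shell
# Command prefix used to call NVSM. Override for a dry run, e.g. NVSM_CMD=echo.
NVSM_CMD="${NVSM_CMD:-sudo nvsm}"

# test_alert_roundtrip [component_id] [severity] [wait_seconds]
# Raises a test alert, lists it, waits so the notification can go out,
# then clears the alert again.
test_alert_roundtrip() {
    component="${1:-Test0}"      # default component_id of "create testalert"
    severity="${2:-Warning}"     # default severity of "create testalert"
    wait_s="${3:-60}"            # alerts are generated in under a minute

    $NVSM_CMD create testalert "$component" "$severity" || return 1
    $NVSM_CMD show testalerts
    sleep "$wait_s"
    $NVSM_CMD clear testalert "$component"
}

# Example (not run here):
#   test_alert_roundtrip Email1 Critical
```

Check the mailbox before the alert is cleared if you want to compare the e-mail against the "show testalerts" output.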

-Mahendra Yekkar
NVSM Team

Hi!
It looks like the feature I’m missing. Unfortunately, my systems are older:

dgx-release = 5.0.5
nvsm = 20.09.26

Is there any simple and safe way to upgrade the OS and NVSM? Preferably a link to instructions with checklists, step-by-step, etc.

Alternatively, is there a different way to raise a warning without putting the servers at unnecessary risk?
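For anyone else checking whether their system is new enough for the test alert command, the installed version can be compared against 21.07 with something like the sketch below. version_ge is a hypothetical helper; it relies on GNU sort -V, assumes the package is called nvsm, and assumes plain dotted version strings like the ones quoted in this thread. The thread only confirms the command for NVSM 21.07.x, so treat that threshold as a lower bound that is known to work, not as the exact release that introduced it.

```shell
# version_ge A B
# Succeeds when version string A is greater than or equal to B,
# using the version ordering of GNU "sort -V".
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (not run here; assumes the Debian package is named "nvsm"):
#   installed=$(dpkg-query -W -f='${Version}' nvsm)
#   if version_ge "$installed" 21.07; then
#       echo "create testalert should be available"
#   else
#       echo "nvsm $installed is older than 21.07"
#   fi
```

On a Debian-based system, dpkg --compare-versions "$installed" ge 21.07 does the same comparison with dpkg's own version rules.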

M.kadlof,

An update from OS 5.0.5 to OS 5.1 is covered by our regular package update procedure and includes updates for NVSM. The exact steps are covered in the DGX OS 5 user guide, here:

Hope that helps.

The procedure should be pretty straightforward. If you run into any issues or questions, please don’t hesitate to file a support ticket with NVIDIA Enterprise Support for assistance.

Chris

May I upgrade only nvsm? I’d rather not upgrade the kernel at the moment (it would break our Lustre module).

Hi @m.kadlof,

Yes, that should be okay. Running apt-get install nvsm should do what you need.
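Since the concern is the kernel, it may be worth simulating the install first and checking that no kernel packages are in the plan. A minimal sketch: kernel_changes is a hypothetical helper, and it assumes kernel packages follow the usual Ubuntu linux-* naming (linux-image-*, linux-headers-*, linux-modules-*).

```shell
# kernel_changes
# Reads "apt-get -s ..." (simulate) output on stdin and prints the planned
# install/upgrade lines ("Inst ...") that touch linux-* packages.
# Prints nothing when no kernel package would change.
kernel_changes() {
    grep -E '^Inst linux-' || true
}

# Example (not run here):
#   sudo apt-get update
#   apt-get -s install nvsm | kernel_changes    # no output means kernel untouched
#   sudo apt-get install nvsm
```

The -s run changes nothing on the system, so it is safe to repeat until the plan looks right.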