Hi
I’m not sure where to place my question. If this is the wrong place, please forgive me…
I’ve got a problem with a shell script and the “nvidia-smi” command!
I’ve made a script that acts as protection against CPU overheating on my Ubuntu Server 14.04.2. The script works nicely, but I need to make it work on my 4 GPUs as well.
I’m pretty green when it comes to bash scripts, so I’ve been looking for commands that would make it easy for me to edit the script. I found and tested a lot of them, but none seems to give me the output I need! I’ll show you the commands and their output below, and the script as well.
What I need is a command that lists the GPUs the same way the “sensors” command from “lm-sensors” lists the CPU cores, so that I can use “grep” to select a GPU and set the variable “newstr” (the two-digit temperature). I’ve been trying for a couple of days, but have had no luck, mostly because the commands “nvidia-smi -lso” and “nvidia-smi -lsa” don’t exist anymore. I think they were experimental.
Here are the commands I found and tested, with their output:
This command shows the GPU socket number, which I could put into the string “str”, but the problem is that the temperature is on the next line. I’ve been fiddling with grep’s “-A 1” flag, but haven’t been able to work it into the script:
# nvidia-smi -q -d temperature | grep GPU
Attached GPUs : 4
GPU 0000:01:00.0
GPU Current Temp : 57 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
GPU 0000:02:00.0
GPU Current Temp : 47 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
GPU 0000:03:00.0
GPU Current Temp : 47 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
GPU 0000:04:00.0
GPU Current Temp : 48 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
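One possible way to deal with the temperature sitting on the line after the GPU ID is to let awk remember the ID and print it together with the temperature when that line comes along. This is only a sketch: the function name join_gpu_temp is made up for illustration, and it assumes the output keeps the layout shown above.

```shell
#!/usr/bin/env bash
# join_gpu_temp (hypothetical helper): reads "nvidia-smi -q -d temperature"
# output on stdin and prints one "<bus id> <temperature>" line per GPU.
join_gpu_temp() {
    awk '
        # A line with exactly two fields starting with "GPU" is the bus ID line.
        $1 == "GPU" && NF == 2 { id = $2 }
        # The temperature is the field just before the trailing "C".
        /GPU Current Temp/ && $NF == "C" { print id, $(NF - 1) }
    '
}

# Only call the real tool when it is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -q -d temperature | join_gpu_temp
fi
```

With the output above this should give lines such as “0000:01:00.0 57”, which can then be selected with grep on the bus ID.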
This command shows the temperature on the first line, but there’s no GPU number!?
# nvidia-smi -q -d temperature | grep "GPU Current Temp"
GPU Current Temp : 58 C
GPU Current Temp : 47 C
GPU Current Temp : 47 C
GPU Current Temp : 48 C
This command shows only the GPU you select, but the output still doesn’t show the GPU number/socket/ID!?
# nvidia-smi -q --gpu=0 | grep "GPU Current Temp"
GPU Current Temp : 59 C
And this command shows the GPU number and the GPU name in the same row!! But no temperature!!
# nvidia-smi -L
GPU 0: GeForce GTX 750 Ti (UUID: GPU-9785c7c7-732f-1f51-..........)
GPU 1: GeForce GTX 750 (UUID: GPU-b2b1a4a-4dca-0c7f-..........)
GPU 2: GeForce GTX 750 (UUID: GPU-5e6b8efd-7531-777c-..........)
GPU 3: GeForce GTX 750 Ti (UUID: GPU-5b2b1a2f-3635-2a1c-..........)
And here’s a command which shows the temperatures of all 4 GPUs without anything else. But I still need the GPU number/socket/ID!?
# nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader
58
47
47
48
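The same --query-gpu option accepts several comma-separated fields, so asking for index and name next to temperature.gpu should put the GPU number on the same line as the temperature. The sketch below assumes the installed driver supports those fields (“nvidia-smi --help-query-gpu” lists what is available); the function names gpu_temps and temp_of_gpu are made up for illustration.

```shell
#!/usr/bin/env bash
# gpu_temps (hypothetical helper): one "index, name, temperature" line per GPU.
gpu_temps() {
    nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader
}

# temp_of_gpu INDEX (hypothetical helper): reads those CSV lines on stdin and
# prints only the temperature of the GPU with the given index.
temp_of_gpu() {
    awk -F', *' -v want="$1" '$1 == want { print $3 }'
}

# Only call the real tool when it is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_temps
    gpu_temps | temp_of_gpu 0
fi
```

A call like newstr=$(gpu_temps | temp_of_gpu 2) would then set the variable without counting character positions.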
Here’s what I’m wishing for! If I could get a command which produced output like this, I would be the happiest guy around:
GPU 0: GeForce GTX 750 Ti GPU Current Temp : 58 C
GPU 1: GeForce GTX 750 GPU Current Temp : 47 C
GPU 2: GeForce GTX 750 GPU Current Temp : 47 C
GPU 3: GeForce GTX 750 Ti GPU Current Temp : 48 C
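A layout like the one above could probably be produced by reformatting CSV output with awk. This is a sketch under assumptions: it expects “index, name, temperature” lines on stdin (for example from --query-gpu=index,name,temperature.gpu, if the driver supports those fields), and format_gpu_line is a made-up name.

```shell
#!/usr/bin/env bash
# format_gpu_line (hypothetical helper): turns "index, name, temperature"
# CSV lines on stdin into "GPU <index>: <name> GPU Current Temp : <temp> C".
format_gpu_line() {
    awk -F', *' '{ printf "GPU %s: %s GPU Current Temp : %s C\n", $1, $2, $3 }'
}

# Only call the real tool when it is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader |
        format_gpu_line
fi
```

The result can be piped through grep "GPU 0:" in the same way the script greps the “sensors” output for "Core 0:".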
Here’s the output of “sensors” from “lm-sensors”. As you can see, the unit info and the temperature are on the same line:
# -----------------------------------------------------------
# coretemp-isa-0000
# Adapter: ISA adapter
# Physical id 0: +56.0°C (high = +80.0°C, crit = +100.0°C)
# Core 0: +56.0°C (high = +80.0°C, crit = +100.0°C)
# Core 1: +54.0°C (high = +80.0°C, crit = +100.0°C)
# Core 2: +54.0°C (high = +80.0°C, crit = +100.0°C)
# Core 3: +52.0°C (high = +80.0°C, crit = +100.0°C)
# -----------------------------------------------------------
Here’s the part of the script that needs changing. As mentioned at the top, this works using the “sensors” command from “lm-sensors”. “lm-sensors” doesn’t show the GPU temperatures when CUDA is running and the driver is attached, so we need another command to list the GPUs and show their temperatures. If you know another way to fix my problem, please don’t hesitate to show me:
[...]
echo "JOB RUN AT $(date)"
echo "======================================="
echo ''
echo "CPU Warning Limit set to => $1"
echo "CPU Shutdown Limit set to => $2"
echo ''
echo ''
sensors
echo ''
echo ''
for i in 0 1 2 3
do
    # Grab the "Core N:" line and cut the two-digit temperature out of it
    str=$(sensors | grep "Core $i:")
    newstr=${str:17:2}
    # Quoted so that an empty value gives a test error instead of a syntax error
    if [ "${newstr}" -ge "$1" ]
    then
        echo '====================================================================' >>/home/......../logs/watchdogcputemp.log
        echo "$(date)" >>/home/......../logs/watchdogcputemp.log
        echo '' >>/home/......../logs/watchdogcputemp.log
        echo ' STATUS WARNING - NOTIFYING : TEMPERATURE CORE' "$i" 'EXCEEDED' "$1" '=>' "$newstr" >>/home/......../logs/watchdogcputemp.log
        echo ' ACTION : EMAIL SENT' >>/home/......../logs/watchdogcputemp.log
        echo '' >>/home/......../logs/watchdogcputemp.log
        echo '====================================================================' >>/home/......../logs/watchdogcputemp.log
        # Status Warning Email Sending Code
        # WatchdogCpuTemp Alert! Status Warning - Notifying!"
        /usr/bin/msmtp -d --read-recipients </home/......../shellscripts/messages/watchdogcputempwarning.txt
        echo 'Email Sent.....'
    fi
[...]
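A GPU version of the loop above might read CSV lines instead of cutting characters out of a string. The following is a rough sketch, not a drop-in replacement: check_gpu_temps is a made-up name, the default limit of 80 is only a placeholder, the input is assumed to be “index, name, temperature” lines, and the logging and e-mail steps are left out.

```shell
#!/usr/bin/env bash
# check_gpu_temps LIMIT (hypothetical helper): reads "index, name, temperature"
# CSV lines on stdin and prints a warning for each GPU at or above LIMIT.
# Returns 1 if at least one GPU reached the limit, 0 otherwise.
check_gpu_temps() {
    local limit=$1 status=0 idx name temp
    while IFS=, read -r idx name temp; do
        temp=${temp//[!0-9]/}          # keep digits only (drops the leading space)
        [ -n "$temp" ] || continue     # skip lines without a numeric temperature
        if [ "$temp" -ge "$limit" ]; then
            echo "STATUS WARNING : TEMPERATURE GPU $idx EXCEEDED $limit => $temp"
            status=1
        fi
    done
    return "$status"
}

# Only call the real tool when it is installed. The limit is taken from the
# first script argument, like the CPU warning limit; 80 is a placeholder.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader |
        check_gpu_temps "${1:-80}" || echo 'At least one GPU reached the limit'
fi
```

The warning line could be appended to the log file and followed by the msmtp call in the same way as in the CPU loop.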
I hope there’s a bash-script guru out there, ready to solve this issue.
Have a nice weekend!
Kind Regards,
Dan Hansen
Denmark