I have been trying to run the entire set of tests using nvvs. I’ve tried both command-line options and configuration files, but neither method runs more than the basic Deployment set of tests.
It seems that no matter what I try, nvvs is intent on skipping most of the tests.
This is on CentOS Linux, and I have 8 GPUs installed: four TITAN Xp and four TITAN X (Pascal).
For example:
[~]# nvvs -g
DCGM GPU Diagnostic (version 387.40)
Supported GPUs available:
[00000000:04:00.0] – TITAN Xp
[00000000:05:00.0] – TITAN Xp
[00000000:08:00.0] – TITAN Xp
[00000000:09:00.0] – TITAN Xp
[00000000:84:00.0] – TITAN X (Pascal)
[00000000:85:00.0] – TITAN X (Pascal)
[00000000:88:00.0] – TITAN X (Pascal)
[00000000:89:00.0] – TITAN X (Pascal)
[~]# nvvs -t
DCGM GPU Diagnostic (version 387.40)
Tests available:
PCIe – This plugin will exercise the PCIe bus for a given list of GPUs.
Targeted Stress – This plugin will keep the list of GPUs at a constant stress level.
Targeted Power – This plugin will keep the list of GPUs at a constant power level.
Memory – This plugin will test the memory of a given GPU.
Deployment – Software deployment checks plugin.
Diagnostic – This plugin will stress the framebuffer of a list of GPUs.
Memory Bandwidth – This plugin will test the memory bandwidth of a list of GPUs.
SM Stress – This plugin will keep the SMs on the list of GPUs at a constant stress level.
[~]# nvvs -i 0 --specifiedtest Memory
DCGM GPU Diagnostic (version 387.40)
Deployment
Blacklist ......................................... PASS
NVML Library ...................................... PASS
CUDA Main Library ................................. PASS
Permissions and OS-related Blocks ................. PASS
Persistence Mode .................................. PASS
Environmental Variables ........................... PASS
Page Retirement ................................... PASS
Graphics Processes ................................ PASS
Inforom ........................................... PASS
Custom
Memory GPU0 ....................................... SKIP
*** The Memory test is skipped for this GPU.
[~]# nvvs -i 0 --specifiedtest Diagnostic
DCGM GPU Diagnostic (version 387.40)
Deployment
Blacklist ......................................... PASS
NVML Library ...................................... PASS
CUDA Main Library ................................. PASS
Permissions and OS-related Blocks ................. PASS
Persistence Mode .................................. PASS
Environmental Variables ........................... PASS
Page Retirement ................................... PASS
Graphics Processes ................................ PASS
Inforom ........................................... PASS
Custom
Diagnostic ........................................ SKIP
*** The Diagnostic is skipped for this GPU.
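
To compare runs across tests and GPUs without reading every line, the output can be tallied by status. The following is only a minimal sketch that assumes the text layout shown above (a name, a run of dots, then a status word). The names `summarize` and `RESULT_LINE` are hypothetical helpers and not part of nvvs or DCGM.

```python
import re

# Matches result lines such as "Memory GPU0 ....... SKIP".
# Assumption: every result line is "<name> <three or more dots> <STATUS>".
RESULT_LINE = re.compile(r"^(?P<name>.+?)\s*\.{3,}\s*(?P<status>[A-Z]+)\s*$")


def summarize(output: str) -> dict[str, list[str]]:
    """Group check names by status (PASS, SKIP, ...) from nvvs-style text."""
    summary: dict[str, list[str]] = {}
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            summary.setdefault(match["status"], []).append(match["name"])
    return summary


# Sample text copied from the run above (shortened).
sample = """\
Deployment
Blacklist ......................................... PASS
Inforom ........................................... PASS
Custom
Memory GPU0 ....................................... SKIP
*** The Memory test is skipped for this GPU.
"""

result = summarize(sample)
print(result)
```

Section headings such as "Deployment" and the "***" explanation lines are ignored because they contain no run of dots followed by a status word.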