How to reset a GTX adapter in Linux?

How do you reset a GTX adapter under Linux without manually unplugging and replugging the GPU’s power cable?

We are supporting a lot of users running CUDA programs and need to find a way to reset the GPUs between user jobs, since we are having difficulty determining when a GPU is in a “bad” state.

If a GPU gets into a “bad” state, it MUST be manually power cycled (power cable unplugged and replugged). The GPUs do not power down or reset when you simply reboot the nodes, so a reboot does not clear the state. For now, power cycling the GPUs is a manual process on the Linux nodes. As a corollary, ONLY unplug and replug a GPU’s power cable when the nodes connected to it are also powered off.

We have not found any good method of identifying when a GPU gets into a “bad” state. The only thing I’ve noticed is that if you reboot the node and run lspci, the output should look similar to:

0c:00.0 3D controller: nVidia Corporation GF100 [M2070] (rev a3)
0c:00.1 Audio device: nVidia Corporation GF100 High Definition Audio Controller (rev a1)
0d:00.0 3D controller: nVidia Corporation GF100 [M2070] (rev a3)
0d:00.1 Audio device: nVidia Corporation GF100 High Definition Audio Controller (rev a1)

If the devices are listed as (rev ff) instead of (rev a3) or (rev a1), then the device will crash if accessed with some CUDA commands, e.g. deviceQuery.
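A rough sketch of how that lspci check might be scripted before a job is allowed to touch the GPU. It only reflects the one symptom described above (a revision of ff), which may not catch every “bad” state. The names find_bad_devices and LSPCI_LINE are made up for illustration, and matching on "nvidia" in the device name is an assumption based on the output shown here.

```python
import re

# Matches lspci lines such as:
#   0d:00.0 3D controller: nVidia Corporation GF100 [M2070] (rev ff)
# LSPCI_LINE is a hypothetical helper name, not part of any tool.
LSPCI_LINE = re.compile(
    r"^(?P<slot>\S+)\s+(?P<cls>[^:]+):\s+(?P<name>.*?)\s+"
    r"\(rev (?P<rev>[0-9a-fA-F]{2})\)\s*$"
)


def find_bad_devices(lspci_text):
    """Return (slot, name) for each NVIDIA device that reports revision ff.

    Assumption: a revision of ff is the only "bad" indicator checked,
    as observed in the lspci output quoted in this post.
    """
    bad = []
    for line in lspci_text.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match is None:
            continue  # line has no "(rev xx)" suffix, or is not a device line
        if "nvidia" not in match.group("name").lower():
            continue  # ignore non-NVIDIA devices
        if match.group("rev").lower() == "ff":
            bad.append((match.group("slot"), match.group("name")))
    return bad


# Example using two lines copied from the outputs in this post.
sample = (
    "0c:00.0 3D controller: nVidia Corporation GF100 [M2070] (rev a3)\n"
    "0d:00.0 3D controller: nVidia Corporation GF100 [M2070] (rev ff)\n"
)
for slot, name in find_bad_devices(sample):
    print(f"{slot}: {name} reports rev ff")
```

In practice the text would come from running lspci on the node (for example via subprocess.run(["lspci"], capture_output=True, text=True)), and a non-empty result could be used to hold the node out of the job queue instead of running deviceQuery on it.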

This morning when I tried to run deviceQuery on c208, the test failed, and when I ran it a second time the node crashed:

c208$ /opt/apps/cuda_SDK/4.0/C/bin/linux/release/deviceQuery
[deviceQuery] starting…
/opt/apps/cuda_SDK/4.0/C/bin/linux/release/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 10
→ invalid device ordinal
[deviceQuery] test results…
FAILED

Press ENTER to exit…

Rebooting the node resulted in the same behavior and lspci showed this:

0d:00.0 3D controller: nVidia Corporation GF100 [M2070] (rev ff)
0d:00.1 Audio device: nVidia Corporation GF100 High Definition Audio Controller (rev ff)
0e:00.0 3D controller: nVidia Corporation GF100 [M2070] (rev ff)
0e:00.1 Audio device: nVidia Corporation GF100 High Definition Audio Controller (rev ff)

Only after I manually unplugged and replugged the power cables of both the node and the GPUs did the devices recover properly.

Has anyone running CUDA programs found a good solution for automatically resetting the GPUs, or for determining when the GPUs are in a “bad” state?

Later,
David