I’m currently working on applications that are resilient to GPU hardware failures, and I’d like to know whether there are any tools or a formal way of going about this. My setup so far is to start a long-running computation on the GPU (in this case, multiplication of large matrices) and then turn the GPU off before the computation can complete. I do this because I’d like to simulate, as accurately as possible, the GPU falling off the bus during computation. I’m on Linux, and the command I use to turn off the GPU is:
setpci -s <deviceId> command=0000
This causes control to return immediately to the host, where I use an NVML query to check whether the GPU is still available. At this point, if the GPU is unavailable, my intention is to probe for another available GPU, or to wait for some period and retry the GPU that failed.
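As a rough illustration of that fallback logic, here is a minimal Python sketch (Python only for brevity). The function name `next_gpu`, its parameters, and the `is_available` callback are all hypothetical; `is_available` stands in for whatever NVML query the application actually uses, and the retry count and wait time are arbitrary.

```python
import time
from typing import Callable, Optional


def next_gpu(failed: int,
             gpu_count: int,
             is_available: Callable[[int], bool],
             retries: int = 3,
             wait_s: float = 5.0,
             sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Pick a GPU index to continue on after GPU `failed` disappeared.

    `is_available` is a hypothetical probe (e.g. a wrapper around an
    NVML query); it is an assumption, not part of any real API.
    """
    # First choice: any other GPU that currently answers the probe.
    for idx in range(gpu_count):
        if idx != failed and is_available(idx):
            return idx
    # Otherwise wait, then re-probe the GPU that failed.
    for _ in range(retries):
        sleep(wait_s)
        if is_available(failed):
            return failed
    # Nothing came back within the retry budget.
    return None
```

Passing `sleep` as a parameter is only there so the waiting can be replaced in tests; a real application would keep the default.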
As it turns out, writing an application that dynamically recovers from such a hard hardware failure is proving to be non-trivial. I initially thought it would simply be a matter of re-executing previously defined functions, but this is not so. Trying to do this has taken me down the path of reading process-specific information from /proc in order to manually close file descriptors and unmap memory. In addition, I’ve had to figure out how to manually unload dynamically linked libraries and terminate pthreads.
I need to do these things manually because libcuda opens files, creates memory mappings and spawns threads without (with good reason!) providing handles to any of them. My only option, therefore, is to use all sorts of sorcery and trickery to gain access to them.
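To make the /proc lookup concrete, here is a minimal Python sketch that lists the descriptors of the current process whose targets look like NVIDIA device nodes. The function name `find_fds` and its parameters are hypothetical, and the `/dev/nvidia` prefix is an assumption about how the device nodes are named on the system. The sketch only finds the descriptors; actually closing them behind the library's back is the risky part and is left out.

```python
import os
from typing import Dict


def find_fds(prefix: str = "/dev/nvidia",
             fd_dir: str = "/proc/self/fd") -> Dict[int, str]:
    """Map open descriptor numbers to their targets, filtered by prefix.

    Each entry in /proc/self/fd is a symlink named after the descriptor
    number and pointing at the file it refers to.
    """
    found = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith(prefix):
            found[int(name)] = target
    return found
```

The memory mappings can be found the same way by reading /proc/self/maps and filtering on the path column.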
The nvidia module cannot be reloaded while any process holds an active reference to it (which is the case if I don’t terminate my application). After doing all of the above, I’m finally able to get the module to reload, but all is not well: any call to a CUDA function after this causes a segmentation fault. I believe this has something to do with invalid references in the symbol table somewhere (if anyone has any idea how to fix this, PLEASE let me know).
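For checking whether anything still holds a reference to the module before attempting a reload, a small sketch along these lines may help. It reads the `refcnt` file that Linux exposes under /sys/module for loadable modules; the function name `module_refcount` and its parameters are hypothetical. Note that the count also includes other kernel modules that depend on this one, so a non-zero value does not necessarily point at the application.

```python
import os
from typing import Optional


def module_refcount(module: str = "nvidia",
                    sys_module_dir: str = "/sys/module") -> Optional[int]:
    """Return the module's reference count, or None if it cannot be read."""
    path = os.path.join(sys_module_dir, module, "refcnt")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        # Module not loaded, or the kernel does not expose a refcnt file.
        return None
```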
So at this point, things have gotten too technical and feel non-portable, so I began wondering whether there is a formal way of simulating this kind of failure and recovering from it. If anyone knows of anything, a pointer in that direction would be much appreciated. Thanks.