I recently ran into a really terrible problem that almost drove me crazy.
I am trying to use DIGITS as my deep learning framework on a computer with two Titan X cards installed. However, whenever I run the training program, the GPU reboots automatically. Almost every time!!!
I chose the LeNet model (published in 1998) for classification.
Before the program crashed, I saw that GPU memory utilization was 95%. Is that normal? The Titan X has 12 GB of global memory, and I'm fairly sure the model is not that large.
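As a rough sanity check on the model size, here is a small parameter count. It assumes the layer sizes of Caffe's MNIST LeNet example (1x28x28 input, conv 20@5x5, 2x2 pool, conv 50@5x5, 2x2 pool, fully connected 500, fully connected 10); the network DIGITS actually builds may differ, so treat this as a sketch.

```python
# Rough parameter count for a LeNet-style network.
# ASSUMPTION: layer sizes follow Caffe's MNIST LeNet example
# (1x28x28 input, conv 20@5x5, pool 2, conv 50@5x5, pool 2, fc 500, fc 10).

def conv_params(in_ch, out_ch, k):
    # weights (out_ch * in_ch * k * k) plus one bias per output channel
    return out_ch * in_ch * k * k + out_ch

def fc_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

layers = {
    "conv1": conv_params(1, 20, 5),        # 28x28 -> 24x24, pooled to 12x12
    "conv2": conv_params(20, 50, 5),       # 12x12 -> 8x8, pooled to 4x4
    "ip1": fc_params(50 * 4 * 4, 500),     # flattened 800 features -> 500
    "ip2": fc_params(500, 10),             # 500 -> 10 classes
}

total = sum(layers.values())
bytes_fp32 = total * 4  # 4 bytes per float32 parameter

for name, n in layers.items():
    print(f"{name}: {n:,} parameters")
print(f"total: {total:,} parameters, about {bytes_fp32 / 2**20:.2f} MiB in float32")
```

Under these assumptions the weights come to roughly 431,000 parameters, under 2 MiB in float32, which is a tiny fraction of the card's 12 GB.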
Has anyone run into a similar problem, or can anyone figure out what is happening? I don't think this is a software problem.
BTW, I don't use 2-way SLI to connect the two GPUs. Is 2-way SLI required to build a multi-GPU system with DIGITS?