Greetings,
I have an issue with a Titan X GPU.
We use the GPU for compute workloads, and it has started to produce wrong results. We installed it in 2 other machines and compared it with another GPU in each of them. The problem follows this card: it malfunctions in every machine we tried.
We run the following standard test code, which is publicly available on the internet:
python train.py -net resnet50 -gpu
With other GPUs of the same Titan X model we get the expected output (left part of the picture), where the loss values decrease. With this card we instead get the error Loss:nan (right part of the picture).
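In case it helps anyone compare runs across cards, here is a minimal sketch for finding the first point where the loss stops being a finite number in captured training output. It assumes each log line contains "Loss:" followed by a value, as in the output above. The function name `first_bad_loss` and the sample lines are hypothetical and not part of train.py.

```python
import math
import re

# Assumption: each training log line contains "Loss:" followed by a value,
# with optional whitespace in between (e.g. "Loss: 4.6052" or "Loss:nan").
LOSS_PATTERN = re.compile(r"Loss:\s*([^\s,]+)")


def first_bad_loss(lines):
    """Return (line_number, raw_value) of the first non-finite loss, or None."""
    for number, line in enumerate(lines, start=1):
        match = LOSS_PATTERN.search(line)
        if match is None:
            continue  # line carries no loss value
        raw = match.group(1)
        try:
            value = float(raw)  # float() accepts "nan" and "inf"
        except ValueError:
            continue  # not a number at all, skip it
        if not math.isfinite(value):
            return number, raw
    return None


if __name__ == "__main__":
    # Hypothetical sample lines, only to illustrate the call.
    sample = ["Epoch 1 Loss: 4.6052", "Epoch 1 Loss:nan"]
    print(first_bad_loss(sample))
```

Running the same check on the output from a known-good card and from the suspect card shows whether the loss is non-finite from the very first iteration or only after some steps.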
Can you please advise whether this means the Titan X is faulty and should be sent for repair, or whether something else could be the cause?