Issue with Titan X GPU: python train.py gives Loss: nan

Greetings,
I have an issue with a Titan X GPU.

We use the GPU for compute workloads. It has started to produce wrong results. We tried installing it in different machines and compared it with another GPU in the other two machines, and it still malfunctions.

We ran the following standard test code, which is publicly available:

git clone https://github.com/weiaicunzai/pytorch-cifar100.git
cd pytorch-cifar100
python train.py -net resnet50 -gpu

We expect to see the output shown in the left part of the picture, which is what we get with other GPUs of the same Titan X model. Instead, we get Loss: nan (right part of the picture), where the loss should show decreasing numbers as in the first example.

Can you please advise whether this means the Titan X is faulty and should be sent for repair, or whether something else could be the cause?
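
To separate a hardware fault from a problem in the training script, a small standalone check may help. The following is only a minimal sketch, assuming PyTorch is installed; the function name check_device and the matrix size and repeat count are illustrative choices, not part of the repository above. It computes the same matrix products on a device and on the CPU in float64, then reports whether the device result contains NaN/Inf and how far it deviates from the reference.

```python
import torch


def check_device(device: str, size: int = 512, repeats: int = 5, seed: int = 0) -> dict:
    """Compare float32 matrix products on `device` with a float64 CPU reference.

    Hypothetical helper for illustration; sizes and repeat count are arbitrary.
    """
    gen = torch.Generator().manual_seed(seed)  # fixed seed so every device sees the same inputs
    worst = 0.0
    finite = True
    for _ in range(repeats):
        a = torch.randn(size, size, generator=gen)
        b = torch.randn(size, size, generator=gen)
        reference = a.double() @ b.double()          # high-precision result on the CPU
        result = (a.to(device) @ b.to(device)).cpu()  # same product on the device under test
        finite = finite and bool(torch.isfinite(result).all())
        error = (result.double() - reference).abs().max().item()
        if error > worst:  # a NaN error never compares greater; `finite` catches that case
            worst = error
    return {"device": device, "all_finite": finite, "max_abs_error": worst}


if __name__ == "__main__":
    print(check_device("cpu"))
    if torch.cuda.is_available():
        for index in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(index)
            print(name, check_device(f"cuda:{index}"))
```

If the suspect card reports all_finite as False, or an error far larger than a known-good card of the same model in the same machine, that would point toward the hardware rather than train.py. A clean result does not prove the card is healthy, since a fault may only appear under sustained load or in other operations.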

I do not know, but I’ve chosen TFLite for inference.
The problem is that most developers train their models in PyTorch but want to run inference in TFLite, and converting from PyTorch to TFLite is problematic.
