Dual GPU (2x Titan X), Ubuntu 14.04, driver 352.39: one card fails the ./conjugateGradient sample test.

Hi everyone,

I have a fresh Ubuntu 14.04 install on a PC with two Titan X cards. Immediately after installing Ubuntu 14.04, I followed this thread: https://devtalk.nvidia.com/default/topic/878117 and installed the NVIDIA drivers exactly as discussed there (see the last post in the thread, by NeuroSurfer). Everything worked.

That thread recommends running two of the CUDA samples after the installation:

./deviceQuery

to see the graphics card specs, and

./bandwidthTest

I ran both, and both cards pass both tests.

I use these graphics cards to train neural network models. One of the Titan X cards keeps generating NaNs (not-a-number values), whereas training the exact same model on the other Titan X works as expected, without any errors. After much trial and error, I found that one of the cards fails the ./conjugateGradient CUDA sample, available under
~/NVIDIA_CUDA-7.5_Samples/7_CUDALibraries/conjugateGradient

whereas the other card passes it. The output from the failing card is:

GPU Device 0: "GeForce GTX TITAN X" with compute capability 5.2

> GPU device has 24 Multi-Processors, SM 5.2 compute capabilities

Test Summary:  Error amount = 1.000000

The successful card outputs:

GPU Device 0: "GeForce GTX TITAN X" with compute capability 5.2

> GPU device has 24 Multi-Processors, SM 5.2 compute capabilities

iteration =   1, residual = 4.449882e+01
iteration =   2, residual = 3.245218e+00
iteration =   3, residual = 2.690220e-01
iteration =   4, residual = 2.307639e-02
iteration =   5, residual = 1.993140e-03
iteration =   6, residual = 1.846193e-04
iteration =   7, residual = 1.693379e-05
iteration =   8, residual = 1.600115e-06
Test Summary:  Error amount = 0.000000
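
For anyone who wants to repeat this comparison, here is a rough Python sketch of how each card could be tested in isolation. It assumes the sample has already been built, and that the `CUDA_VISIBLE_DEVICES` environment variable is used to expose one card at a time (a card exposed on its own is enumerated as device 0). The helper names `parse_error_amount` and `run_on_device` are made up for illustration and are not part of the CUDA samples.

```python
import os
import re
import subprocess

# Matches the summary line printed by the sample, e.g.
# "Test Summary:  Error amount = 0.000000"
SUMMARY_RE = re.compile(r"Test Summary:\s+Error amount = ([0-9.eE+-]+)")


def parse_error_amount(output):
    """Return the reported error amount as a float, or None if no summary line is present."""
    match = SUMMARY_RE.search(output)
    return float(match.group(1)) if match else None


def run_on_device(binary, device_index):
    """Run `binary` with only the given GPU visible and return its combined output."""
    # Hypothetical helper: restricts the CUDA runtime to a single card.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(device_index))
    result = subprocess.run([binary], env=env, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return result.stdout


# Example use (the binary path is an assumption; adjust it to the local build):
#   for index in (0, 1):
#       out = run_on_device("./conjugateGradient", index)
#       print(index, parse_error_amount(out))
```

A returned error amount of 0.0 corresponds to the passing output above and 1.0 to the failing one; `None` means the summary line was never printed.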

Also, when I run the

~/NVIDIA_CUDA-7.5_Samples/7_CUDALibraries/conjugateGradientUM

sample, one card succeeds with the message:

Starting [conjugateGradientUM]...
GPU Device 0: "GeForce GTX TITAN X" with compute capability 5.2

> GPU device has 24 Multi-Processors, SM 5.2 compute capabilities

iteration =   1, residual = 4.449882e+01
iteration =   2, residual = 3.245218e+00
iteration =   3, residual = 2.690220e-01
iteration =   4, residual = 2.307639e-02
iteration =   5, residual = 1.993140e-03
iteration =   6, residual = 1.846193e-04
iteration =   7, residual = 1.693379e-05
iteration =   8, residual = 1.600115e-06
Final residual: 1.600115e-06
&&&& uvm_cg test PASSED
Test Summary:  Error amount = 0.000000, result = SUCCESS

The other card fails and outputs:

Starting [conjugateGradientUM]...
GPU Device 0: "GeForce GTX TITAN X" with compute capability 5.2

> GPU device has 24 Multi-Processors, SM 5.2 compute capabilities

Final residual: 0.000000e+00
&&&& uvm_cg test PASSED
Bus error (core dumped)
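
Note that the failing run still prints "&&&& uvm_cg test PASSED" before it crashes, so searching the output for "PASSED" is not enough to tell the two cards apart. A minimal Python sketch of checking how the process ended instead is below. It relies on the POSIX convention that `subprocess` reports a negative return code -N when the child is killed by signal N; the function names are hypothetical.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_and_describe(command):
    """Run `command` (a list of arguments) and describe how it terminated."""
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    return describe_exit(result.returncode)


# Example use (the binary path is an assumption):
#   print(run_and_describe(["./conjugateGradientUM"]))
```

A run that ends with a bus error, like the one above, would be expected to show up here as "killed by SIGBUS" rather than as a normal exit.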

Could someone help me solve this issue? Any help is greatly appreciated. The installation log file: https://www.dropbox.com/s/98sqflyxw7w1zk0/nvidia-installer.log?dl=0

Thanks!