Is my card burned?

Hello,

A strange issue appeared on my Titan card yesterday. At the beginning of my code I check several times how much RAM is used. I get the following:

6201464 KB free of total 6291264 KB at the beginning
...
2007160 KB free of total 2096960 KB before the cufft plans are made.
...
6135928 KB free of total 6291264 KB after the cufft plans are made.

Suddenly it appears that I have only 2 GB of RAM. There are no CUDA calls between the first two queries.

Moreover, after this my code started to crash periodically, each time after running for almost the same amount of time. I have an Asus P79 motherboard, one 660 Ti and one Titan (both EVGA). What can I do to check my card? The temperature at the time of the crash is about 80 C, but it stays at that level for quite a long time.

I’m assuming you tried rebooting already?

Maybe try using something like this?
http://wili.cc/blog/gpu-burn.html

Hello,

Thank you for your reply. I now think the crashes and the memory query might not be related. I got the same error on 3 different cards: two 660 Ti and the Titan. The crash occurs when the temperature has been around 80 C for a longer time. The crash is related to a kernel:

kupdt<<<ggrid, tthreads>>>((cufftDoubleComplex*)dppsi, (cufftDoubleComplex*)dbbff, ddqq, totsize_invspa, totsize, rr, ddt, q0);

__global__ void kupdt(cufftDoubleComplex *A, cufftDoubleComplex *B, const double *C, int totsize_inspa, int totsize, const double rr, double ddt, const double q0)
{
    int ibl = blockIdx.y + blockIdx.x*gridDim.y;
    int ind = threadIdx.x + blockDim.x*ibl;
    double qq, fff, ccc;
    if (ind < totsize_inspa)
    {
        qq = C[ind];
        ccc = rr + (1.0 + 2.0*qq + qq*qq)*(q0*q0*q0*q0 + 2.0*qq*q0*q0 + qq*qq);
        fff = (1.0/(1.0 - ddt*ccc*qq))/((double) totsize);
        A[ind].x = A[ind].x*fff + B[ind].x*qq*ddt*fff;
        A[ind].y = A[ind].y*fff + B[ind].y*qq*ddt*fff;
    }
}
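One way to separate a code bug from a hardware fault is to compare the kernel's output against a host-side reference. The following is only a sketch: `Complex` is a hypothetical stand-in for cufftDoubleComplex, and `kupdt_reference` and `max_abs_diff` are illustrative names that are not part of the original program. The arithmetic mirrors the kernel above, with one loop iteration per GPU thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cufftDoubleComplex (real part x, imaginary part y).
struct Complex {
    double x;
    double y;
};

// Host-side mirror of the kupdt kernel; A is updated in place.
inline void kupdt_reference(std::vector<Complex>& A,
                            const std::vector<Complex>& B,
                            const std::vector<double>& C,
                            int totsize_inspa, int totsize,
                            double rr, double ddt, double q0)
{
    const std::size_t n = static_cast<std::size_t>(totsize_inspa);
    assert(A.size() >= n && B.size() >= n && C.size() >= n);
    for (std::size_t ind = 0; ind < n; ++ind) {
        const double qq  = C[ind];
        const double ccc = rr + (1.0 + 2.0*qq + qq*qq)
                              * (q0*q0*q0*q0 + 2.0*qq*q0*q0 + qq*qq);
        const double fff = (1.0 / (1.0 - ddt*ccc*qq)) / static_cast<double>(totsize);
        A[ind].x = A[ind].x*fff + B[ind].x*qq*ddt*fff;
        A[ind].y = A[ind].y*fff + B[ind].y*qq*ddt*fff;
    }
}

// Largest absolute component difference between two arrays of equal length.
inline double max_abs_diff(const std::vector<Complex>& a,
                           const std::vector<Complex>& b)
{
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i].x - b[i].x));
        worst = std::max(worst, std::fabs(a[i].y - b[i].y));
    }
    return worst;
}
```

In use, the array copied back from the GPU would be compared with the reference result via `max_abs_diff`. A small tolerance is advisable rather than exact equality, since the GPU compiler may fuse multiplies and adds and so round differently from the host. Differences far above rounding level that appear only when the card is hot would point toward hardware rather than code.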

I am having problems reproducing the error in the debugger. The error is an unspecified launch failure, but one time I managed to get a report that some address is misaligned. My variables dppsi and dbbff are declared as cufft double arrays (hence the casts in the launch above) because I wanted to do in-place transforms in order to save memory.
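On the misaligned address report: cufftDoubleComplex is a double2, which CUDA declares with 16-byte alignment, while a plain double only needs 8 bytes. A double pointer cast to cufftDoubleComplex* is therefore safe only if its address is a multiple of 16. The helpers below are a sketch for checking this on the host; the names `is_complex_aligned` and `offset_keeps_alignment` are illustrative and not from the original code.

```cpp
#include <cstddef>
#include <cstdint>

// Alignment required by cufftDoubleComplex (a double2) in CUDA.
constexpr std::size_t kComplexAlign = 16;

// True if the address can be used as a 16-byte-aligned complex pointer.
inline bool is_complex_aligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % kComplexAlign == 0;
}

// True if adding this many doubles to a 16-byte-aligned base pointer
// keeps the result 16-byte aligned (that is, the offset is even).
inline bool offset_keeps_alignment(std::size_t offset_in_doubles)
{
    return (offset_in_doubles * sizeof(double)) % kComplexAlign == 0;
}
```

Pointers returned directly by cudaMalloc are aligned far more strictly than this, so the check matters mainly when the complex pointer is derived from a base pointer plus an offset counted in doubles. An odd offset would give exactly this kind of misaligned access.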

Here are some errors I got from cuda-memcheck:

========= CUDA-MEMCHECK
========= Unknown Error
========= Unknown Error
=========
========= Unknown Error
=========
========= Program hit error 4 on CUDA API call to cudaFuncSetCacheConfig 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaFuncSetCacheConfig + 0x221) [0x2efb1]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e4416]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3977ac]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f44]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e2c84]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e45ae]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3977ac]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f44]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaGetLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaGetLastError + 0x1e6) [0x2a046]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e43ba]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3977ac]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f44]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./a.out [0x2b69]
=========     Host Frame:./a.out [0x2f99]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e4791]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3a4cd9]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./a.out [0x2e75]
=========     Host Frame:./a.out [0x2fdf]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e1f4c]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3978e3]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f36]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e1f4c]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3978e3]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f44]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./a.out [0x2b69]
=========     Host Frame:./a.out [0x2f99]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e4791]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3a4cd9]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./a.out [0x2e75]
=========     Host Frame:./a.out [0x2fdf]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e1f4c]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3978e3]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f36]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e1f4c]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3978e3]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f44]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./a.out [0x2b69]
=========     Host Frame:./a.out [0x2f99]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e4791]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3a4cd9]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./a.out [0x2e75]
=========     Host Frame:./a.out [0x2fdf]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e1f4c]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3978e3]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f36]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e1f4c]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3978e3]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f44]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./a.out [0x2b69]
=========     Host Frame:./a.out [0x2f99]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e4791]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3a4cd9]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./a.out [0x2e75]
=========     Host Frame:./a.out [0x2fdf]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e1f4c]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3978e3]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f36]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e1f4c]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3978e3]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f44]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./a.out [0x2b69]
=========     Host Frame:./a.out [0x2f99]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e4791]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3a4cd9]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaLaunch 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./a.out [0x2e75]
=========     Host Frame:./a.out [0x2fdf]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]
=========
========= Program hit error 4 on CUDA API call to cudaPeekAtLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26b0b0]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaPeekAtLastError + 0x1e6) [0x2a2e6]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x4e1f4c]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x3978e3]
=========     Host Frame:/usr/local/cuda-5.0/lib64/libcufft.so.5.0 [0x30735c]
=========     Host Frame:./a.out [0x2f36]
=========     Host Frame:./a.out [0x1a54]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./a.out [0x24f1]

I fixed the memory query by changing the format specifier from %d to %llu, but the crashes still occur. Meanwhile I also reinstalled the system and installed the 313 driver.
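The numbers in the first post are consistent with a 64-bit byte count being read as a 32-bit value: 6291264 KB minus 2096960 KB is exactly 4194304 KB, that is, 2^32 bytes. The sketch below reproduces that arithmetic on the host and shows a format that prints the full value. `low32` and `format_mem` are illustrative names, and the assumption is that the program printed byte counts of the kind returned by cudaMemGetInfo.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Keep only the low 32 bits of a byte count, which is what effectively
// happens when a 64-bit value is read through a 32-bit format such as %d.
inline std::uint32_t low32(unsigned long long bytes)
{
    return static_cast<std::uint32_t>(bytes);
}

// Format a free/total pair in KB using %llu, so values above 4 GB survive.
inline std::string format_mem(unsigned long long free_bytes,
                              unsigned long long total_bytes)
{
    char buf[96];
    std::snprintf(buf, sizeof buf, "%llu KB free of total %llu KB",
                  free_bytes / 1024ULL, total_bytes / 1024ULL);
    return std::string(buf);
}
```

With the Titan's reported total of 6291264 KB, `low32(6291264ULL * 1024) / 1024` gives 2096960, and the free value 6201464 KB wraps to 2007160 in the same way. Both match the odd middle line of the original output, so that line says nothing about the health of the card.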

My programs crash when the temperature reaches 80 C. Can I do something to make the fans spin faster so that it never reaches that temperature?

It is probably easier and more effective to improve the airflow inside your case so the GPU fans don’t have to do as much.

I would try to clear space around both sides of the GPU card itself, make sure cables don’t obstruct the air flow, and see if you can put a 120mm fan on the front of the case that will blow over the GPU. A barrier to direct hot air from the CPU fan away from the GPU might also help.

Incidentally, I would also contact EVGA and see if they are concerned by this behavior. I’ve never seen problems with a GPU at 80C.

Based on my tests over the last few days, it appears the card has some problems. I just need to find a test that crashes consistently so that I can send the card back for repair or replacement. On the other hand, if the problems start to appear only around 80 C, then if I could just run the fans at a higher speed and keep the temperature below 70 C, I think I would be OK.

Before I send it to them I need to rule out any software problem. Is there any way to uninstall the nvidia driver completely and then reinstall it?

I wouldn’t think it’s a software problem. I’d go with seibert’s solution and aim a fan at your GPU, or just RMA it if you can’t test it on another system to verify the same issue isn’t present. Perhaps a possible solution to avoid overheating would be something like this if it would fit in your case:

http://www.google.com/#q=pci+slot+cooler

If you have that card as your primary one connected to a display, you should be able to enable a manual fan control setting based on Coolbits as documented here:
http://www.evga.com/forums/tm.aspx?m=1144263

It might be possible to enable manual fan control even if not connected to a display, but it probably involves a bit of trickery.

Edit: Actually here is another user that got manual fan controls to work even without a display connected to his GTX Titan:

https://devtalk.nvidia.com/default/topic/542275/geforce-titan-throttles-itself-at-80c

Thanks for the pci slot cooler suggestions. I will try one, I have some slots free on the motherboard.

3 x PCIe 3.0/2.0 x16 (dual x16 or x16, x8, x8)
2 x PCIe 2.0 x1

I suspect there is something wrong in my configuration as well. It appears there is no difference in performance when I tick the enable double precision box.

Hello,

Thank you for your replies. I was away for 2 weeks, so I was not able to try everything. So far I have figured out that the problem is with power management. When the card gets hot it runs at maximum speed for a while and then, instead of increasing the fan RPM, just switches to a lower power state. When this happens the driver seems to crash, and a restart is needed to be sure that subsequent programs run correctly. I tried to edit xorg.conf to make it believe that an external monitor is attached to the card and to activate Coolbits, but so far I was unsuccessful. I also tried connecting a second monitor to the card, but with no results. My only success was changing the power mode from adaptive to maximum performance. With this setting the programs ran without problems for a long time. I ordered a PCI slot cooler, and together with maximum performance mode I hope it will be enough.

Glad you were able to troubleshoot the issue! This sounded like a very scary problem.

The trick with the constant fan speed is useful for servers, but it wears out the GPU fans and it is pretty loud for home use. I was able to do it only for the main card.

My programs are based on cufft and use a lot of memory, so the problem might also have something to do with cufft.

Hello,

I am writing back because I managed to get something working on my dual-card computer which might be useful to you. I followed the instructions in this link: http://blog.cryptohaze.com/2011/02/nvidia-fan-speed-control-for-headless.html

It requires an extra monitor. I connected the second monitor to my TITAN and opened nvidia-settings with sudo. There I went to the X Server Display Configuration page and changed from TwinView to a separate X screen. Save the xorg.conf and add the Coolbits line to it. After a restart you have the option to control the fan of the TITAN, and you can then remove the second monitor. At this moment I am not sure whether the second monitor is needed at each restart, but for me that is OK, because I have 2 anyway. I am now running an iterative cufft-based algorithm on an 8192x8192 matrix and the temperatures are around 50 C, while before they were at about 80 C.

Hello,

Is there any way to test the card?