launch timed out and was terminated....

Can anyone explain what might be causing this? It doesn’t seem to be compiler or hardware dependent.

0: copyout Memcpy (host=0x7f840030, dev=0x8198000, size=16384) FAILED: 6(the launch timed out and was terminated)


Hi Rob,

Do you have a monitor attached to your GPU? X Windows will kill any long-running job (after roughly 5–15 seconds) if your GPU is attached to a monitor.

  • Mat

Hi Mat,

The job runs for about 2 hours before failing. We do have a diagnostic monitor available in the rack, but I think it’s disconnected. I will check on Monday, but I don’t think it’s this. I did wonder whether it’s because one of my GPUs is sitting idle for too long waiting for the other cards to finish.

I’ll do further tests and let you know what I find.


Mat, thanks for pointing us in the right direction. Although we didn’t have a monitor connected, we were running X. Turning X off resulted in my code hanging rather than crashing. I believe the cause of the hang is a “killer particle” (it’s a Monte Carlo code) that ends up in an infinite loop.

Thanks again,


The same message occurred in the following test case with input N=102400.
However, N=92160 or less runs fine.
My GPU is a GeForce GT 415M and the PVF version is 10.9.
The OS is Windows 7.
The executable was compiled for Compute Capability 2.0 with the CUDA Toolkit 3.1.
Did I miss anything…?

Module kmod
   use cudafor
   Attributes(global) subroutine vaddkernel(A,B,C,N)
     real, device:: A(*), B(*), C(*)
     integer, value:: N
     integer:: I, j
     i = (blockidx%x-1)*512 + threadidx%x
     if ( I <= N )then
       do j=1, N
         C(i)= A(i) + B(i)
       end do
     end if
   End subroutine vaddkernel
End module kmod
program gpu_test
  use dfport
  use kmod
  implicit none
  real, allocatable:: A(:), B(:), C(:), D(:), E(:)
  real, device, allocatable:: AA(:), BB(:), CC(:)
  type(dim3):: dimGrid, dimBlock

  integer:: N, i
  real:: s1, s2

  read(*,*) N
  dimGrid  = dim3(N/512, 1, 1)
  dimBlock = dim3(512, 1, 1)

  allocate(A(N), B(N), C(N), D(N), E(N))
  allocate(AA(N), BB(N), CC(N))

  do i=1,N
    A(i) = real(i)
    B(i) = cos(real(i))
  end do

  call cpu_time(s1)
  call acc_vadd( A, B, D, N)
  call cpu_time(s2)
  print*, "GPU, PGI Accelerator (acc) vadd: ", s2-s1

  call cpu_time(s1)
  call vaddkernel <<<dimGrid,dimBlock>>> ( AA, BB, CC, N)
  call cpu_time(s2)
  print*, "GPU, PGI CUDA Fortran vadd: ", s2-s1
  deallocate(AA, BB, CC)

  do i=1,N
    if( E(i)/=D(i) )then
      print*, "different", i, E(i)-D(i)
    end if
  end do
end program gpu_test

Hi cyFeng,

I suspect it’s the same issue where the OS is killing your long running job. This can occur if your GPU is attached to a monitor.

Note the following code in your kernel:

     if ( I <= N )then
       do j=1, N
         C(i)= A(i) + B(i)
       end do
     end if

You have every thread execute the vector add N times. Granted, you may be doing this on purpose for benchmarking, but it is causing your code to take much longer than it should. To fix it, remove the do loop.
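For reference, a sketch of the kernel with the redundant loop removed, so each thread performs its element’s add exactly once:

```fortran
Attributes(global) subroutine vaddkernel(A,B,C,N)
  real, device:: A(*), B(*), C(*)
  integer, value:: N
  integer:: i
  ! one element per thread; no outer do loop
  i = (blockidx%x-1)*512 + threadidx%x
  if ( i <= N ) then
    C(i) = A(i) + B(i)
  end if
End subroutine vaddkernel
```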

Hope this helps,

Hi Mat,

I ran the benchmark on my notebook, which uses NVIDIA Optimus technology, so the GPU is attached to the monitor. Thank you for the response.

Is there any solution to prevent the running job from being killed by the OS?
I have only the one laptop to test CUDA Fortran… :(

Hi cyFeng,

Do a web search for “CUDA Windows Watchdog Timer” and you’ll find a workaround. However, the workaround requires you to edit your registry and disable the GPU watchdog, leaving your system susceptible to freeze-ups. You can try this, but it’s not recommended. Instead, consider breaking your long-running kernels into smaller, shorter ones, using smaller data sets, or getting a dedicated compute GPU.
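For reference, on Windows the watchdog is controlled by the documented TDR (Timeout Detection and Recovery) registry values under the GraphicsDrivers key. A less drastic tweak than disabling detection outright is to raise the timeout; the snippet below is a sketch of such a change (a reboot is needed for it to take effect, and editing the registry is at your own risk):

```text
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; TdrDelay: seconds a GPU operation may run before a timeout is declared
; (default is 2); 0x3c raises it to 60 seconds
"TdrDelay"=dword:0000003c
; TdrLevel=0 would disable timeout detection entirely (risks system freezes)
; "TdrLevel"=dword:00000000
```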

  • Mat

Hi, Mat

Actually… I’m already breaking the long-running kernels into smaller ones.
That is the spirit of CUDA.
Thanks for your kind response :)
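For readers following along, the chunking approach described above might look something like the host-side loop below. This is a hypothetical sketch (the chunk size, here the N=92160 that ran safely, and the error-check variable are illustrative), not code from the thread:

```fortran
! Hypothetical sketch: split one long-running launch into several
! short ones so each stays well inside the OS watchdog limit.
integer, parameter :: chunk = 92160   ! a size known to run safely
integer :: start, len, ierr
type(dim3) :: grid, block

block = dim3(512, 1, 1)
do start = 1, N, chunk
  len  = min(chunk, N - start + 1)
  grid = dim3((len + 511)/512, 1, 1)   ! ceiling divide for a partial chunk
  ! each launch handles elements start .. start+len-1
  call vaddkernel<<<grid,block>>>( AA(start), BB(start), CC(start), len )
end do
ierr = cudaThreadSynchronize()
```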