Fatal error: the launch timed out and was terminated

After the code runs for 4 iterations, I am getting this error.
Fatal error: kernel1 (the launch timed out and was terminated at …/simple.cu:498)
*** FAILED - ABORTING
I am calling the kernels in a do-while loop. Part of the code inside the loop is as follows:

     CUDA_SAFE_CALL(cudaMemcpy(u, hu, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     CUDA_SAFE_CALL(cudaMemcpy(du, hdu, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     DisU<<<dimGrid,dimBlock>>>(uo, vo, wo, u, du, p);
     cudaDeviceSynchronize();   // cudaThreadSynchronize() is deprecated
     cudaCheckErrors("kernel1");
     // CUDA_SAFE_CALL(cudaMemcpy(hdu, du, (sizeT), cudaMemcpyDeviceToHost), __LINE__);
     // CUDA_SAFE_CALL(cudaMemcpy(hu, u, (sizeT), cudaMemcpyDeviceToHost), __LINE__);
     CUDA_SAFE_CALL(cudaMemcpy(v, hv, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     CUDA_SAFE_CALL(cudaMemcpy(dv, hdv, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     DisV<<<dimGrid,dimBlock>>>(uo, vo, wo, v, dv, p);
     cudaDeviceSynchronize();
     cudaCheckErrors("kernel2");
     CUDA_SAFE_CALL(cudaMemcpy(dw, hdw, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     CUDA_SAFE_CALL(cudaMemcpy(w, hw, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     DisW<<<dimGrid,dimBlock>>>(uo, vo, wo, w, dw, p);
     cudaDeviceSynchronize();
     cudaCheckErrors("kernel3");
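The definition of `cudaCheckErrors` is not shown in the post. A common definition that matches the error output quoted above (this is a sketch of the usual pattern, not necessarily the poster's exact macro) is:

```cuda
// Sketch of a typical cudaCheckErrors macro: checks the last CUDA runtime
// error after a kernel launch and aborts with a file:line message.
// The poster's actual definition may differ.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define cudaCheckErrors(msg)                                            \
    do {                                                                \
        cudaError_t err = cudaGetLastError();                           \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n",          \
                    msg, cudaGetErrorString(err), __FILE__, __LINE__);  \
            fprintf(stderr, "*** FAILED - ABORTING\n");                 \
            exit(1);                                                    \
        }                                                               \
    } while (0)
```

Note that `cudaGetLastError` only reports a launch error after the kernel has actually failed, which is why the error appears at the check following the offending launch.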

Please help.

Your code (DisU kernel) is taking too long to execute and you are hitting the watchdog.

If you are on Windows, please read this sticky forum thread:

[url]https://devtalk.nvidia.com/default/topic/459869/cuda-programming-and-performance/-quot-display-driver-stopped-responding-and-has-recovered-quot-wddm-timeout-detection-and-recovery-/[/url]

If you are on Linux, please read this help article:

USING CUDA AND X | NVIDIA

It’s painful (I know all too well!), but you now have to learn the fine art of breaking a large problem into multiple small problems.

I always time my kernel launches during development and report the times in a log file. I keep an old computer with an old GPU to test worst-case scenarios.
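A minimal way to do that kind of timing is with CUDA events. The kernel name and launch configuration below are placeholders for your own; the pattern itself is standard:

```cuda
// Sketch: timing a single kernel launch with CUDA events.
// 'myKernel', 'grid', and 'block' stand in for your own kernel and
// launch configuration.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernel<<<grid, block>>>(/* args */);
cudaEventRecord(stop);
cudaEventSynchronize(stop);        // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("myKernel: %.3f ms\n", ms); // log this and watch it relative to the
                                   // watchdog limit (typically a few seconds)

cudaEventDestroy(start);
cudaEventDestroy(stop);
```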

In one situation, I even resorted to having my program at runtime do a small test launch to get a basis time and then extrapolate to the full problem to estimate how much I would have to break down the problem into multiple launches. Ugly but necessary in that case.

The bottom line is that you must immediately develop the habit of doing all CUDA design and coding in such a way that splitting a task into multiple launches is easy. Once you develop this habit, your CUDA life will become a lot simpler.
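One habit that makes splitting easy is to give each kernel an element offset and count, then launch it once per chunk with a synchronization in between. This is only an illustrative sketch: the `offset`/`count` parameters, `process`, and `DisU_chunk` are hypothetical additions, not the poster's actual signatures:

```cuda
// Sketch: process N elements in fixed-size chunks so that no single
// launch runs long enough to trip the watchdog. 'process' is a
// placeholder for the real per-element work.
__global__ void DisU_chunk(float *u, float *du, int offset, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        process(u, du, offset + i);
}

void launch_in_chunks(float *u, float *du, int N)
{
    const int CHUNK = 1 << 20;   // tune so each launch stays well under the timeout
    const int BLOCK = 256;
    for (int offset = 0; offset < N; offset += CHUNK) {
        int count = (N - offset < CHUNK) ? (N - offset) : CHUNK;
        int grid  = (count + BLOCK - 1) / BLOCK;
        DisU_chunk<<<grid, BLOCK>>>(u, du, offset, count);
        cudaDeviceSynchronize();  // gives the display a chance to update between chunks
    }
}
```

Because the grid covers only `count` elements per launch, the chunk size becomes a single tuning knob you can adjust per GPU without touching the kernel logic.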

Or you could use any of the available methods to avoid the watchdog timer so you don't have to chop your computation into little pieces. The traditional advice on Windows is to use a GPU that is in TCC mode. On Linux, various options are covered in the link I already provided.

Hello,
Thanks for responding.
Sorry, I posed the problem incorrectly: I had not posted the entire code.

I have another kernel, DisP<<<>>>, which is also part of the code. After executing it, I get the error mentioned above.

However, when I implement the same computation as a serial host function, the code works.

Part of the code inside the loop is as follows:

     CUDA_SAFE_CALL(cudaMemcpy(u, hu, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     CUDA_SAFE_CALL(cudaMemcpy(du, hdu, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     DisU<<<dimGrid,dimBlock>>>(uo, vo, wo, u, du, p);
     // cudaThreadSynchronize();
     // CUDA_SAFE_CALL(cudaThreadSynchronize(), __LINE__);
     cudaCheckErrors("kernel1");
     // CUDA_SAFE_CALL(cudaMemcpy(hdu, du, (sizeT), cudaMemcpyDeviceToHost), __LINE__);
     // CUDA_SAFE_CALL(cudaMemcpy(hu, u, (sizeT), cudaMemcpyDeviceToHost), __LINE__);
     CUDA_SAFE_CALL(cudaMemcpy(v, hv, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     CUDA_SAFE_CALL(cudaMemcpy(dv, hdv, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     DisV<<<dimGrid,dimBlock>>>(uo, vo, wo, v, dv, p);
     // cudaDeviceSynchronize();
     cudaCheckErrors("kernel2");
     CUDA_SAFE_CALL(cudaMemcpy(dw, hdw, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     CUDA_SAFE_CALL(cudaMemcpy(w, hw, (sizeT), cudaMemcpyHostToDevice), __LINE__);
     DisW<<<dimGrid,dimBlock>>>(uo, vo, wo, w, dw, p);
     // cudaDeviceSynchronize();
     cudaCheckErrors("kernel3");
     /* cudaMemcpy(hu, u, (sizeT), cudaMemcpyDeviceToHost);
        cudaMemcpy(hv, v, (sizeT), cudaMemcpyDeviceToHost);
        cudaMemcpy(hw, w, (sizeT), cudaMemcpyDeviceToHost);
        cudaMemcpy(hdu, du, (sizeT), cudaMemcpyDeviceToHost);
        cudaMemcpy(hdv, dv, (sizeT), cudaMemcpyDeviceToHost);
        cudaMemcpy(hdw, dw, (sizeT), cudaMemcpyDeviceToHost); */
     // pressure();
     CUDA_SAFE_CALL(cudaMemcpy(cp, hcp, sizeT, cudaMemcpyHostToDevice), __LINE__);
     DisP<<<dimGrid,dimBlock>>>(u, v, w, du, dv, dw, cp);
     // cudaDeviceSynchronize();
     cudaCheckErrors("kernel4");   // was labeled "kernel3" twice; renamed for clarity

     cudaMemcpy(hu, u, (sizeT), cudaMemcpyDeviceToHost);
     cudaMemcpy(hv, v, (sizeT), cudaMemcpyDeviceToHost);
     cudaMemcpy(hw, w, (sizeT), cudaMemcpyDeviceToHost);
     cudaMemcpy(hdu, du, (sizeT), cudaMemcpyDeviceToHost);
     cudaMemcpy(hdv, dv, (sizeT), cudaMemcpyDeviceToHost);
     cudaMemcpy(hdw, dw, (sizeT), cudaMemcpyDeviceToHost);

I have checked for memory issues using cuda-memcheck, but it found no memory-related problems.
Please help.

If you get the error after DisP, then the DisP kernel is taking too long. The solution options are the same as I already indicated.