Display driver has stopped responding and has recovered

This situation arose while I was continuing my experiments with acceleration structures; see https://devtalk.nvidia.com/default/topic/756680/optix/nontrivial-acceleration-is-applied-to-scenes-consisting-of-triangles-only-/
My platforms:
P1: desktop: Win 8.1, x64, VS2010, OptiX 3.5.1, CUDA 5.5, GeForce GTX 560 ti, 448 cores, Driver 320.57.
P2: notebook: Win 8.1, x64, VS2010, OptiX 3.5.1, CUDA 5.5, GeForce GT 650M, 384 cores, Driver 327.23.

I think of P1 as a cluster of Np = 448/32 = 14 processors working on common memory. Analogously, P2 “has” Np = 384/32 = 12 processors. This is my rough view of the environment. OptiX’s manager uses No processors, so OptiX runs no more than Np-No paths simultaneously.

In certain experiments, my program stops and I see a message: “Display driver has stopped responding and has recovered”.
The scene consists of 81 transparent cubes defined as triangles (12 triangles each). Just before the call to
m_context->launch( 0, width, height, SPECTRUM_SIZE );
I request and print AvailableDeviceMemory.
Additionally I set
m_context->setTimeoutCallback(timeoutCallback, 100.0);
Examples:
E1. Platform P1.
E1.1. Sbvh,Bvh, treeDepth = 10
Memory: getAvailableDeviceMemory 684453888
timeoutCallback // 100 secs
timeoutCallback // 100 secs more
So, works fine.

E1.2. NoAccel,NoAccel, treeDepth = 6
getAvailableDeviceMemory 397721600 (Compile and Run test in Debug mode)
timeoutCallback
timeoutCallback
timeoutCallback
timeoutCallback // 400 secs
Works fine.

E1.3. NoAccel,NoAccel, treeDepth = 10
Memory: getAvailableDeviceMemory 684453888
Unexpected message:
“Display driver has stopped responding and has recovered”
VS2010 shows the lines 1656-1660 of optixpp_namespace.h, i.e.

inline void ContextObj::checkError(RTresult code) const
  {
    if( code != RT_SUCCESS && code != RT_TIMEOUT_CALLBACK )
      throw Exception::makeException( code, m_context );
  }
		code	-1	RTresult

It seems the break occurred at the very beginning of launch().

E2. Platform P2. Some of tests run fine, but:
E2.1. Sbvh,Bvh, treeDepth = 10
Memory: getAvailableDeviceMemory 1842728960
Unexpected message:
“Display driver has stopped responding and has recovered”

E2.2. NoAccel,NoAccel, treeDepth = 10
Memory: getAvailableDeviceMemory 1842728960
Unexpected message:
“Display driver has stopped responding and has recovered”

In my opinion (and in theory), the stack size used during path tracing is the same regardless of the acceleration technique used. Therefore, it is not a stack problem. Moreover, platform P2 has even more memory.

Has anyone received similar messages? Please help me solve this problem.

sudak

Forum message: “It appears you may have included a phone number in your post. It is against forum policy to post phone numbers. Your post will be reviewed and may be edited or removed.”
No phone numbers included.

sudak

That is the Timeout Detection and Recovery mechanism in WDDM.
Windows OSes beginning with Vista have a kernel module timeout of 2 seconds.
If a kernel module does not react in that time, the OS considers it hung and restarts it.
Under Windows XP I think the timeout was 15 seconds, but with a blue screen as the result.

There have been multiple entries about exactly this, please search the forum first.

If your ray tracing kernel has a single thread which takes longer than two seconds, the display driver is affected by this TDR mechanism.

Mind that setTimeoutCallback is meant to prevent this situation by remaining below that two-second threshold. You can’t set it to 100 seconds and expect it to override that OS behavior. Try setting it to below a second.
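For illustration, here is a minimal sketch of a callback that could be registered this way. It assumes the OptiX 3.x convention that the callback takes no arguments and returns an int, with a nonzero return asking OptiX to abort the running launch (which is why checkError above lets RT_TIMEOUT_CALLBACK through). The names g_abortRequested and g_callbackCount are hypothetical, and std::atomic needs a C++11 compiler, so treat this as a sketch rather than drop-in code for VS2010.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical flag: set to true from the main/UI thread to cancel a launch.
static std::atomic<bool> g_abortRequested(false);
// Hypothetical counter: how many times OptiX has polled the callback.
static std::atomic<int> g_callbackCount(0);

// Assumed callback shape for OptiX 3.x: int (*)(void).
// Return 0 to let the launch continue, nonzero to request an abort.
int timeoutCallback()
{
    ++g_callbackCount;
    return g_abortRequested.load() ? 1 : 0;
}

// Registration as used in this thread, with a polling interval well
// below the 2 second TDR limit:
//   m_context->setTimeoutCallback(timeoutCallback, 0.5);
```

The polling interval is the part that matters for TDR; the abort flag is only a convenience for cancelling long renders.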

Other ways to solve this:

  • Do less work more often.
  • Use faster GPUs.
  • Use a Tesla board in TCC driver mode, they are not affected by the WDDM timeout.
  • Increase the TDR timeout of the OS. (Search the web or CUDA docs, needs a new registry entry and reboot. Not allowed for shipping apps.)
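The first suggestion, doing less work more often, could be sketched as splitting one large launch into several small tiles, each of which should finish well inside the TDR limit. The code below is only the host-side tiling logic under that assumption; Tile, makeTiles and renderTiled are hypothetical helper names, and the actual launch is passed in as a callable so the sketch stays independent of OptiX.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One rectangular part of the full launch grid (hypothetical helper type).
struct Tile { unsigned x, y, w, h; };

// Split a width x height grid into tiles of at most tileW x tileH.
// Tiles on the right and bottom edges are clipped to the grid.
std::vector<Tile> makeTiles(unsigned width, unsigned height,
                            unsigned tileW, unsigned tileH)
{
    assert(tileW > 0 && tileH > 0);
    std::vector<Tile> tiles;
    for (unsigned y = 0; y < height; y += tileH) {
        for (unsigned x = 0; x < width; x += tileW) {
            Tile t;
            t.x = x;
            t.y = y;
            t.w = std::min(tileW, width - x);
            t.h = std::min(tileH, height - y);
            tiles.push_back(t);
        }
    }
    return tiles;
}

// Run one small launch per tile instead of one big launch.
void renderTiled(unsigned width, unsigned height,
                 unsigned tileW, unsigned tileH,
                 const std::function<void(const Tile&)>& launchTile)
{
    const std::vector<Tile> tiles = makeTiles(width, height, tileW, tileH);
    for (std::size_t i = 0; i < tiles.size(); ++i)
        launchTile(tiles[i]);
}
```

In an OptiX 3.x program, launchTile might set a user-declared offset variable (say "tile_origin", a hypothetical name that the ray generation program would add to its launch index) via m_context["tile_origin"]->setUint(tile.x, tile.y) and then call m_context->launch(0, tile.w, tile.h, SPECTRUM_SIZE). Smaller tiles lower the risk of hitting the timeout at the cost of more launch overhead.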

(The forum software seems to interpret your decimal numbers in the memory statistics as telephone numbers.)

Thank you!!!

Setting
m_context->setTimeoutCallback(timeoutCallback, 0.5);
solved my headache. But the calculation time increased dramatically.
Since my current interest is debugging the computational part, your suggestion is a valuable help.
I’ll follow your other suggestions later.

sudak