TCC Uncorrectable Hardware Error


I want to use a server as a development/debugging and benchmarking platform for multiple users.
I’ve installed Windows Server 2008 R2 x64 on a machine with a Tesla C1060 running the TCC driver (I tried both 311.35 and 314.07).

I ran a test to check whether it’s possible to run a debugging session while another process runs a kernel on the same GPU.

If one process uses the GPU in a debugging session with breakpoints while another process tries to run a kernel on the same GPU, Windows eventually crashes with a BSOD “Uncorrectable Hardware Error” and reboots.

Even though I’m aware that GT200 cannot run concurrent kernels, I’m quite surprised by this behavior. I would have thought that the “release kernel” (the one launched from the non-debug process) would simply be stuck waiting for the GPU debugging session to end.
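
For reference, here is a minimal sketch of how one could check what the runtime reports for each GPU. It only reads the `tccDriver` and `concurrentKernels` fields of `cudaDeviceProp`. As far as I understand, `concurrentKernels` describes concurrency within a single context, so it says nothing definitive about sharing a GPU between a debugged process and another process.

```cuda
// Sketch: print, for every CUDA device, whether it runs the TCC driver
// and whether it reports support for concurrent kernel execution.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "device %d: query failed: %s\n",
                         dev, cudaGetErrorString(err));
            continue;
        }
        // tccDriver and concurrentKernels are integer flags (0 or 1).
        std::printf("device %d: %s (compute %d.%d), TCC driver: %s, "
                    "concurrent kernels: %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    prop.tccDriver ? "yes" : "no",
                    prop.concurrentKernels ? "yes" : "no");
    }
    return 0;
}
```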

Here’s the scenario (reproduced a couple of times):

  • Debug a kernel and stop at a breakpoint.
  • Launch another kernel from another process on the same GPU.
  • The kernel launch and the debugging session both freeze, while Windows itself keeps running smoothly.
  • Launching nvidia-smi at this point also freezes.
  • After a minute or so, nvidia-smi is released and shows the GPU utilization with an “Err” status. (TdrDelay is set to 60, but I don’t think it’s related to my problem since I’m in TCC driver mode.)
  • The next nvidia-smi run shows 0% utilization on the GPU.
  • 30 seconds or so later: BSOD “Uncorrectable Hardware Error”.
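
For anyone who wants to try the second step, the “other process” can be as simple as the sketch below. This is not my actual benchmark: `busyKernel`, the grid size, the iteration count, and the optional device-index argument are all made up for illustration. It just runs some bounded busy work on the chosen GPU and reports any runtime error.

```cuda
// Sketch of a stand-alone "second process": launches one kernel that does
// a bounded amount of busy work on the selected GPU, then waits for it.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Report the failing call and exit from main with a non-zero status.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(e));                      \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: each thread iterates a cheap recurrence and stores
// the result so the loop is not optimized away.
__global__ void busyKernel(float *out, int iterations) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = static_cast<float>(idx);
    for (int i = 0; i < iterations; ++i) {
        v = v * 0.999f + 1.0f;
    }
    out[idx] = v;
}

int main(int argc, char **argv) {
    // Optional first argument: index of the GPU to use (defaults to 0).
    int device = (argc > 1) ? std::atoi(argv[1]) : 0;
    const int blocks = 64;
    const int threads = 128;
    float *d_out = NULL;

    CHECK(cudaSetDevice(device));
    CHECK(cudaMalloc(&d_out, blocks * threads * sizeof(float)));

    busyKernel<<<blocks, threads>>>(d_out, 1 << 20);
    CHECK(cudaGetLastError());       // catches launch configuration errors
    CHECK(cudaDeviceSynchronize());  // blocks until the kernel has finished

    CHECK(cudaFree(d_out));
    std::printf("kernel finished on device %d\n", device);
    return 0;
}
```

Passing a different device index on the command line targets another GPU, which is also how I would keep such a process away from the GPU that is currently being debugged.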

By the way, I have another question:
Is it possible to run two debugging sessions on two different GPUs on the same machine at the same time? Do I need one instance of Nsight Monitor per user, each listening on a unique port?

I have of course checked that independent debugging sessions run smoothly.

Edit:
[…] After a long debugging session with memory errors (Memory Checker activated), the BSOD also appears.

Any tips are welcome.

I think you answered your own question: if GT200 cannot run concurrent kernels, it’s quite possible that’s the reason for the undefined behavior you’re experiencing. As for the second question, I’m not sure, as I haven’t tried it.

Perhaps others can comment on either issue.