Hello,
I want to use a server as a development/debugging and benchmarking platform for multiple users.
I’ve installed Windows Server 2008 R2 x64 on a machine with a Tesla C1060 in TCC driver mode (driver 311.35 or 314.07; I tried both).
I ran a test to check whether a debugging session can run while another process tries to run a kernel on the same GPU.
If one process is debugging on the GPU with breakpoints while another process tries to run a kernel on that same GPU, Windows eventually reboots with a BSOD “Uncorrectable Hardware Error”.
Even though I’m aware that GT200 cannot run concurrent kernels, I’m quite surprised by this behavior. I would have expected the “release” kernel to simply wait until the GPU debugging session ended.
Here’s the scenario (reproduced a couple of times):
- debugging a kernel with a breakpoint
- launching another kernel from another process on the same GPU
- both the kernel launch and the debugging session freeze, while Windows keeps running smoothly
- launching nvidia-smi also freezes
- after a minute or so (TdrDelay is set to 60, but I don’t think it’s related to my problem since I’m in TCC driver mode), nvidia-smi returns and shows the GPU utilization with an “Err” status
- the next nvidia-smi run shows 0% utilization on the GPU
- about 30 seconds later, BSOD “Uncorrectable Hardware Error”
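For reference, here is a minimal sketch of what the second process in this scenario could look like. It is only an illustration, not the exact test program: the kernel name `increment`, the `CHECK` macro and the convention of passing the device index as the first command-line argument are all hypothetical. It only uses standard CUDA runtime calls and checks every one of them, so that a failure (or the call on which the process blocks) is easy to spot.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Trivial kernel: each thread increments one element of the array.
__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main(int argc, char **argv)
{
    // Assumption: the first argument is the index of the GPU that is being debugged.
    int device = (argc > 1) ? std::atoi(argv[1]) : 0;
    CHECK(cudaSetDevice(device));

    const int n = 1 << 20;
    int *d_data = NULL;
    CHECK(cudaMalloc(&d_data, n * sizeof(int)));
    CHECK(cudaMemset(d_data, 0, n * sizeof(int)));

    // Launch enough 256-thread blocks to cover all n elements.
    increment<<<(n + 255) / 256, 256>>>(d_data, n);
    CHECK(cudaGetLastError());       // reports launch-time errors
    CHECK(cudaDeviceSynchronize());  // waits for the kernel and reports execution errors

    int first = -1;
    CHECK(cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost));
    std::printf("device %d: data[0] = %d\n", device, first);

    CHECK(cudaFree(d_data));
    return 0;
}
```

Run on an idle GPU, this should print `data[0] = 1` and exit immediately. Started while a debugging session is stopped at a breakpoint on the same device, it is the process that would be expected to either wait or fail with an error code, rather than take the machine down.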
By the way, I have another question:
Is it possible to run two debugging sessions on two different GPUs on the same machine at the same time? Do I need one instance of Nsight Monitor per user, each listening on a unique port?
I have of course checked that each debugging session runs smoothly when run independently.
/!\ Edit:
[…] After a long debugging session with memory errors (Memory Checker enabled), the BSOD appears…
Any tips are welcome.
Antoine