I have a question about some strange behavior I am seeing.
My program launches about 2 million threads, and it used to work well.
But right after I changed my motherboard, the program stopped launching reliably.
Sometimes I get a launch timeout error, and sometimes an unspecified error.
I don't understand why, and I don't know how to handle this problem.
With such a vague description it is impossible to give any sort of firm diagnosis. The motherboard itself is unlikely to be at fault; it is more likely that the problem has something to do with disassembling and re-assembling your system when changing the motherboard. Things to check:
(1) Are all power connectors to the GPU plugged in (depending on the GPU, this may be a 6-pin connector, an 8-pin connector, or both a 6-pin and an 8-pin connector)?
(2) Is the GPU firmly seated in the PCIe slot, and its bracket secured by fastening to the case (this may use a screw, or a latching bar, or some other mechanism)?
(3) Is the GPU plugged into the correct PCIe slot (you would want a x16 slot; there may also be x4 slots)?
To me it looks like it's not slower, but failing. I suggest you enumerate ALL the things that were changed. From my own experience: when adding new HDDs to the system, I switched the GPU to another PCI-E cable from the PSU. The second cable wasn't able to deliver 75W as it should, so I had trouble until I changed it back.
From the description, the most likely cause here is insufficient power supply to the GPU, which causes clock throttling, which in turn slows the kernel down enough that it times out. That's why I listed checking the GPU power connectors as the first item.
Thanks for all the suggestions.
I totally agree with your suggestions. I think it was a problem with the power supply.
It was not a power supply problem.
The errors went away when I changed the "WDDM TDR enabled" option to False in the Nsight options menu.
The default value is True, and WDDM will reset the driver if the GPU does not respond for more than 2 seconds.
My work takes about 3 seconds, so it timed out.
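For anyone who hits the same limit but would rather leave TDR enabled, one common alternative is to split the work into several smaller launches so that each one stays under the watchdog limit. Below is a rough Python sketch of that arithmetic only. The function name `plan_chunks`, the 50% safety margin, and the assumption that runtime scales linearly with the number of threads are illustrative guesses, not something measured on the program discussed here.

```python
import math

def plan_chunks(total_threads, measured_seconds, watchdog_seconds=2.0, safety=0.5):
    """Estimate how many launches are needed to stay under the watchdog limit.

    Assumes (hypothetically) that runtime scales linearly with thread count.
    `safety` is the fraction of the watchdog limit each launch may use.
    Returns (num_launches, threads_per_launch).
    """
    budget = watchdog_seconds * safety              # time allowed per launch
    num_launches = max(1, math.ceil(measured_seconds / budget))
    threads_per_launch = math.ceil(total_threads / num_launches)
    return num_launches, threads_per_launch

# Numbers from this thread: about 2 million threads, about 3 seconds, 2 second limit.
launches, per_launch = plan_chunks(2_000_000, 3.0)
print(launches, per_launch)
```

With those numbers, each launch would target roughly 1 second instead of 3. Whether the real kernel splits this evenly would have to be confirmed by timing it.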
While WDDM TDR can definitely be an issue for long-running kernels, it doesn't explain why the behavior changed after you swapped the motherboard (the WDDM watchdog timer was in effect before and after). My hypothesis was that your GPU is throttling and thus operating more slowly than before, and as a consequence hitting the watchdog timer limit. Throttling can be caused by insufficient power supply or overheating.
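To make the throttling hypothesis concrete, here is a small Python sketch of the arithmetic. It assumes a compute-bound kernel whose runtime is inversely proportional to the core clock, which is a simplification. The 1.5 second baseline and the 50% clock ratio are made-up illustrative values, not measurements from this system, and the function names are hypothetical.

```python
def throttled_runtime(baseline_seconds, clock_ratio):
    """Estimated runtime when the GPU runs at clock_ratio of its normal clock.

    Assumes runtime is inversely proportional to clock speed (compute-bound).
    """
    return baseline_seconds / clock_ratio

def exceeds_watchdog(baseline_seconds, clock_ratio, watchdog_seconds=2.0):
    """True if the throttled kernel would run past the watchdog limit."""
    return throttled_runtime(baseline_seconds, clock_ratio) > watchdog_seconds

# Hypothetical example: a kernel that normally takes 1.5 s, GPU throttled to half clock.
print(throttled_runtime(1.5, 0.5))   # estimated runtime in seconds
print(exceeds_watchdog(1.5, 0.5))    # does it cross the 2 s limit?
print(exceeds_watchdog(1.5, 1.0))    # same kernel at full clock
```

Under these assumed numbers, a kernel that fits comfortably under the limit at full clock would cross it once throttled, which is the kind of before/after change described above.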
I checked the voltage and temperature, but nothing looked unusual. I also changed the HDDs and re-installed the OS and the CUDA toolkit, so my configuration values were reset to their defaults.
I think the problem came from that option value. By the way, I appreciate your suggestions.