Okay, it's interesting that it stops crashing with a stream event. I'm not sure this points in either direction: it might be reducing power consumption and preventing the crash, but it might also suggest that my theory is wrong and that a new PSU won't fix the problem.
I guess it's worth asking whether you know that your PSU is too small for the peak power consumption of all the components in your system. Should your current one be big enough on paper, or is it clear that there might not be enough if the CPU, GPU, disk, and lights are all at full power draw at the same time?
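To make the "on paper" check concrete, here is a minimal Python sketch of the arithmetic. Every wattage below is a made-up placeholder, not a measurement of your system, and the 20% margin is just an assumed rule of thumb; plug in the peak figures from your own spec sheets.

```python
# Hypothetical peak draws in watts. These are placeholders, NOT real
# measurements; replace them with the numbers for your own components.
PEAK_WATTS = {
    "cpu": 125,
    "gpu": 350,
    "disks": 20,
    "fans_and_lights": 30,
    "motherboard_and_ram": 60,
}


def psu_headroom(psu_watts, peak_watts, margin=0.2):
    """Return (total_peak, required, ok).

    total_peak: sum of all component peaks
    required:   total_peak plus an assumed safety margin (20% by default)
    ok:         True if the PSU rating covers the required figure
    """
    total_peak = sum(peak_watts.values())
    required = total_peak * (1 + margin)
    return total_peak, required, psu_watts >= required


if __name__ == "__main__":
    for rating in (650, 750):  # hypothetical PSU ratings
        total, required, ok = psu_headroom(rating, PEAK_WATTS)
        verdict = "enough on paper" if ok else "possibly too small"
        print(f"{rating} W PSU: peak {total} W, want {required:.0f} W -> {verdict}")
```

If the PSU clears the peak sum with margin, that doesn't prove it's healthy, but it does make "simply too small" a less likely explanation.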
It'd be really interesting to compare Nsight Systems profiles with and without your stream callback. The thing to look for is how tightly packed the launches are, i.e. how much time there is between launches. Newer versions of Nsight Systems have a graph that shows you the overall GPU utilization.
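If you'd rather put a number on "how packed together" than eyeball the timeline, something like the following Python sketch would do it. It assumes you've already pulled kernel start/end times out of the profile by whatever means you like and that the kernels don't overlap (a single stream); the function name and the sample timestamps are purely illustrative.

```python
def launch_gaps(intervals):
    """Measure idle time between consecutive kernels.

    intervals: list of (start, end) kernel times in any consistent unit,
               assumed non-overlapping (e.g. all on one stream).
    Returns (gaps, busy_fraction), where gaps[i] is the idle time between
    kernel i and kernel i+1, and busy_fraction is kernel time divided by
    the total span from first start to last end.
    """
    if not intervals:
        return [], 0.0
    ordered = sorted(intervals)
    gaps = []
    busy = 0.0
    for i, (start, end) in enumerate(ordered):
        busy += end - start
        if i > 0:
            previous_end = ordered[i - 1][1]
            gaps.append(max(0.0, start - previous_end))
    span = ordered[-1][1] - ordered[0][0]
    busy_fraction = busy / span if span > 0 else 0.0
    return gaps, busy_fraction


if __name__ == "__main__":
    # Made-up timestamps in microseconds, for illustration only.
    with_callback = [(0, 10), (14, 24), (28, 38)]
    without_callback = [(0, 10), (10.5, 20.5), (21, 31)]
    for name, data in (("with", with_callback), ("without", without_callback)):
        gaps, frac = launch_gaps(data)
        print(f"{name} callback: gaps={gaps}, busy={frac:.0%}")
```

A clearly lower busy fraction in the callback run would at least be consistent with the "callback lowers average power" explanation.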
Another thing to try is to run nvidia-smi in a tight loop while you repro both the crash and the non-crashing run with your stream callback turned on. nvidia-smi will show you the power consumption, and maybe you'll be able to see a difference in total power during your program run. (There's an API for querying power too, if you want to put code in your application that measures usage: NVML has nvmlDeviceGetPowerUsage, which reports the draw in milliwatts.) If you can find a way to run your GPU at equal or higher power than in the crashing run, without it crashing, that would tend to rule out the PSU as the problem.
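Here's a rough Python sketch of the nvidia-smi loop, in case it saves you some typing. It relies on nvidia-smi's `--query-gpu=power.draw`, `--format=csv,noheader,nounits`, and `-lms` (loop interval in milliseconds) options; the helper names, the 100 ms interval, and the 10 second duration are my own arbitrary choices. With more than one GPU you'll get one line per GPU per sample, so you'd want to add `-i <index>` to the command.

```python
import subprocess

# Print the GPU power draw in watts, one bare number per line, every 100 ms.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=power.draw",
    "--format=csv,noheader,nounits",
    "-lms", "100",
]


def parse_power_lines(lines):
    """Turn nvidia-smi output lines into floats, skipping blank lines and
    anything that isn't a number (e.g. '[N/A]' on unsupported GPUs)."""
    samples = []
    for line in lines:
        try:
            samples.append(float(line.strip()))
        except ValueError:
            continue
    return samples


def summarize(samples):
    """Return count, mean, and max of the samples, or None if there are none."""
    if not samples:
        return None
    return {
        "n": len(samples),
        "mean": sum(samples) / len(samples),
        "max": max(samples),
    }


def sample_power(duration_s=10.0):
    """Run nvidia-smi for roughly duration_s seconds and return the samples."""
    proc = subprocess.Popen(QUERY_CMD, stdout=subprocess.PIPE, text=True)
    try:
        out, _ = proc.communicate(timeout=duration_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        out, _ = proc.communicate()
    return parse_power_lines(out.splitlines())


if __name__ == "__main__":
    # Start this, then launch your application in another terminal.
    print(summarize(sample_power()))
```

Run it once alongside the crashing configuration and once alongside the callback configuration and compare the mean and max. Keep in mind these are sampled readings, so a very short spike can slip between samples.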
Maybe having a callback between every launch does slow things down enough to stay under a wattage threshold, but I'm not confident that it would change the power consumption enough to produce these effects reliably. If you're doing your callback between every single launch, it might be worth testing a callback after each batch of several launches instead, to get a higher density of launches between callbacks.
Another thing you could try is to use more than one stream and still put a callback between launches or batches. The goal is to make sure the GPU is working on another kernel on stream 2 while the callback on stream 1 is being serviced, basically saturating the GPU to see if you can trigger the power event before 1000 launches. That would help rule out a full kernel launch queue as the cause of the problem.