Maximum launches in queue

Hi,

I’m using the CUDA driver API to queue up a set of launches for a final frame render. When running on an A6000 card and queuing up over 1000 launches (> 1 000 000 000 rays), the machine crashes with a “Kernel Power” event. No errors are reported by OptiX or CUDA; the machine just shuts down, and the “Kernel Power” event then shows up in the Windows Event Viewer.


Advanced troubleshooting for Event ID 41 - "The system has rebooted without cleanly shutting down first" - Windows Client Management | Microsoft Docs.

This does not happen on any other GPU that I’ve tried, only on the A6000 (I’ve tried a few A6000 cards, all sitting in different machines but with the same hardware config), and I guess that implies the power supply (750 W) is too weak.

But before I go and buy myself a new PSU, I’m wondering if there is any limit to how many OptiX launches one can queue up in a stream? Or maybe a limit on how many tasks in general can be in the queue?

I’m using OptiX 7.2 and have tried a couple of production drivers from 460 to 472.

Thanks
Oscar

Hi @oborgstrom!

That’s wild, I haven’t seen the Kernel Power event before! I’m just guessing, but because this doesn’t happen on other GPUs, it probably is an under-powered PSU.

OptiX doesn’t manage the stream launch queue; that is a pure CUDA feature. My understanding is that there is a finite-sized queue (and I even found another internet post mentioning the number 1000), but that it should not be possible to exceed the queue size accidentally and cause a crash.

The CUDA expert Robert Crovella mentioned in this Stack Overflow post https://stackoverflow.com/a/53989841 that when you reach the queue size, CUDA launch calls will become synchronous until the queue isn’t full anymore. That should mean that you can’t crash your GPU by launching too many kernels at the same time. BUT - note that this could have bad consequences for what you expect to happen versus what actually happens. If you launch 1200 kernels on one stream, and then 1200 kernels on another stream, for example, then the surprise synchronous behavior could prevent you from even starting to issue the second stream’s kernels until after many of the first stream’s launches are complete. This could also stall your host thread unexpectedly in ways that prevent other parallel host work. Just note that the launch call itself is not always async.

One way you might be able to test this, and potentially a way you might want to architect your renderer anyway, is to manage the queue size and control how many launches you queue up. The easiest thing that comes to mind for me is to insert stream events between each batch of launches, and use a stream callback to issue a new batch. You could decide your queue size is, say, 300 launches. Launch a batch of 10 at a time, for example, insert a stream callback after each batch, and stop issuing launches once you’ve sent 300 of them. Then every time you get a callback, issue 10 more launches. This way you’ll never have more than 300 at a time in the launch queue. This kind of scheme should keep the queue full enough to enjoy all the batching benefits. I could definitely be wrong here, I’m speculating, but I would guess this should also let you test the power theory, by showing whether you still get the Kernel Power events even with a limited queue size.
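The batching scheme described above can be sketched as a small host-side model. This is only an illustration of the bookkeeping, not CUDA code: the class and method names are hypothetical, and `_issue_batch` and `on_batch_complete` stand in for the real launch calls and the real stream callback. In a real driver API implementation the marker after each batch could be something like `cuLaunchHostFunc` or `cuStreamAddCallback`; CUDA documents that such host callbacks must not make CUDA API calls themselves, so the callback would presumably have to signal a host thread that then issues the next batch.

```python
from collections import deque


class BoundedLauncher:
    """Host-side model of a capped launch queue refilled from batch callbacks."""

    def __init__(self, total_launches, max_in_flight=300, batch_size=10):
        if max_in_flight < batch_size:
            raise ValueError("max_in_flight must be at least batch_size")
        self.remaining = total_launches      # launches not yet issued
        self.max_in_flight = max_in_flight   # self-imposed queue cap
        self.batch_size = batch_size         # launches per callback marker
        self.in_flight = 0                   # issued but not yet completed
        self.peak_in_flight = 0              # highest queue depth observed
        self.pending_batches = deque()       # sizes of batches awaiting callback

    def _issue_batch(self):
        # Stand-in for: enqueue n launches, then enqueue one callback marker.
        n = min(self.batch_size, self.remaining)
        self.remaining -= n
        self.in_flight += n
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        self.pending_batches.append(n)

    def fill(self):
        # Issue batches while work remains and a full batch fits under the cap.
        while (self.remaining > 0
               and self.in_flight + self.batch_size <= self.max_in_flight):
            self._issue_batch()

    def on_batch_complete(self):
        # Stand-in for the stream callback: the oldest batch has finished.
        self.in_flight -= self.pending_batches.popleft()
        self.fill()

    def run(self):
        # Drive the model to completion; returns the number of callbacks fired.
        callbacks = 0
        self.fill()
        while self.pending_batches:
            self.on_batch_complete()
            callbacks += 1
        return callbacks


launcher = BoundedLauncher(total_launches=10_000)
callbacks = launcher.run()
print(callbacks, launcher.peak_in_flight)
```

With the numbers from the example (cap of 300, batches of 10), the model never holds more than 300 launches at once, and each callback tops the queue back up by one batch.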


David.

Hi @dhart!

Thanks for your answer. Exhaustive as always!

“CUDA launch calls will become synchronous until the queue isn’t full anymore.”

After some more testing I can confirm that I get synchronous behaviour when I queue up more than approximately 1000 launches.

I do have a setting that inserts an event into the stream between each launch, which feeds back to a progress bar. I’ve now noticed that with this setting on, it doesn’t crash, even though everything is still queued up at once (10 000 launches and 10 000 events, no problem). The progress bar is updated on another thread so that it doesn’t affect the launches. Maybe the card doesn’t reach the same power consumption when it has to report events between each kernel.

The launch count before it crashes also seems to depend a bit on the resolution.
500x500 px = launch limit approx 1000.
1000x1000 px = launch limit approx 500-1000.
3000x3000 px = launch limit approx 50-100.
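A quick back-of-the-envelope check of these numbers, assuming one ray per pixel per launch (an assumption for illustration; the actual rays per pixel are not stated here), is to multiply out the total work queued at each crash threshold. If the per-pixel ray count is the same at every resolution, the comparison between rows holds regardless of its actual value.

```python
# (width, height) -> (lowest, highest) approximate launch count before a crash,
# taken from the three observations listed above.
observed = {
    (500, 500): (1000, 1000),
    (1000, 1000): (500, 1000),
    (3000, 3000): (50, 100),
}

RAYS_PER_PIXEL = 1  # assumption: one ray per pixel per launch


def queued_rays(width, height, launches, rays_per_pixel=RAYS_PER_PIXEL):
    """Total rays queued across `launches` launches of a width x height frame."""
    return width * height * launches * rays_per_pixel


totals = {
    res: tuple(queued_rays(*res, n) for n in limits)
    for res, limits in observed.items()
}

for (w, h), (lo, hi) in totals.items():
    print(f"{w}x{h}: {lo / 1e6:.0f}M to {hi / 1e6:.0f}M rays queued")
```

Under that assumption the totals all land within the same order of magnitude, which may hint that the threshold tracks the total amount of queued work rather than the launch count itself.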

Keeping the total queue size down by using events is a good suggestion. But it feels like that would be bypassing the problem rather than solving it, given that what I’m getting isn’t intended behaviour.

Regards
Oscar

Okay, it’s interesting that it stops crashing with a stream event. I’m not sure this points in either direction: it might be reducing power consumption and preventing the crash, but it might also suggest that my theory is wrong and that a new PSU won’t fix the problem.

I guess it’s worth asking if you know that your PSU is too small for the peak power consumption of all the components in your system? Should your current one be big enough on paper, or is it clear that if the CPU & GPU & disk & lights are all at full power draw at the same time, then there might not be enough?
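The "big enough on paper" check is just a sum of worst-case component draws compared against the PSU rating. A minimal sketch follows; every wattage in it is a placeholder for illustration except the 300 W figure, which is the published maximum board power of the RTX A6000. Substitute the real figures for the machine in question.

```python
# Peak draw per component in watts. Only the GPU figure is a published spec;
# the other values are hypothetical placeholders.
peak_watts = {
    "gpu_rtx_a6000": 300,
    "cpu": 150,               # hypothetical
    "motherboard_ram": 60,    # hypothetical
    "disks_fans_lights": 40,  # hypothetical
}


def psu_headroom(psu_watts, component_peak_watts):
    """Return (total peak draw, spare watts, spare fraction of PSU rating)."""
    total = sum(component_peak_watts.values())
    spare = psu_watts - total
    return total, spare, spare / psu_watts


total, spare, fraction = psu_headroom(750, peak_watts)
print(f"peak draw {total} W, spare {spare} W ({fraction:.0%} of rating)")
```

Rated peaks do not capture short transient spikes, so a positive margin on paper does not by itself rule out the PSU.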

It’d be really interesting to check an Nsight Systems profile with and without your stream callback. The thing to look for is how packed together the launches are, how much time there is between launches. Newer versions of Nsight Systems have a graph that shows you the overall GPU utilization.

Another thing to try is to run nvidia-smi in a tight loop while you repro both the crash behavior and the non-crash when your stream callback is turned on. nvidia-smi will show you the power consumption, and maybe you’ll be able to see a difference in the sum total power during your program run. (I’m sure there’s an API for querying power too, if you wanted to put code in your application that measures usage.) If you can find a way to run your GPU that runs at equal or higher power without crashing than your run when it does crash, that would tend to rule out the PSU being the problem.
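One way to do the nvidia-smi comparison is to log power readings to a file during each run and compare the peaks afterwards. The sketch below assumes the log was produced with something like `nvidia-smi --query-gpu=timestamp,power.draw --format=csv,noheader,nounits -lms 100 > power_log.csv`; the file name and the 100 ms interval are arbitrary choices, and the sample readings in the code are made up purely to exercise the parser. For in-application measurement, NVML's `nvmlDeviceGetPowerUsage` (which reports milliwatts) is probably the API to look at.

```python
import csv
import io


def parse_power_log(text):
    """Parse 'timestamp, power.draw' CSV rows into a list of watt readings."""
    watts = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) < 2:
            continue  # blank or malformed line
        try:
            watts.append(float(row[1]))
        except ValueError:
            continue  # e.g. a header row or an "[N/A]" reading
    return watts


def summarize(watts):
    """Return (peak, mean) power in watts, or None if there are no samples."""
    if not watts:
        return None
    return max(watts), sum(watts) / len(watts)


# Made-up sample in the assumed log format, for illustration only.
sample = """\
2021/10/05 14:03:11.100, 121.50
2021/10/05 14:03:11.200, 289.25
2021/10/05 14:03:11.300, [N/A]
2021/10/05 14:03:11.400, 297.75
"""

readings = parse_power_log(sample)
peak, mean = summarize(readings)
print(f"{len(readings)} samples, peak {peak:.2f} W, mean {mean:.2f} W")
```

Since the machine shuts down hard in the crashing case, the last few samples may never reach the disk, so the logged peak for that run is best treated as a lower bound.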

Maybe having a callback between every launch does slow things down enough to stay under a wattage threshold, but I’m not confident that it would change the power consumption enough to produce these effects reliably. If you’re doing your callback between every single launch, it might be worth testing a callback after each batch of several launches instead, to get a higher density of launches between the callbacks.

Another thing you could try is to queue up more than one stream, and still use a callback between launches or batches. The goal is to make sure the GPU is working on another kernel on stream 2 while servicing the callback on stream 1, basically saturating the GPU to see if you can trigger the power event before reaching 1000 launches. That would help rule out whether a full kernel launch queue is causing the problem.


David.

Thanks for your suggestions @dhart.

The PSU is probably on the edge. It’s hard to say: every calculator I try suggests about 650 W, and we have 750 W. But I might be missing something.

The crash does not seem to be connected to the launch count, I’m sometimes getting crashes if I launch just a couple of very heavy launches too (maybe I should have done more extensive testing before asking my initial question). The stream callbacks were probably only helping to pull down the consumption when there were many smaller launches.

The machine has now also been crashing when running machine learning tasks with PyTorch, so I’ve handed the problem over to our IT support for now.

Regards
Oscar


Now it sounds like a bigger PSU might indeed fix the issue. :) Importantly from my perspective, this seems to rule out an issue with the launch queue being full, so thank you for the update! Questions before extensive testing are welcome, I only hope my speculation was more helpful than not helpful. Good luck with the machine, I hope your IT sorts it out.


David.
