Time slicing a large kernel

I’m writing a program that renders fractals with a progressive-render approach. I need to saturate the device but still have it remain responsive, say by limiting individual kernel runtime to less than 15 ms.

The traditional approach is to issue smallish batches, but GPUs vary enormously in performance, especially across generations, and iterations can also take significantly different amounts of time depending on the specific fractal. As such, I’d expect the runtime of a static batch to vary by as much as two orders of magnitude(!) between a cheap fractal on a high-end system and an expensive fractal on a low-end system. With that much variance, choosing a batch size that keeps a high-end device busy without slowing a low-end system to a crawl is a big problem.

Is there a good way to achieve time slicing of this sort of workload? One idea is to use clock() to periodically check whether a given time span has elapsed and stop the kernel if it has. A problem with this is that clock speeds not only vary between devices, but also drift over time according to the whims of the driver. There’s also the possibility of some block starting much later than the others and therefore having its timer start late. That could be handled by setting a flag whenever any block hits its time limit; every block would then check the flag in addition to its own timer.
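
In kernel terms, I imagine the clock()-based variant looking roughly like this (a sketch only; abort_flag and the batch body are placeholders, and max_clocks would have to be derived from a clock rate the driver may not actually hold):

```
// Sketch: each block watches its own clock64() and a global flag, so a block
// that started late still quits as soon as any other block hits the limit.
// The host zeroes abort_flag (e.g. via cudaMemcpyToSymbol) before each launch.
__device__ volatile int abort_flag;

__global__ void render_slice(float4 *accum, long long max_clocks)
{
    long long start = clock64();        // per-SM counter; tick rate follows boost clocks

    while (!abort_flag) {
        // ... one small batch of fractal iterations, accumulated into accum ...

        if (clock64() - start > max_clocks) {
            abort_flag = 1;             // tell every other block to wind down too
            break;
        }
    }
    // ... write back whatever per-thread state is needed to resume next slice ...
}
```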

Another possibility is to run a timer on the host and then signal the GPU somehow. Is there a good way to do this? Can the GPU see modifications to host memory through zero-copy? What about the various memcpy functions? If the host calls an async memcpy, will the device see the copied memory during kernel execution, or might the transfer have to wait until the kernel terminates? The fact that modern GPUs have asynchronous copy engines suggests they should be able to see a memcpy issued partway through a kernel, but…

A final concern is keeping everything synced with the monitor refresh. Ideally there would be a single kernel slice for every frame sent to the monitor, but that needs accurate timing to minimize idle time between frames. Another approach is to make the time slices shorter, say 1 ms, which guarantees breaks between kernels to sneak in UI updates and keeps the number of iterations per frame fairly even (especially important for real-time animation!). Is a duration that short small enough that I have to worry about kernel launch overhead eating into useful work?

What sort of suggestions does everyone have for dealing with this?

Run a single test kernel at app start-up. Time the duration of that kernel. Use that to scale subsequent kernel launch “sizes”.
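
Something like this, for instance (rough sketch; fractal_batch, the grid, and the buffers stand in for whatever your renderer actually uses):

```
#include <cuda_runtime.h>

__global__ void fractal_batch(float4 *accum, int iters);   // assumed: one batch of iterations

// Time one probe launch with events, then scale the per-slice batch so it
// lands near target_ms on this particular GPU with this particular fractal.
int calibrate_slice_iters(dim3 grid, dim3 block, float4 *accum, float target_ms)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    const int probe_iters = 1 << 16;                 // small, known amount of work
    cudaEventRecord(t0);
    fractal_batch<<<grid, block>>>(accum, probe_iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float probe_ms = 0.0f;
    cudaEventElapsedTime(&probe_ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);

    return (int)(probe_iters * (target_ms / probe_ms));
}
```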

Yes, a GPU can see changes to zero-copy memory.
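
Something along these lines is what I mean (minimal sketch; the kernel body and launch shape are placeholders): the host allocates mapped, pinned memory, the kernel polls it through the device pointer, and a host timer writes the stop flag.

```
#include <cuda_runtime.h>

// The kernel polls a zero-copy flag; the host sets it when the slice should end.
__global__ void render_until(volatile int *stop)
{
    while (!*stop) {
        // ... one small batch of fractal iterations ...
    }
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);             // before any other CUDA call

    int *h_stop = nullptr, *d_stop = nullptr;
    cudaHostAlloc((void **)&h_stop, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_stop, h_stop, 0);

    *h_stop = 0;
    render_until<<<dim3(256), dim3(256)>>>(d_stop);

    // ... host-side timer: sleep for the slice length, then ...
    *h_stop = 1;                                       // visible to the running kernel
    cudaDeviceSynchronize();
    cudaFreeHost(h_stop);
    return 0;
}
```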

I second txbub’s recommendation.

For a static scheme, one could start with the observation that humans usually consider a system responsive if the response occurs within 0.1 seconds = 100 msec (there are publications about this, but I do not want to put in the effort right now to track them down). So that is the upper limit one would target for a slow GPU. The performance gap between the fastest and the slowest GPU in use at any given time does not usually exceed 20x, giving you 5 msec per kernel on the fastest GPUs around, which is still a reasonable execution time compared to the minimum possible (5 usec lower limit of launch overhead).

Given how sluggish Folding@Home makes machines with low-end GPUs, I assume that even a widely distributed application like that uses a static rather than a calibrated work-partitioning scheme.

Use multiple streams with different priorities.
You could e.g. have one stream that produces a quarter resolution image synchronized to each frame with time to spare, and a second lower priority stream that fills in intermediate pixels, but might potentially not finish on time.
Then it needs a bit of compositing to blend in the higher res data where available (too bad the z buffer isn’t exposed - maybe OpenGL interop could provide some help there?).
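
In case it helps, the stream setup would look roughly like this (kernel names, grids and buffers are placeholders; in a real renderer the streams would be created once, not per frame):

```
#include <cuda_runtime.h>

__global__ void quarter_res_pass(float4 *out);   // assumed: the frame-critical pass
__global__ void refine_pass(float4 *out);        // assumed: the best-effort refinement

void setup_streams_and_launch(float4 *quarter_buf, float4 *full_buf,
                              dim3 grid_q, dim3 grid_f, dim3 block)
{
    int lo_prio, hi_prio;
    cudaDeviceGetStreamPriorityRange(&lo_prio, &hi_prio);   // hi_prio is the numerically smaller value

    cudaStream_t frame_stream, refine_stream;
    cudaStreamCreateWithPriority(&frame_stream,  cudaStreamNonBlocking, hi_prio);
    cudaStreamCreateWithPriority(&refine_stream, cudaStreamNonBlocking, lo_prio);

    // Per frame: the quarter-res pass must finish in time, the refinement may not.
    quarter_res_pass<<<grid_q, block, 0, frame_stream>>>(quarter_buf);
    refine_pass<<<grid_f, block, 0, refine_stream>>>(full_buf);
}
```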

A similar scheme would be possible for pixels with a high iteration count: the higher-priority kernel performs only a limited number of iterations, and pushes the points that hit that limit into a queue for the lower-priority kernel.
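
The hand-off could look something like this (entirely a sketch; Point, the cap and the queue layout are made up):

```
// The high-priority kernel iterates each point up to a cap and appends the
// unfinished ones to a global overflow queue for the low-priority kernel.
struct Point { float2 z, c; int iters; };

__device__ unsigned int queue_count;          // zeroed by the host each frame

__global__ void iterate_capped(Point *in, int n, Point *overflow, int cap)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Point p = in[i];
    int it = p.iters;
    for (; it < cap; ++it) {
        // ... one fractal iteration on p.z, accumulate the sample,
        //     break out early if the point escapes ...
    }
    p.iters = it;

    if (it == cap) {                          // hit the cap: defer to the low-priority stream
        unsigned int slot = atomicAdd(&queue_count, 1u);
        overflow[slot] = p;
    }
}
```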

On top of that, you can dynamically adjust parameters depending on how early/late the lower priority kernel finishes.

The fractal works (very roughly speaking) by iterating random points and plotting/accumulating them to the screen. In this way the final image gets progressively less noisy, in much the same manner as something like path tracing. Interestingly enough, some of the optimizations that apply to path tracing also apply to the fractal render, like Metropolis sampling to locate points that actually land in the viewport when it’s zoomed in to a tiny area. Anyway, a work unit can be arbitrarily large or small, and the render can be concluded at any time.
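
For reference, the accumulate step is essentially this (heavily simplified; the viewport mapping and names are placeholders):

```
// Each orbit sample that lands inside the viewport bumps a counter in a density
// buffer; a separate display pass tone-maps it. Noise averages out over slices.
__device__ void plot(float *density, int width, int height,
                     float2 z, float2 view_min, float2 view_scale)
{
    int px = (int)((z.x - view_min.x) * view_scale.x);
    int py = (int)((z.y - view_min.y) * view_scale.y);
    if (px >= 0 && px < width && py >= 0 && py < height)
        atomicAdd(&density[py * width + px], 1.0f);
}
```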

One of the major design goals is for the user to be able to enter their own equations to render fractals from. This makes static task partitioning troublesome, since one moment they might be rendering a simple Buddhabrot fractal, z=>z^2+z0, and the next they might be playing with something horrifying, like z=>e^((z^5-2z^2)/(z^2+2*z0))-sin(tan(z))cosh(2z^2+c)/(atan((3+2i)*z)-cos(z0)). Clearly it’s possible to get orders of magnitude different performance just by changing the fractal. In particularly pathological cases, the complexity could even vary as a function of each initial randomly sampled point! Although in practice, iterating multiple points in each warp means you’d usually just hit the worst case. Except for Metropolis sampling… Still, it would usually be reasonable to do a timing run each time the equation changes. Or just use clock().

Another consideration is that I would like to have real time previewing/rendering of fractals while the user manipulates parameters. This puts a hard limit on kernel time in order to maintain the target framerate, which is specified by the user since it’s a tradeoff between smoothness of animation and render quality.

Multiple streams with priorities is a very interesting idea, since it allows useful things like continuing to edit fractals and use the program even while a final render is in progress. Some priority juggling would be in order, since you want the editor preview to have high priority while the user is actively manipulating controls, low priority (but still receiving some execution time) when it’s just refining the preview, and completely off if the editor window is out of focus. Is there a way to have “soft” stream priority, where high priority streams are simply picked for execution more often than low priority streams?

Does CUDA expose a good way to manipulate predicate state? An example would be a partially divergent warp where one or more of the iterated points hit the viewport and can be selected as a source for Metropolis sampling. For the sake of efficiency, it would be better to store the current predicate, perform the additional samples with a full warp, then revert the predicate afterward. Another example would be a rand() function with a large state: the state should be updated by a full warp, and then the program needs to return to whatever state it was in before some threads called rand(). In PTX land, the setp and selp instructions seem to do the job(?), but I’d prefer to avoid inline PTX.
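
In C terms, what I’m after is roughly the following pattern (sketch only; warp_rand_advance and the hit test are placeholders): save the condition, let the full warp update the shared state, then carry on as though only the original threads had been active.

```
__device__ void warp_rand_advance(unsigned long long *state);   // placeholder: full-warp RNG update

__device__ void sample_step(unsigned long long *rand_state, float2 z,
                            float2 view_min, float2 view_max)
{
    // The per-thread "predicate": did this lane's point land in the viewport?
    bool hit = (z.x >= view_min.x && z.x < view_max.x &&
                z.y >= view_min.y && z.y < view_max.y);

    // If any lane needs it, every active lane helps advance the large RNG state.
    unsigned mask = __activemask();
    if (__any_sync(mask, hit)) {
        warp_rand_advance(rand_state);
    }

    // "Restore the predicate": only the lanes that actually hit continue on.
    if (hit) {
        // ... use the fresh random numbers for the Metropolis samples ...
    }
}
```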