I’m writing a program that renders fractals progressively, and I need to saturate the device while keeping the system responsive — say, limit each individual kernel run to less than 15 ms.
The traditional approach is to issue smallish batches, but GPUs vary vastly in power, especially across generations, and iteration cost also varies significantly with the specific fractal being rendered. As a result, I’d expect the runtime of a fixed batch size to vary by as much as two orders of magnitude(!) between a cheap fractal on a high-end system and an expensive fractal on a low-end system. With that much variance, choosing a batch size that keeps a high-end device busy but doesn’t slow a low-end system to a crawl is a real problem.
Is there a good way to achieve time slicing of this sort of workload? One idea is to use clock() inside the kernel to periodically check whether a given time span has elapsed and stop the kernel if it has. One problem with this is that clock speeds not only vary between devices, but also drift over time at the whim of the driver. There’s also the awkward possibility that some blocks start much later than others, so their timers start late. That could be handled by setting a global flag whenever any block hits its time limit; every block would then check the flag in addition to its own timer.
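Roughly what I have in mind, sketched for a Mandelbrot-style iteration (untested; all the names like `fractalSlice` and `g_timeUp` are just placeholders I made up, and the cycle budget would have to be derived from the reported clock rate, which is exactly the shaky part):

```cpp
// Sketch of the in-kernel clock() idea: every block watches its own cycle budget,
// and any block that exceeds it raises a global flag that all blocks also poll,
// so blocks that were scheduled late still stop promptly.
__device__ int g_timeUp;   // host zeroes this before each slice, e.g. via cudaMemcpyToSymbol

__global__ void fractalSlice(float2 *z, const float2 *c, int *iter,
                             int n, int maxIter, long long cycleBudget)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    volatile int *timeUp = &g_timeUp;   // volatile view so updates from other blocks are seen
    long long start = clock64();
    float2 zi = z[i];
    int it = iter[i];

    // Iterate z = z^2 + c until escape, the iteration cap, or the time budget runs out.
    while (it < maxIter && zi.x * zi.x + zi.y * zi.y < 4.0f) {
        float x = zi.x * zi.x - zi.y * zi.y + c[i].x;
        zi.y = 2.0f * zi.x * zi.y + c[i].y;
        zi.x = x;
        ++it;

        if ((it & 63) == 0) {                          // poll only every 64 iterations
            if (*timeUp) break;                        // some other block ran out of time
            if (clock64() - start > cycleBudget) {     // this block ran out of time
                *timeUp = 1;
                break;
            }
        }
    }

    z[i] = zi;        // persist per-pixel state so the next slice resumes here
    iter[i] = it;
}
```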
Another possibility is to run a timer on the host and then signal the GPU somehow. Is there a good way to do this? Can a running kernel see modifications to host memory through zero-copy (mapped pinned memory)? What about the various memcpy functions: if the host issues an async memcpy, will the device see the copied data during kernel execution, or might the transfer have to wait until the kernel terminates? The fact that modern GPUs have asynchronous copy engines suggests they should be able to pick up a memcpy halfway through executing a kernel, but…
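For the host-timer variant, this is roughly what I’m picturing (again just a sketch with made-up names, and the real kernel would do actual rendering work in the loop); whether the kernel is guaranteed to observe the host’s write while it’s still running is part of what I’m asking:

```cpp
// Sketch: stop flag in mapped (zero-copy) pinned memory, polled by the kernel,
// set by the host after ~15 ms of wall-clock time.
#include <cuda_runtime.h>
#include <chrono>
#include <thread>

__global__ void sliceKernel(volatile int *stop /*, ...per-pixel render state... */)
{
    while (!*stop) {
        // ... one iteration of work per thread, persisting state between slices ...
    }
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // needed for mapped memory on older setups

    int *stopHost = nullptr, *stopDev = nullptr;
    cudaHostAlloc((void **)&stopHost, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&stopDev, stopHost, 0);

    *stopHost = 0;
    sliceKernel<<<1024, 256>>>(stopDev);

    std::this_thread::sleep_for(std::chrono::milliseconds(15));
    *stopHost = 1;                 // hoping the running kernel sees this through the mapping

    cudaDeviceSynchronize();
    cudaFreeHost(stopHost);
    return 0;
}
```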
A final concern is keeping everything synced with the monitor refresh. Ideally there would be exactly one kernel slice for every frame sent to the monitor, but that requires accurate timing to minimize idle time between frames. Another approach is to make the time slices much shorter, say 1 ms, so there are frequent breaks between kernels to sneak in UI updates, and so the number of iterations per frame stays fairly even (especially important for real-time animation!). Is a duration that short small enough that kernel launch overhead starts eating into useful work? I was planning to measure it with something like the snippet below.
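Quick benchmark I had in mind for the launch-overhead question (empty kernel, so it measures only the cost of issuing and retiring launches back to back, not real work):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void emptyKernel() {}

int main()
{
    const int N = 10000;
    emptyKernel<<<1, 1>>>();           // warm-up; also creates the context
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1024, 256>>>();  // grid sized like a real slice, but doing nothing
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / N;
    std::printf("average launch + completion cost: %.1f us per kernel\n", us);
    return 0;
}
```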
What sort of suggestions does everyone have for dealing with this?