Help optimizing a vertical compute blur pass

As a way of learning compute shaders and practicing graphics programming, I've been writing a compute-based cubemap blur for use in a real-time variance shadow mapping algorithm. Since shadow filtering can get away with a linear (box) blur, and I wanted large blur widths, I went with a moving-average approach: each thread is assigned a row (or column) and walks along it maintaining a running average, which gives linear time for arbitrary blur widths. I also chose this approach to limit unnecessary repeated memory accesses.

Performance for the horizontal blur pass is surprisingly good, taking around 0.3 ms to blur four cubemaps (point lights) at a resolution of 512x512 per face. The problem is the vertical pass: despite being nearly identical apart from progressing along the Y axis instead of the X axis, it takes anywhere from 2-3x longer.
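For reference, the moving-average idea can be sketched on the CPU like this. This is a minimal Python sketch, not the actual shader: one "thread" processes one row, keeping a running window sum so the cost is O(n) per row regardless of the blur radius (edge-clamped sampling is assumed; the function name is hypothetical).

```python
def moving_average_row(row, radius):
    """Box-blur one row with a sliding window: O(len(row)) work,
    independent of radius, mirroring one-thread-per-row compute.
    Samples outside the row are clamped to the edge."""
    n = len(row)
    width = 2 * radius + 1
    # Prime the window with the first pixel's (edge-clamped) neighborhood.
    window_sum = sum(row[min(max(i, 0), n - 1)] for i in range(-radius, radius + 1))
    out = []
    for x in range(n):
        out.append(window_sum / width)
        # Slide the window: add the sample entering it, drop the one leaving.
        window_sum += row[min(x + radius + 1, n - 1)] - row[max(x - radius, 0)]
    return out
```

The vertical pass is the same loop stepping down a column instead of across a row, which is exactly where the strided memory access pattern comes from.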

I used Nsight to profile the application and found that while L2 throughput is around 99% during the horizontal pass, it varies between roughly 60% and 70% during the vertical pass (denoted by the "dispatch started" marker in the attached image). All other throughput metrics are similarly lower.

Is cache just less able to handle accessing memory in a vertical pattern? Are there any workarounds or tricks I could use to speed up the second pass?


Hi m3kabin,

Generally speaking, yes: striding in the Y direction will be slower than striding in X. Buffers have the worst cache locality for this access pattern, while surfaces fare better. The lower throughput percentages indicate that the pass is latency limited.

A few ideas to try:

  1. Transpose the horizontal pass' output as it's written out, so that the vertical pass becomes another horizontal pass (which also transposes on the way out). The transposed write pattern only needs to outrun the slower vertical pass to be a net win; it may also improve write coalescing.
  2. Try using the Real-Time Shader Profiler option in GPU Trace to identify which lines of shader code are taking the longest. It is most likely the load from memory, but it is worth verifying that.
  3. Search for similar techniques, such as large matrix multiplies, “separable 2D convolutions”, “2D IIR filters”, and downsampling techniques in DSP or graphics. Some of these are tile based algorithms, which can take advantage of shared memory to gain a speedup.
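Suggestion 1 can be illustrated with a small CPU sketch. Assuming a simple edge-clamped box blur as a stand-in for your sliding-window pass (both helper names here are hypothetical), each pass reads rows but writes its result transposed, so the "vertical" pass is really a second horizontal pass, and the two transposes cancel:

```python
def box_row(row, radius):
    # Naive edge-clamped 1D box blur of one row (stand-in for the
    # fast sliding-window implementation).
    n = len(row)
    return [sum(row[min(max(x + d, 0), n - 1)] for d in range(-radius, radius + 1))
            / (2 * radius + 1) for x in range(n)]

def blur_2d_transposed_passes(img, radius):
    """Separable 2D box blur done as two *horizontal* passes.
    Each pass blurs along rows (cache-friendly reads) and writes its
    output transposed; the second transpose restores orientation."""
    def pass_transposed(image):
        blurred = [box_row(row, radius) for row in image]
        # Write transposed: column j of the result becomes row j.
        return [list(col) for col in zip(*blurred)]
    return pass_transposed(pass_transposed(img))
```

On the GPU, the transposed write is typically done through a shared-memory tile so that both the reads and the global writes stay coalesced, rather than transposing element by element.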

To learn more about GPU caches and memory coalescing, I suggest looking at CUDA literature and presentations; the same GPU behaviors largely apply to graphics. “Global memory” in CUDA is the same as a “Buffer” in graphics.

Hope that helps!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.