As a way of learning compute shaders and practicing graphics programming, I've been writing a compute-based cubemap blur for use in a realtime variance shadow mapping algorithm. Since shadow filtering can get away with a linear (box) blur and I wanted large blur widths, I went with a moving-average style blur: each thread is assigned a row (or column) and walks along it, updating a running average, which gives linear time regardless of blur width. This approach also avoids unnecessary repeated memory accesses. Performance for the horizontal blur pass is surprisingly good: around 0.3 ms to blur four cubemaps (point lights) at 512x512 per face. The problem is the vertical pass, which, despite being nearly identical apart from stepping along the Y axis instead of the X axis, takes 2-3x longer.
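To make the moving-average idea concrete, here is a minimal CPU-side sketch of the running-window trick for a single row. This is an illustration of the general technique, not my actual shader code; the clamped edge handling is an assumption about boundary behavior.

```python
def moving_average_blur(row, radius):
    """O(n) box blur of one row, independent of radius.
    The window sum is updated incrementally: add the texel entering
    the window, subtract the one leaving it. Edges are clamped."""
    n = len(row)
    clamp = lambda i: min(max(i, 0), n - 1)
    width = 2 * radius + 1
    # Initial window sum, centred on index 0.
    window = sum(row[clamp(i)] for i in range(-radius, radius + 1))
    out = []
    for x in range(n):
        out.append(window / width)
        # Slide the window one texel to the right.
        window += row[clamp(x + radius + 1)] - row[clamp(x - radius)]
    return out
```

Each output texel costs one add and one subtract, so doubling the blur radius adds no per-texel work; in the shader, one thread does this for its whole row (or column).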
I used Nsight to profile the application and found that while L2 throughput is around 99% for the horizontal pass, the vertical pass (denoted by the dispatch-start marker in the attached image) varies between roughly 60% and 70%. All other throughput metrics are similarly lower.
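One way to see why the two passes can behave so differently is to count how many cache lines a single thread's reads touch. Real GPU textures are usually tiled/swizzled rather than linear, and the 128-byte line size and 4-byte texel below are assumptions, but a row-major sketch shows the basic asymmetry:

```python
def cache_lines_touched(coords, width, bytes_per_texel=4, line_bytes=128):
    """Count distinct cache lines hit by a list of (x, y) texel reads,
    assuming a linear row-major layout. Real GPU textures are typically
    tiled/swizzled, so treat this as an illustration, not a measurement."""
    lines = {(y * width + x) * bytes_per_texel // line_bytes
             for x, y in coords}
    return len(lines)

width = 512
horizontal = [(x, 0) for x in range(32)]  # one thread walking along a row
vertical = [(0, y) for y in range(32)]    # one thread walking down a column
print(cache_lines_touched(horizontal, width))  # 1: 32 4-byte texels fit in one 128-byte line
print(cache_lines_touched(vertical, width))    # 32: every step lands in a new line
```

Under these assumptions, the column-walking thread pulls in a full cache line per sample but uses only 4 bytes of it, which is consistent with lower L2 throughput on the vertical pass.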
Is the cache simply less able to handle a vertical access pattern? Are there any workarounds or tricks I could use to speed up the second pass?