I’m trying to optimize rendering on a Tegra X1 GPU. Our lighting uses a standard Forward+ approach: a compute shader culls the visible lights per 8x8-pixel tile, and later, while rendering an object, I fetch a bitmask of visible lights (lightIndices) for each rendered pixel and light it accordingly. The shader compiler’s static analysis reports 5 divergent branches when the lighting code is present. If I replace the texture fetch of lightIndices with a uniform value from a constant buffer, the number of divergent branches drops to 0, and some other values, such as the latency and throughput limiters, improve as well. Since simple scalarization of the lighting reduced VGPR usage for me on other platforms, I tried a similar approach here.
I’ve enabled the GL_KHR_shader_subgroup extension and used subgroupOr(lightIndices), or even subgroupBroadcastFirst(subgroupOr(lightIndices)), to get the same lightIndices bitmask for the whole warp/subgroup. I expected the GLSL shader compiler to treat this value as uniform and so reduce divergence. Unfortunately, nothing like that happened, and a GPU capture shows no measurable improvement. Why is that?
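For reference, here is a minimal sketch of the kind of scalarized loop I mean. It is a simplified illustration, not our actual shader: the resource names (lightMaskTex, Lights), the bindings, the 32-bit mask width, the Light layout and the shadeLight helper are all placeholders assumed for this example.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require // subgroupOr
#extension GL_KHR_shader_subgroup_ballot : require     // subgroupBroadcastFirst

// Hypothetical resources: names, bindings and formats are assumptions.
layout(binding = 0) uniform usampler2D lightMaskTex;   // assumed: one 32-bit mask per 8x8 tile

struct Light {
    vec4 positionRadius; // xyz = position, w = radius
    vec4 color;
};
layout(std140, binding = 1) uniform Lights { Light lights[32]; };

layout(location = 0) in vec3 worldPos;
layout(location = 1) in vec3 normal;
layout(location = 0) out vec4 outColor;

// Placeholder lighting: Lambert term with a linear distance falloff.
vec3 shadeLight(Light l, vec3 p, vec3 n) {
    vec3 toLight = l.positionRadius.xyz - p;
    float dist = length(toLight);
    float atten = clamp(1.0 - dist / l.positionRadius.w, 0.0, 1.0);
    float ndl = max(dot(n, toLight / max(dist, 1e-4)), 0.0);
    return l.color.rgb * ndl * atten;
}

void main() {
    // Per-pixel fetch of the tile's light mask (8x8 tiles, hence >> 3).
    ivec2 tile = ivec2(gl_FragCoord.xy) >> 3;
    uint lightIndices = texelFetch(lightMaskTex, tile, 0).x;

    // Merge the masks of all active invocations so every lane sees the same value.
    uint mask = subgroupBroadcastFirst(subgroupOr(lightIndices));

    vec3 color = vec3(0.0);
    // Walk the set bits; loop control depends only on the merged mask.
    while (mask != 0u) {
        int i = findLSB(mask);
        mask &= mask - 1u; // clear the lowest set bit
        color += shadeLight(lights[i], worldPos, normal);
    }
    outColor = vec4(color, 1.0);
}
```

With the merged mask, every invocation in the subgroup runs the same loop iterations, so I expected the branch on mask to be reported as non-divergent.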