The following compute shader writes incorrect values (1, 2, 3) into the last column of the output matrix:
#version 450
layout(std140, binding = 4) restrict writeonly buffer Output { mat3x4 outMatrix[]; };

void translate(inout mat4 m, in vec3 v)
{
    m[3].xyz += v;
}

mat4 run()
{
    vec3 d = vec3(1,2,3);
    mat4 result = mat4(1.0);
    // result[3].xyz += d;
    translate(result, d);
    while (true) {
        if (true) {
            // result[3].xyz += d;
            translate(result, d);
            break;
        }
    }
    return result;
}

layout(local_size_x = 1) in;
void main() {
    mat4 outMat = transpose(run());
    outMatrix[gl_LocalInvocationID.x] = mat3x4(outMat[0], outMat[1], outMat[2]);
}
If we replace translate(result, d); with the equivalent result[3].xyz += d; the problem goes away and the last column is set to (2, 4, 6).
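For reference, the value the shader should write can be worked out on the CPU. Below is a minimal Python sketch that mirrors run() and main() step by step. Matrices are stored as lists of columns, as in GLSL, and the helper names (identity4, expected_output) are mine for illustration, not part of the repro.

```python
# CPU reference for the shader above. A matrix is a list of 4 columns,
# so m[3] is the fourth column, exactly like m[3] in GLSL.

def identity4():
    # mat4(1.0): ones on the diagonal, zeros elsewhere
    return [[1.0 if r == c else 0.0 for r in range(4)] for c in range(4)]

def translate(m, v):
    # GLSL: m[3].xyz += v;
    for i in range(3):
        m[3][i] += v[i]

def run():
    d = [1.0, 2.0, 3.0]
    result = identity4()
    translate(result, d)
    while True:
        if True:
            translate(result, d)
            break
    return result

def transpose(m):
    # Column c of the transposed matrix is row c of m.
    return [[m[r][c] for r in range(4)] for c in range(4)]

def expected_output():
    out_mat = transpose(run())
    # mat3x4(outMat[0], outMat[1], outMat[2]): keep the first three columns
    return out_mat[:3]

if __name__ == "__main__":
    for column in expected_output():
        print(column)
```

Each of the three stored vec4s is one row of the original matrix, so their last components should be 2, 4, 6 after the two translate calls.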
This is a minimal program that reproduces the bug:
It prints the xyz components of the last column of the matrix twice. The first line corresponds to the shader with translate(result, d); and the second to the shader with result[3].xyz += d; instead. It compiles easily with CMake on both Windows and Linux.
On a GeForce RTX 2060 with the 566.36 driver, on both Windows and Linux, it prints (last two lines):
1 2 3
2 4 6
I also tried running it on Intel HD Graphics 2500, AMD Vega 7, AMD Vega 8, and Mesa llvmpipe. All of them print
2 4 6
2 4 6
as expected.
This looks very similar to this bug, but I could not reproduce that one. It may have been fixed for some cases, but not for the one I encountered.
Here’s the disassembly(?) of the buggy shader, obtained from glGetProgramBinary on the GeForce RTX 2060:
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 1;
STORAGE sbo_buf0[] = { program.storage[0] };
TEMP R0;
TEMP T;
TEMP RC;
SHORT TEMP HC;
REP.S ;
SEQ.U.CC HC.x, {1, 0, 0, 0}, {0, 0, 0, 0};
BRK (NE.x);
MOV.U.CC RC.x, {1, 0, 0, 0};
BRK (NE.x);
ENDREP;
MUL.S R0.x, invocation.localid, {48, 0, 0, 0};
MOV.S R0.x, R0;
STB.F32X4 {1, 0, 0, 0}.xyyx, sbo_buf0[R0.x];
STB.F32X4 {0, 1, 2, 0}.xyxz, sbo_buf0[R0.x + 16];
STB.F32X4 {0, 1, 3, 0}.xxyz, sbo_buf0[R0.x + 32];
END
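It looks like the compiler folds the whole computation into constants and ends up storing incorrectly precomputed values in the output buffer: the three STB instructions write (1, 0, 0, 1), (0, 1, 0, 2) and (0, 0, 1, 3), so the last components are 1, 2, 3 instead of 2, 4, 6.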
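To double-check my reading of the listing, here is a small Python sketch that decodes the constant operands of the three STB.F32X4 instructions. It assumes that the suffix after each constant is an ordinary x/y/z/w component swizzle, which is how I read this assembly syntax; the swizzle helper is just for illustration.

```python
# Decode the constant operands of the three STB.F32X4 stores.
# Assumption: ".xyyx" etc. select components of the constant vector
# in the usual x=0, y=1, z=2, w=3 order.

def swizzle(const, pattern):
    index = {"x": 0, "y": 1, "z": 2, "w": 3}
    return [const[index[ch]] for ch in pattern]

# (byte offset, constant vector, swizzle) copied from the listing
stores = [
    (0,  (1, 0, 0, 0), "xyyx"),
    (16, (0, 1, 2, 0), "xyxz"),
    (32, (0, 1, 3, 0), "xxyz"),
]

rows = [swizzle(const, pattern) for _, const, pattern in stores]
last_components = [row[3] for row in rows]

if __name__ == "__main__":
    for (offset, _, _), row in zip(stores, rows):
        print(offset, row)
    print("last components:", last_components)
```

The offsets 0, 16 and 32 with a 48-byte stride per invocation match the std140 layout of a mat3x4 (three vec4 columns), so the stores go to the right place and only the constant data is wrong.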