Compute shader causing internal compiler error

A compute shader using atomics keeps failing to compile and has also caused a few crashes. Using atomicCounterCompSwapARB() and atomicCounterExchangeARB() in various combinations to get around it causes similar problems.

Shader Program Error: Compute info

Internal error: assembly compile error for compute shader at offset 11781:
– error message –
line 472, column 11: error: unknown opcode modifier
– internal assembly text –
OPTION NV_shader_atomic_counters;
OPTION NV_internal;
OPTION NV_gpu_program_fp64;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
OPTION ARB_shader_image_size;

cgc version 3.4.0001, build date Jul 10 2016

command line args:

#vendor NVIDIA Corporation
#version COP Build Date Jul 10 2016
#profile gp5cp
#program main
#semantic PointBuffer : SBO_BUFFER[0]
#semantic FaceBuffer : SBO_BUFFER[1]
#semantic VolumeMap : IMAGE[0]
#semantic PointDispatchCount : COUNTER[0]0
#semantic PointDispatchCount2 : COUNTER[0]1
#semantic PointDispatchCount3 : COUNTER[0]2
#semantic PointCount : COUNTER[0]3
#semantic FaceCount : COUNTER[1]0
#var uint3 gl_GlobalInvocationID : $vin.GBLID : GBLID[3] : -1 : 1
#var uint points[0].position : : sbo_buffer[0][0] : -1 : 1
#var uint points[0].colour : : sbo_buffer[0][4] : -1 : 1
#var uint points[0].adjacency : : sbo_buffer[0][8] : -1 : 1
#var uint4 faces[0] : : sbo_buffer[1][0] : -1 : 1
#var ulong VolumeMap.__handle : : c[0] : -1 : 1
#var uint PointDispatchCount : COUNTER[0]0 : counter[0][0] : -1 : 1
#var uint PointDispatchCount2 : COUNTER[0]1 : counter[0][1] : -1 : 0
#var uint PointDispatchCount3 : COUNTER[0]2 : counter[0][2] : -1 : 0
#var uint PointCount : COUNTER[0]3 : counter[0][3] : -1 : 1
#var uint FaceCount : COUNTER[1]0 : counter[1][0] : -1 : 1
PARAM c[1] = { program.local[0] };
STORAGE sbo_buf0 = {[0] };
STORAGE sbo_buf1 = {[1] };
COUNTER atomic_counter0 = { program.counter[0] };
COUNTER atomic_counter1 = { program.counter[1] };
TEMP R0, R1, R2, R3, R4;
PK64.U D0.x, c[0];
LOADIM.U32 R0, invocation.globalid, handle(D0.x), 3D;
MOV.U R1, R0;
BFE.U R0.y, {8, 8, 0, 0}.x, R0.x;
BFE.U R0.z, {8, 16, 0, 0}, R0.x;
BFE.U R0.w, {8, 24, 0, 0}, R0.
Failed to compile GL program

In fact, after moving things around and trying many variations, atomic counters seem flaky overall, and using the extended counter ops gives me lots of system hangs :-(

Specifically, atomicCounter() seems to return the wrong result, which I verified by reading the buffer back to the CPU. I’ve tried a million variations on barriers just to be sure and things still seem wrong.

I’ve managed to work around the problem by running another compute shader that binds the atomic counter buffer as an SSBO and writes to a dispatch indirect parameter buffer, also via an SSBO.

Could something be going wrong in the shader compiler when it encounters atomic operations inside conditionals that depend on texture reads?

Could you please add some more system information details?

Any report of this kind needs to list at least this basic information:
OS version, installed GPU(s), display driver version.

It would also be helpful to get the input compute shader to be able to check where the failure comes from.

It’s on a Win7 64-bit PC with a GTX 1080, an i7-3770K and 32GB RAM. Latest public drivers (368.81 as of today) and an OpenGL 4.5 context.

At the moment I have worked around one instance of the problem with an extra compute shader, which is something like:

layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0, std430) readonly restrict buffer CountBuffer
{
	UInt count;
};
layout(binding = 1, std430) writeonly restrict buffer DispatchBuffer
{
	UInt batchCount[3];
};
void main()
{
	UInt dispatchCount = count;

	//Round up so a partially filled group still gets dispatched
	dispatchCount = dispatchCount / GroupSize + UInt(dispatchCount % GroupSize != 0U);

	batchCount[0] = dispatchCount;
	batchCount[1] = 1;
	batchCount[2] = 1;
}
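
The rounding used in that workaround can be checked on the CPU. Below is a minimal C sketch of the same round-up division. The function name `dispatch_group_count` is hypothetical, and `group_size` stands in for the shader's GroupSize constant, whose value is not given in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the shader arithmetic above:
   the number of work groups needed to cover `count` items. */
uint32_t dispatch_group_count(uint32_t count, uint32_t group_size)
{
    assert(group_size != 0u); /* would divide by zero otherwise */
    /* Quotient, plus one extra group if there is a remainder.
       Unlike (count + group_size - 1) / group_size, this form
       cannot overflow for large counts. */
    return count / group_size + (uint32_t)(count % group_size != 0u);
}
```

With a group size of 128, a count of 129 needs two groups and a count of 0 needs none, which is the behaviour the shader expression is meant to have.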

As for the actual troublesome compute shaders, they are a bit difficult to extract from my current mess at the moment, but they are something like:

//I have tried variations of putting multiple counters in one buffer and also separate ones...
layout(binding = 0, offset = 0)	uniform atomic_uint SomeOtherCount;
layout(binding = 1, offset = 0)	uniform atomic_uint Count;
layout(binding = 2, offset = 0)	uniform atomic_uint DispatchCount;

void main()
{
	uvec4 colour = imageLoad(VolumeMap, position);
	int test2 = FuncWithMultipleImageReads();

	if(colour.a > 0 && test2 != 0x3F)
	{
		points[atomicCounterIncrement(Count)] = ...
		otherList[atomicCounterIncrement(SomeOtherCount)] = ...
	}

	//Tried various peppering of the code with barriers() just in case...

	if(gl_GlobalInvocationID == ivec3(0, 0, 0))
	{
		//This counter read seems to screw up - have checked by reading it back to the CPU
		UInt finalDispatchCount = atomicCounter(Count);

		finalDispatchCount = finalDispatchCount / GroupSize + UInt(finalDispatchCount % GroupSize != 0U);

		//Tried various combos of writing the dispatch count via atomics and also a SSBO but all seem to read the atomic counter value wrong above?

		//atomicCounterAddARB(DispatchCount, finalDispatchCount);
		//atomicCounterExchangeARB(DispatchCount, finalDispatchCount);
		//atomicCounterCompSwapARB(DispatchCount, 0, finalDispatchCount);
	}
}


I think the work group size is 8^3.

Sorry for that mess but I am typing from memory as I don’t have it in front of me right now either.

If I try to introduce atomicCounterCompSwapARB() anywhere, the shader compiler dies, and trying things like atomicCounterExchangeARB() has randomly killed the display driver and even the PC.
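
For comparison, the intended behaviour of these counter operations can be sketched on the CPU with C11 atomics. This is only an analogue, assuming the ARB_shader_atomic_counter_ops semantics in which each call returns the counter value from before the operation. The `counter_*` and `append_item` names are hypothetical, and nothing here touches the GPU or the driver.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* CPU stand-ins (hypothetical names) for the GLSL built-ins.
   Each returns the counter value from before the operation. */
uint32_t counter_increment(_Atomic uint32_t *c)
{
    return atomic_fetch_add(c, 1u);
}

uint32_t counter_add(_Atomic uint32_t *c, uint32_t data)
{
    return atomic_fetch_add(c, data);
}

uint32_t counter_exchange(_Atomic uint32_t *c, uint32_t data)
{
    return atomic_exchange(c, data);
}

uint32_t counter_comp_swap(_Atomic uint32_t *c, uint32_t compare, uint32_t data)
{
    uint32_t expected = compare;
    /* Writes `data` only if the counter equals `compare`. On failure,
       `expected` is overwritten with the current counter value, so in
       both cases it ends up holding the original value. */
    atomic_compare_exchange_strong(c, &expected, data);
    return expected;
}

/* Append pattern from the shader: the pre-increment value is the slot. */
void append_item(_Atomic uint32_t *count, uint32_t *list,
                 uint32_t capacity, uint32_t value)
{
    uint32_t slot = counter_increment(count);
    assert(slot < capacity); /* sketch only: no overflow handling */
    list[slot] = value;
}
```

In this model, a compare-and-swap against 0 (as in the commented-out atomicCounterCompSwapARB call above) only stores the new value while the counter is still zero, and leaves it untouched otherwise.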

I am not quite sure why I started using atomic counters to write the dispatch indirect count when an SSBO write would suffice, but the driver still seems to dislike something. I suspect it may be the conditionals around the atomic ops, which depend on image read results.

I littered the client C code with ALL barriers while trying to get it to work, but only the extra compute shader glue makes it work. Without it, the dispatch count is randomly zero or some number that is not right.

No problem, take your time. We would need a minimal reproducer in failing state.
I’m just trying to gather as precise error information as possible before pointing the compiler team at this issue.

Whilst pondering how best to produce a minimal repro case, I have run into what may be another issue as I sort through the mess.

I am encountering spurious full system crashes when using glDispatchComputeIndirect(0) with parameters written by the first workaround compute shader I posted above. I still have the code full of glMemoryBarrier(GL_ALL_BARRIER_BITS) calls for sanity. If I read the dispatch parameters back from the GPU they are correct, and if I then use them for a CPU-side dispatch everything seems to work fine. The dispatches have these sizes:

dispatch [15199 1 1]
dispatch [3606 1 1]
dispatch [835 1 1]
dispatch [192 1 1]
dispatch [43 1 1]
dispatch [10 1 1]
dispatch [2 1 1]

The compute shader has local size [128 1 1] currently.
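
The parameters that glDispatchComputeIndirect() reads are three consecutive unsigned 32-bit group counts. A minimal C sketch of a check on values read back from the buffer follows. The struct, macro and function names are hypothetical, and the 65535 per-dimension limit is the minimum OpenGL guarantees for GL_MAX_COMPUTE_WORK_GROUP_COUNT, used here as an assumption in place of querying the real limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the three uints read at the indirect offset (names hypothetical). */
typedef struct {
    uint32_t num_groups_x;
    uint32_t num_groups_y;
    uint32_t num_groups_z;
} DispatchIndirectArgs;

/* Assumption: the minimum guaranteed per-dimension group count limit. */
#define ASSUMED_MAX_GROUP_COUNT 65535u

/* True if every dimension is within the assumed limit. */
bool dispatch_args_in_range(const DispatchIndirectArgs *args)
{
    assert(args != NULL);
    return args->num_groups_x <= ASSUMED_MAX_GROUP_COUNT
        && args->num_groups_y <= ASSUMED_MAX_GROUP_COUNT
        && args->num_groups_z <= ASSUMED_MAX_GROUP_COUNT;
}

/* Total invocations launched for a one-dimensional local size. */
uint64_t total_invocations(const DispatchIndirectArgs *args, uint32_t local_size_x)
{
    assert(args != NULL);
    return (uint64_t)args->num_groups_x * args->num_groups_y
         * args->num_groups_z * local_size_x;
}
```

By this check, the largest dispatch listed above, [15199 1 1], is well inside the assumed limit, which is consistent with the same values working when passed to glDispatchCompute() from the CPU.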

I have no errors from the GL debug context, and I can’t seem to debug/step a GTX 1080 with Nsight to check which buffers it thinks are bound etc :-(

The crashes take out just the display driver if I am lucky, but more often than not they take down the whole system, which makes for tedious iteration. Things do appear to work correctly for the first few seconds, as I see the visual result of the compute shader as intended, but then that vanishes, things suddenly get slow, and then usually boom.

Some further info on this as I dig deeper, since I am having trouble extracting a simple repro for this…

I have something of the form of:

//Only one of the two parameter-writing blocks is compiled in (selected by an #ifdef)
for(int outerIndex = 0; outerIndex < outerCount; ++outerIndex)
{
	//Crashing variant: write the dispatch parameters once per outer iteration
	glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffers->readonly1);
	glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, buffers->dispatch);
	glDispatchCompute(1, 1, 1);

	glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, buffers->dispatch);

	for(int innerIndex = 0; innerIndex < innerCount; ++innerIndex)
	{
		//Working variant: write the dispatch parameters before every use
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffers->readonly1);
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, buffers->dispatch);
		glDispatchCompute(1, 1, 1);

		glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, buffers->dispatch);

		glProgramUniform1ui(computePrograms->draw, glGetUniformLocation(computePrograms->draw, "Flag"), flag);
		glBindTextures(0, 1, &textures->volume);
		glBindBufferBase(GL_UNIFORM_BUFFER, 0, buffers->readonly2);
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffers->readwrite1);
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, buffers->readonly3);
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, buffers->readonly4);
		glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, buffers->readonly5);
		//(rest of the loop body, ending in glDispatchComputeIndirect(0), omitted)
	}
}


If I use an #ifdef to move the code that writes the dispatch parameters from the inner loop to the outer loop, it causes a driver crash. This is obviously a separate issue from the shader compiler failure.

So it seems that if I try to reuse an indirect buffer without changing its contents it causes a crash, but if I write it each time before use it’s ok?

Also, I just tried the 369 beta drivers and it still crashes. I also tried a GTX 680 card and it still crashes. No errors are flagged by the GL debug context, even with the memory barriers changed to full ones everywhere.

If I #define it to crash but replace the glDispatchComputeIndirect(0) with glDispatchCompute(15199, 1, 1) it then works without a crash (15199 is the max items and an if guard in the shader caps them etc).

I plugged in an RX 480 to sanity check my indirect dispatches, and everything functions correctly with no errors from the GL debug context. So I am pretty sure(!) you have a serious system crash bug in your driver regarding glDispatchComputeIndirect(), from the GTX 680 up to the GTX 1080.

I can’t check the other atomic problem that failed in your shader compiler against their shader compiler with the RX480 as it doesn’t support the right extensions.