OpenGL Compute Shader SSBO Write Performance Issue

I’m finding that whenever I use compute shaders that operate on data structures larger than a vec4, I get huge performance problems. The shader below, which re-orders a buffer after a sort, is around 800 times slower than re-ordering vec4s. I measured this with OpenGL timer queries.

I wonder whether anybody else is using compute shaders with larger data structures without running into this. Either I’m doing something fundamentally wrong, or I should expect this behaviour, or there is a driver issue. I am using the latest Game Ready driver, 388.00, with a GTX 1060 6GB.

I’ve tried packing the struct using only vec4s, but without success; a slightly larger struct just causes slightly slower execution (1.9 ms to 2.0 ms). Writing the buffer without re-ordering, or writing the elements separately, makes no difference to execution speed. The only thing that does is the size of the struct written.

Here’s an example of a very slow shader.

#version 430

precision mediump float;

struct ConvexHull {
  vec3  position;
  uint  enabled;
  vec3  half_ex;
  uint  hash;
  vec4  verts_0[8];
  vec4  planes_n[6];
  vec4  planes_d[6];
};

layout(local_size_x = 128) in;

// "in" and "out" are reserved words in GLSL, so the arrays are
// named hulls_in and hulls_out here.
layout(binding = 0, std430) readonly buffer In {
  ConvexHull hulls_in[];
};

layout(binding = 1, std430) writeonly buffer Out {
  ConvexHull hulls_out[];
};

layout(binding = 2, std430) readonly buffer SortData {
  uvec4 sort_buf[];
};

void main() {
  uint index = gl_GlobalInvocationID.x;
  // Gather: copy the element whose source index is stored in sort_buf[index].y.
  hulls_out[index] = hulls_in[sort_buf[index].y];
}

The more I look at this, the more I think there could be a driver issue. If I don’t use double buffering I get a speed-up of 100 times, and for every vec4 I eliminate from the struct write the speed doubles. The size of the buffer or struct makes no difference, only the number of bytes written each time.
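To put a number on "bytes written each time", here is a rough host-side sketch of the struct's footprint. It assumes the std430 rules lay the fields out as I expect (a vec3 aligns to 16 bytes but occupies 12, so the uint after it fills the gap). The names Vec4, kVec4sPerHull and reorder_traffic_bytes are made up for this illustration and are not part of my actual code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical C++ mirror of a GLSL vec4 (16 bytes, 16-byte aligned).
struct alignas(16) Vec4 { float x, y, z, w; };

// Hypothetical C++ mirror of the shader's ConvexHull under assumed std430 layout.
struct alignas(16) ConvexHull {
    float         position[3];  // vec3 at offset 0, occupies 12 bytes
    std::uint32_t enabled;      // packs into the 4 bytes after the vec3
    float         half_ex[3];   // vec3 at offset 16
    std::uint32_t hash;         // offset 28
    Vec4          verts_0[8];   // offset 32, 128 bytes
    Vec4          planes_n[6];  // offset 160, 96 bytes
    Vec4          planes_d[6];  // offset 256, 96 bytes
};

// Compile-time checks that the mirror matches the offsets listed above.
static_assert(offsetof(ConvexHull, half_ex) == 16, "half_ex offset");
static_assert(offsetof(ConvexHull, verts_0) == 32, "verts_0 offset");
static_assert(offsetof(ConvexHull, planes_n) == 160, "planes_n offset");
static_assert(offsetof(ConvexHull, planes_d) == 256, "planes_d offset");
static_assert(sizeof(ConvexHull) == 352, "one hull is 22 vec4s");

// Number of vec4-sized chunks each invocation has to write per hull.
constexpr std::size_t kVec4sPerHull = sizeof(ConvexHull) / sizeof(Vec4);

// Bytes read plus bytes written when `count` hulls are re-ordered once.
constexpr std::size_t reorder_traffic_bytes(std::size_t count) {
    return count * sizeof(ConvexHull) * 2;
}
```

So each invocation writes 352 bytes (22 vec4s) compared with 16 bytes in the plain vec4 case, a factor of 22, which on its own does not come close to explaining an 800 times slowdown.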