Possible glDispatchComputeGroupSizeARB bug

I have the following situation.
I dispatch a compute shader for a 2423 (width) x 1310 (height) image.
If I do

glDispatchCompute((GLuint)std::ceilf(width / 39.0f), (GLuint)std::ceilf(height / 39.0f), 1);

and in the shader

layout(local_size_x = 39, local_size_y = 39) in;

everything works as expected everywhere (different workstations with different Quadro cards).
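For reference, the group count used above is a ceiling division. A minimal sketch of the same calculation in integer arithmetic, which avoids the float round trip (the helper name `groupCount` is hypothetical and not part of OpenGL):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an OpenGL function): number of work groups
// needed to cover `extent` pixels when each group spans `localSize`
// invocations along that axis. This is an integer ceiling division.
inline std::uint32_t groupCount(std::uint32_t extent, std::uint32_t localSize)
{
    assert(localSize != 0);  // a local size of zero is never valid
    return (extent + localSize - 1) / localSize;
}
```

For the 2423 x 1310 image and a local size of 39 this gives 63 x 34 groups, the same values as the `ceilf` expressions. The dispatch then covers 2457 x 1326 invocations, so the shader presumably has to skip invocations that fall outside the image.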

If I do

const GLuint numGroupsX = static_cast<GLuint>(std::ceilf(width / static_cast<float>(39)));
const GLuint numGroupsY = static_cast<GLuint>(std::ceilf(height / static_cast<float>(39)));
glDispatchComputeGroupSizeARB(numGroupsX, numGroupsY, 1, 39, 39, 1);

and in the shader

layout( local_size_variable ) in;

the application hangs in the next OpenGL command, which is glDeleteTextures (so I get no GL error or debug output). This happens only with the Quadro 4000 and only with a local group size larger than 32 x 32; with a Quadro K2000 everything is fine (39 x 39 or 32 x 32). I use Windows 7 Ultimate 64-bit and the NVIDIA driver 377.35.

For the Quadro 4000:


returns 1536,1024,64 for x,y,z


returns 1536


extension is available

If I use a local group size of only 32 x 32, everything is okay (33 x 33 hangs again).

So to me it looks like either GL_MAX_COMPUTE_VARIABLE_GROUP_INVOCATIONS_ARB returns the wrong value (maybe it should only be 1024) or the implementation of glDispatchComputeGroupSizeARB has a bug.
Or am I overlooking something?
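The arithmetic behind that suspicion can be sketched as follows. The struct and function names are hypothetical, and the limit values are simply the numbers quoted above for the Quadro 4000 (assumed here to be the variable-group-size limits), not values read from a driver:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the per-axis and total limits of one device.
struct GroupLimits {
    std::uint32_t maxSize[3];      // per-axis limit for x, y, z
    std::uint32_t maxInvocations;  // limit on x * y * z
};

// Total invocations in one work group.
inline std::uint32_t invocations(std::uint32_t x, std::uint32_t y, std::uint32_t z)
{
    assert(x != 0 && y != 0 && z != 0);  // zero-sized groups are invalid
    return x * y * z;
}

// True if the local group size respects both the per-axis limits
// and the total invocation limit.
inline bool fitsLimits(const GroupLimits& lim,
                       std::uint32_t x, std::uint32_t y, std::uint32_t z)
{
    return x <= lim.maxSize[0] && y <= lim.maxSize[1] && z <= lim.maxSize[2]
        && invocations(x, y, z) <= lim.maxInvocations;
}
```

With the reported limits (1536, 1024, 64 and 1536 total), 39 x 39 x 1 = 1521 invocations should be allowed. With an assumed total of 1024 instead, 32 x 32 x 1 = 1024 still fits while 33 x 33 x 1 = 1089 does not, which matches exactly where the hang starts.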

Any hint is appreciated.