Radix Sort - OpenGL Compute Shader

Hello, I have implemented radix sort in an OpenGL compute shader:
https://github.com/cNoNim/radix-sort
I have an AMD GPU, and the algorithm works perfectly on it.
But the same test executable fails on an NVIDIA GPU for some array sizes.
I tried debugging, but I don't know why the tests fail.
I tried different array sizes. With large arrays (524288 elements and up in the NVIDIA log below) the tests fail every time.
I tried to simplify the task:
https://github.com/cNoNim/radix-sort/tree/simple
In this branch the test sorts an increasing sequence of unsigned ints; it is a key-only sort.
On AMD it works perfectly:

OpenGL 4.5.13399 Compatibility Profile Context 16.201.1151.1007
        ATI Technologies Inc.
        AMD Radeon HD 6700M Series
count   67108864 elapsed   47376119 ticks 4.73761190 sec speed   14165124 per sec        - PASSED
count   33554432 elapsed   29163211 ticks 2.91632110 sec speed   11505739 per sec        - PASSED
count   16777216 elapsed   14703350 ticks 1.47033500 sec speed   11410471 per sec        - PASSED
count    8388608 elapsed    7364902 ticks 0.73649020 sec speed   11389979 per sec        - PASSED
count    4194304 elapsed    3611232 ticks 0.36112320 sec speed   11614606 per sec        - PASSED
count    2097152 elapsed    1817044 ticks 0.18170440 sec speed   11541558 per sec        - PASSED
count    1048576 elapsed     950502 ticks 0.09505020 sec speed   11031812 per sec        - PASSED
count     524288 elapsed     626875 ticks 0.06268750 sec speed    8363517 per sec        - PASSED
count     262144 elapsed     267127 ticks 0.02671270 sec speed    9813459 per sec        - PASSED
count     131072 elapsed     146703 ticks 0.01467030 sec speed    8934513 per sec        - PASSED
count      65536 elapsed      90163 ticks 0.00901630 sec speed    7268613 per sec        - PASSED
count      32768 elapsed      67645 ticks 0.00676450 sec speed    4844112 per sec        - PASSED
count      16384 elapsed      53292 ticks 0.00532920 sec speed    3074382 per sec        - PASSED
count       8192 elapsed      46483 ticks 0.00464830 sec speed    1762364 per sec        - PASSED
count       4096 elapsed      41232 ticks 0.00412320 sec speed     993403 per sec        - PASSED
count       2048 elapsed      41612 ticks 0.00416120 sec speed     492165 per sec        - PASSED
count       1024 elapsed      37630 ticks 0.00376300 sec speed     272123 per sec        - PASSED
COMPLETE

But on NVIDIA:

OpenGL 4.5.0 NVIDIA 358.87
        NVIDIA Corporation
        GeForce GT 720/PCIe/SSE2/3DNOW!
count   67108864 elapsed   51996058 ticks 5.19960580 sec speed   12906529 per sec        - FAILED
count   33554432 elapsed   46898167 ticks 4.68981670 sec speed    7154742 per sec        - FAILED
count   16777216 elapsed   23762439 ticks 2.37624390 sec speed    7060393 per sec        - FAILED
count    8388608 elapsed   11960835 ticks 1.19608350 sec speed    7013396 per sec        - FAILED
count    4194304 elapsed    6035597 ticks 0.60355970 sec speed    6949277 per sec        - FAILED
count    2097152 elapsed    3058258 ticks 0.30582580 sec speed    6857341 per sec        - FAILED
count    1048576 elapsed    1597550 ticks 0.15975500 sec speed    6563650 per sec        - FAILED
count     524288 elapsed     874843 ticks 0.08748430 sec speed    5992938 per sec        - FAILED
count     262144 elapsed     523463 ticks 0.05234630 sec speed    5007880 per sec        - PASSED
count     131072 elapsed     322490 ticks 0.03224900 sec speed    4064374 per sec        - PASSED
count      65536 elapsed     224269 ticks 0.02242690 sec speed    2922205 per sec        - PASSED
count      32768 elapsed     178335 ticks 0.01783350 sec speed    1837440 per sec        - PASSED
count      16384 elapsed     156417 ticks 0.01564170 sec speed    1047456 per sec        - PASSED
count       8192 elapsed     145962 ticks 0.01459620 sec speed     561241 per sec        - PASSED
count       4096 elapsed     139295 ticks 0.01392950 sec speed     294052 per sec        - PASSED
count       2048 elapsed     136576 ticks 0.01365760 sec speed     149953 per sec        - PASSED
count       1024 elapsed     135250 ticks 0.01352500 sec speed      75711 per sec        - PASSED
COMPLETE

Can anybody help me understand what causes this behavior?

I tried dumping the intermediate results and comparing them with the AMD output. But the failure may not be in the first pass of the radix sort, and I still don't understand it. What else can I try?
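
One way to narrow this down, as a rough sketch and not code from the repository: build a CPU reference that produces the expected array after every radix pass, then compare each GPU readback against it to find the first pass that diverges. The names `radix_passes` and `first_mismatch` are made up for illustration, and `RADIX_BITS = 4` is only an assumption; it has to match the digit width the shader actually uses, and the comparison only makes sense if the shader also does a stable least-significant-digit sort.

```python
# Sketch: CPU reference for an LSD radix sort on 32-bit unsigned keys.
# RADIX_BITS = 4 is an assumption; set it to the digit width of the shader.
RADIX_BITS = 4
KEY_BITS = 32


def radix_passes(keys, radix_bits=RADIX_BITS, key_bits=KEY_BITS):
    """Yield (shift, keys_after_pass) for each stable counting-sort pass."""
    buckets = 1 << radix_bits
    mask = buckets - 1
    current = list(keys)
    for shift in range(0, key_bits, radix_bits):
        # Histogram of the current digit.
        counts = [0] * buckets
        for k in current:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum: start offset of every bucket.
        offsets = [0] * buckets
        total = 0
        for d in range(buckets):
            offsets[d] = total
            total += counts[d]
        # Stable scatter into the output array.
        out = [0] * len(current)
        for k in current:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        current = out
        yield shift, list(current)


def first_mismatch(reference, actual):
    """Return the index of the first differing element, or None if equal."""
    for i, (r, a) in enumerate(zip(reference, actual)):
        if r != a:
            return i
    if len(reference) != len(actual):
        return min(len(reference), len(actual))
    return None
```

With the GPU buffer read back after each pass, calling `first_mismatch(expected, readback)` for the matching `shift` gives the first wrong index in the first wrong pass, which may hint at whether the histogram, the prefix sum, or the scatter step is at fault.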

There is an OpenGL forum here:

https://devtalk.nvidia.com/default/board/69/opengl/

I see that there are questions about GLSL there.

You might try posting there.

Yes, it is a GLSL question, but it is also about accelerated computing. :)
Can I move this thread to the other DevTalk board?
Or should I just recreate it there?