Strange poor performance using glNamedBufferSubData

Hi.

I am seeing strangely degraded performance using glNamedBufferSubData.
This is, in summary, the code in question:

Initialization:

#define COMPONENTS_PER_ATTRIBUTE 4
        bytesPerVertex = COMPONENTS_PER_ATTRIBUTE * sizeof(float);

        glCreateVertexArrays(1, &vao);
        glCreateBuffers(1, &vbo);

        // Immutable storage, writable from the client via *BufferSubData
        glNamedBufferStorage(vbo, storageSize, nullptr, GL_DYNAMIC_STORAGE_BIT);

        // Attribute locID reads vec4 floats from binding point 0
        glVertexArrayAttribBinding(vao, locID, 0);
        glVertexArrayAttribFormat(vao, locID, COMPONENTS_PER_ATTRIBUTE, GL_FLOAT, GL_FALSE, 0);
        glEnableVertexArrayAttrib(vao, locID);

        // Attach the VBO to binding point 0 with a stride of one vertex
        glVertexArrayVertexBuffer(vao, 0, vbo, 0, bytesPerVertex);

Pre-rendering (filling the VBO):

auto before = std::chrono::system_clock::now();

glNamedBufferSubData(vbo, offByte, nVtx * bytesPerVertex, vtxBuffer);

auto now = std::chrono::system_clock::now();

There are no other operations between “before” and “now”.
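
(For completeness, the milliseconds in the output below are computed from these two time points; a minimal sketch, with elapsedMs as a hypothetical helper name:)

#include <chrono>
#include <cstdio>

// Fractional milliseconds between the two time points.
// (steady_clock would be the safer choice for intervals,
//  but system_clock is what the code above uses.)
double elapsedMs(std::chrono::system_clock::time_point before,
                 std::chrono::system_clock::time_point now)
{
    return std::chrono::duration<double, std::milli>(now - before).count();
}

// printf("n.vtx: %d - buff ms: %.3f\n", nVtx, elapsedMs(before, now));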

Rendering:

...
    glBindVertexArray(vao);
    glDrawArrays(GL_POINTS, 0, uploadedVtx);

Filling the VBO with glNamedBufferSubData takes a “long” time.
This is the output, with the number of vertices and the timings in milliseconds:

Windows 10 - Driver: 398.11 & 397.31, System: 10.0.16299.125

n.vtx: 401678 - buff ms: 82.284
n.vtx: 3137 - buff ms: 0.005
n.vtx: 437395 - buff ms: 81.983
n.vtx: 3166 - buff ms: 0.005
n.vtx: 421705 - buff ms: 82.430
n.vtx: 3572 - buff ms: 0.010
n.vtx: 443809 - buff ms: 87.970
n.vtx: 3082 - buff ms: 0.004
n.vtx: 439761 - buff ms: 87.311
n.vtx: 3120 - buff ms: 0.005
n.vtx: 441736 - buff ms: 86.954
n.vtx: 3764 - buff ms: 0.004

(There is an asynchronous thread that fills vtxBuffer, and the pre-rendering step waits for it to finish.)
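
(A minimal sketch of that hand-off, assuming a mutex/condition-variable pair; the names here are illustrative, not my actual code:)

#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
bool bufferReady = false;

// Worker thread: after filling vtxBuffer, signal the render thread.
void onBufferFilled() {
    {
        std::lock_guard<std::mutex> lk(mtx);
        bufferReady = true;
    }
    cv.notify_one();
}

// Pre-rendering: block until vtxBuffer is complete, then upload it.
void waitForBuffer() {
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [] { return bufferReady; });
    bufferReady = false;   // consume the buffer, ready for the next fill
}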

The problem does not occur with the non-DSA call, i.e. simply replacing glNamedBufferSubData with:

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferSubData(GL_ARRAY_BUFFER, offByte, nVtx * bytesPerVertex, vtxBuffer);

… and the copy of the buffer is fluid and rapid:

Windows 10 - Driver: 398.11 & 397.31, System: 10.0.16299.125

n.vtx: 440362 - buff ms: 1.582
n.vtx: 496869 - buff ms: 1.843
n.vtx: 418549 - buff ms: 1.565
n.vtx: 459380 - buff ms: 1.642
n.vtx: 446853 - buff ms: 1.597
n.vtx: 443460 - buff ms: 1.617
n.vtx: 466572 - buff ms: 1.681
n.vtx: 423418 - buff ms: 1.760

Also:
To get the timings I used the std::chrono functions, because if I issue glBeginQuery(GL_TIME_ELAPSED, …) just before calling glNamedBufferSubData, the problem does not occur.
The same thing happens if I run under Nsight (or CodeXL): the performance is better than without Nsight (as if I were using the non-DSA call glBufferSubData).
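
(For reference, the timer-query variant wraps the upload roughly like this; a minimal sketch, and note that the blocking GL_QUERY_RESULT readback is itself a sync point:)

GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
glNamedBufferSubData(vbo, offByte, nVtx * bytesPerVertex, vtxBuffer);
glEndQuery(GL_TIME_ELAPSED);

// Blocks until the GPU has produced the result
GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
printf("query ms: %.3f\n", elapsedNs / 1.0e6);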

For this Nsight behavior I have found similarities with these two other topics, where the performance is better under Nsight:

https://devtalk.nvidia.com/default/topic/1023704/opengl/cyclic-framerate-drops/
https://devtalk.nvidia.com/default/topic/1011032/cuda-programming-and-performance/performance-is-better-about-10-when-using-nsight-visual-studio-2015-profiler-than-when-executing-the-exe/

On Linux the problem does not exist. Below is the timing output with glNamedBufferSubData; there is no relevant difference from the non-DSA call, although both are generally slower than on Windows.

Linux Fedora 27 - Driver: 390.67, Kernel: 4.16.15-200.fc27.x86_64

n.vtx: 266338 - buff ms: 1.833
n.vtx: 264983 - buff ms: 1.809
n.vtx: 261718 - buff ms: 1.966
n.vtx: 346953 - buff ms: 2.494
n.vtx: 184308 - buff ms: 1.004
n.vtx: 261846 - buff ms: 2.049
n.vtx: 261651 - buff ms: 1.854
n.vtx: 263732 - buff ms: 1.630

All tests were made on the same machine (i7 6700K / Z270 chipset) with an MSI GTX 1060 6GB.
The Windows builds were made with VS2017, the Linux builds with GCC 7.2.

Thanks and regards
Michele Morrone