OpenGL 4.4 very slow - OpenGL 1.1 very fast - Performance Problem Quadro K4200/K2000

Are there any known issues with the Quadro K4200/K2000 cards running on Dell Precision workstations (dual 8-core E5-2630 Xeons and single 6-core Xeons) with the WHQL 354.56 driver on Windows 10 that would cause OpenGL 4.4 to render much more slowly than OpenGL 1.1 displaying the identical data on the same hardware? A similar problem has been observed on a GeForce 740M with the WHQL 361.43 driver.

OpenGL 1.1, using the old, deprecated client arrays, is extremely fast. The first update takes around 0.32 sec, with subsequent updates of the same data in 0.04-0.09 sec. The code processes all data the same way on each update and does not cache any display data, so with OpenGL 1.1 the video card appears to be caching triangles in the GPU after the first update. OpenGL 4.4 performance appears gated and is 5-10 times slower: display times are typically 0.93-1.25 sec per update, and every update takes about the same amount of time.

The same data run on an HP Envy notebook (i7-4900HQ) with a GeForce 740M and the 361.43 driver on Windows 10 is 2-3x faster under OGL 4.4 than the Quadro K4200, at 0.46-0.62 sec per update. The OGL 1.1 update time on the HP notebook is much faster still: after the first update, updates take 0.04-0.12 sec.

Our application (x64) has just been ported forward from OpenGL 1.0 to a dual system: OpenGL 1.1 for legacy cards, and OpenGL 4.4 for newer cards that support persistent, coherent, pinned memory mapping through glMapBufferRange. Where possible, the two code paths share common code on non-OpenGL calls, including the code that builds the buffers of data to display. The two OpenGL versions very carefully follow separate paths where necessary to set up the correct OpenGL state for each version; no deprecated functions are called in the OGL 4.4 code path. Simple smooth shading with a single directional light and no material properties is used.
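
For reference, the card capability check for choosing the path amounts to something like this (a minimal sketch only, assuming the decision keys off GL_ARB_buffer_storage; the function name is illustrative, not our actual code):

bool SupportsPersistentMapping()	// requires a 3.0+ context to be current
{
	GLint nExtensions = 0;
	glGetIntegerv( GL_NUM_EXTENSIONS, &nExtensions );

	for( GLint i = 0; i < nExtensions; i++ )
	{
		// glGetStringi enumerates extensions one at a time on 3.0+;
		// strcmp comes from <cstring>.
		const char *pszExt = (const char *)glGetStringi( GL_EXTENSIONS, (GLuint)i );

		if( pszExt != NULL && strcmp( pszExt, "GL_ARB_buffer_storage" ) == 0 )
		{
			return true;	// persistent/coherent mapping is available
		}
	}
	return false;			// fall back to the OGL 1.1 client-array path
}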

Unfortunately, Nsight 5.0 doesn't work on this app; it seems to have the same stability issues other users were reporting in November. The Visual Studio 2015 profiler results are inconclusive and vary strangely depending on how the profiler is run (sampling vs. instrumented, run from the exe project or from one of the app's DLLs).

More details:

Both paths use ARB context creation, the compatibility profile, and the same pixel format. Everything displays correctly and consistently on both versions.
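
The context creation for the 4.4 path looks roughly like this (sketch only; error handling omitted, and hDC is assumed to be the window's device context):

const int aAttribs[] =
{
	WGL_CONTEXT_MAJOR_VERSION_ARB,	4,
	WGL_CONTEXT_MINOR_VERSION_ARB,	4,
	WGL_CONTEXT_PROFILE_MASK_ARB,	WGL_CONTEXT_COMPATIBILITY_PROFILE_BIT_ARB,
	0
};

HGLRC hRC = wglCreateContextAttribsARB( hDC, NULL, aAttribs );
wglMakeCurrent( hDC, hRC );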

Build Tools: Application built with Visual Studio 2015 Update 1, x64 build; MFC app with CodeJock, mostly single threaded at this point.

Systems have plenty of RAM (16-32 GB); the OS and app use ~6 GB when running.

Screen Resolution: 1920x1080 (both workstations and HP notebook). The workstations have dual monitors but only display on one monitor.

Function Wrangler: a small C++ object created from the libEpoxy OGL 4.5 header files, which were extracted from the Khronos definitions. (The Epoxy DLL itself had build issues on Windows.) The app is careful to build a specific function pointer table per context, etc. This seems to work fine.
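
The table itself is just typed function pointers resolved per context, roughly like this (a sketch with illustrative struct and loader names, and only a few entries shown):

typedef struct
{
	PFNGLBUFFERSTORAGEPROC	glBufferStorage;
	PFNGLMAPBUFFERRANGEPROC	glMapBufferRange;
	PFNGLFENCESYNCPROC	glFenceSync;
} GLFunctionTable;

// Call with the target context current; wglGetProcAddress results can vary
// per pixel format/driver, hence one table per context.
void LoadFunctionTable( GLFunctionTable *pTable )
{
	pTable->glBufferStorage		= (PFNGLBUFFERSTORAGEPROC)wglGetProcAddress( "glBufferStorage" );
	pTable->glMapBufferRange	= (PFNGLMAPBUFFERRANGEPROC)wglGetProcAddress( "glMapBufferRange" );
	pTable->glFenceSync		= (PFNGLFENCESYNCPROC)wglGetProcAddress( "glFenceSync" );
}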

ARB Profile: Compatibility (runs a bit faster than the Core 4.4 profile)
Pixel Format: 10 (Quadro), full hardware acceleration, double buffered, RGBA, 24-bit depth, 8 bits each R,G,B,A, 8-bit stencil buffer
Display Type: GL_TRIANGLES (fill, shaded)
Polygon Mode: GL_FRONT_AND_BACK, GL_FILL
Draw Buffer : Back
Swap Call: SwapBuffers
Draw Call:
- OGL 1.1 glDrawElements with a simple sequential index
- OGL 4.4 glDrawArrays sourcing from the pinned-memory sub-buffer. Have also tried glDrawElements with OGL 4.4; same performance as glDrawArrays (see the draw sketch after this list).
Data Buffer Size:
- OGL 1.1: a maximum of 1026 vertices (342 triangles) per glDrawElements call.
- OGL 4.4: have tried a range of sizes from 1026 up to 32769 vertices; bigger is somewhat better.
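
The 4.4 draw amounts to something like this (sketch only; gluiBufferName, nSubBuffer, and nVertexCount are illustrative names, GLDrawGeometry is the vertex struct shown in the snippet further down, and offsetof comes from <cstddef>):

void DrawSubBuffer( GLuint gluiBufferName, GLint nSubBuffer, GLsizei nVertexCount )
{
	GLsizei	glsziStride	= (GLsizei)sizeof( GLDrawGeometry );
	GLint	gliFirstVertex	= nSubBuffer * OPENGL_VERTEX_BLOCK_LIST_SIZE;

	// The attribute pointer is a byte offset into the bound, persistently
	// mapped VBO; glDrawArrays selects the sub-buffer by starting vertex.
	// The normal, texcoord, and color attributes are set up the same way.
	glBindBuffer( GL_ARRAY_BUFFER, gluiBufferName );
	glVertexAttribPointer( 0, 3, GL_DOUBLE, GL_FALSE, glsziStride, (const void *)offsetof( GLDrawGeometry, gldVertex3D ) );
	glEnableVertexAttribArray( 0 );

	glDrawArrays( GL_TRIANGLES, gliFirstVertex, nVertexCount );
}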

Lighting: Single directional light with diffuse and ambient terms only.

Vertex Data: Double-precision vertex, normal, and texture data plus 4 bytes of RGBA loaded per vertex and converted to float (the same data as for the OGL 1.1 path)
Shader: Have tried both a 2-stage pipeline (vertex, fragment, with the math in the vertex shader) and a 3-stage pipeline (vertex, geometry, fragment, with the triangle math in the geometry shader).
Have also tried the simplest possible pass-through shader, which made little or no difference in the OGL 4.4 performance.
Shaders are loaded, compiled, and linked once at startup.
For lighting, two 4x4 float matrices and 4 float light parameters are loaded as uniforms. The pass-through shader loads one matrix.
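
For reference, the 2-stage pair is in this spirit (a sketch only; the uniform and attribute names here are illustrative rather than our actual code, and the mat3 normal transform assumes uniform scaling):

static const char *kszVertexShader = R"(
	#version 440 compatibility
	layout(location = 0) in vec3 inPosition;
	layout(location = 1) in vec3 inNormal;
	layout(location = 2) in vec4 inColor;

	uniform mat4 uModelView;	// first of the two 4x4 float matrices
	uniform mat4 uProjection;	// second 4x4 float matrix
	uniform vec4 uLightDir;		// the 4 float light parameters

	out vec4 vColor;

	void main()
	{
		vec3  n = normalize( mat3( uModelView ) * inNormal );
		float d = max( dot( n, normalize( uLightDir.xyz ) ), 0.0 );
		vColor  = vec4( inColor.rgb * ( 0.2 + 0.8 * d ), inColor.a );	// ambient + diffuse only
		gl_Position = uProjection * uModelView * vec4( inPosition, 1.0 );
	}
)";

static const char *kszFragmentShader = R"(
	#version 440 compatibility
	in  vec4 vColor;
	out vec4 fragColor;
	void main() { fragColor = vColor; }
)";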

Persistent GPU Buffer:
- Triple-buffer approach, shown as the best method on various web sites and in examples.
- Memory allocated one time and held for duration of the run.
- glFenceSync used as recommended, with an appropriate wait when changing buffers (see the fence sketch after this list).
- Data is kept in local buffers and copied with a single memcpy to pinned memory for the GPU, avoiding many small transfers and the performance issues those might cause. The memcpy is done just before the glDrawArrays call.
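
The fence handling around each sub-buffer switch is essentially this (sketch only; s_aFences is an illustrative per-sub-buffer array, not our actual member):

static GLsync s_aFences[MAX_NUMBER_OF_GPU_SUB_BUFFERS] = { 0 };

// Before writing into sub-buffer n: wait until the GPU is done reading it.
void WaitForSubBuffer( int n )
{
	if( s_aFences[n] != 0 )
	{
		GLenum glenumResult;
		do
		{	// Wait in 1 ms slices; the flush bit ensures the fence can signal.
			glenumResult = glClientWaitSync( s_aFences[n], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000 );
		} while( glenumResult == GL_TIMEOUT_EXPIRED );

		glDeleteSync( s_aFences[n] );
		s_aFences[n] = 0;
	}
}

// Right after the glDrawArrays for sub-buffer n: drop a fence behind it.
void FenceSubBuffer( int n )
{
	s_aFences[n] = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
}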

Data Size: 130,000 - 260,000 triangles is a typical load for these tests.

OpenGL 4.4 Code snippet on buffer creation (one time only):

typedef struct
{
	GLdouble	gldVertex3D[3];
	GLdouble	gldNormal3D[3];
	GLdouble	gldTexCoord2d[2];
	GLubyte		glubColorRGBA[4];
} GLDrawGeometry;

// A buffer object has already been generated and bound to GL_ARRAY_BUFFER
// (glGenBuffers / glBindBuffer) before this point.
if( m_sVBO.sGLBuffers.pvDrawGeometry == NULL )
{
	GLuint		glsziSubBufferNumberOfElements	= OPENGL_VERTEX_BLOCK_LIST_SIZE;
	GLuint		glsziSubBufferSize		= glsziSubBufferNumberOfElements * sizeof(GLDrawGeometry);
	GLuint		glsziBufferSize			= glsziSubBufferSize * MAX_NUMBER_OF_GPU_SUB_BUFFERS;
	GLbitfield	glbFlagsStorage			= GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
	GLbitfield	glbFlagsMapRange		= GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

	// --------------------------------------------------------------
	//	Create an immutable data store for the geometry buffer.
	// --------------------------------------------------------------
	glBufferStorage( GL_ARRAY_BUFFER, glsziBufferSize, NULL, glbFlagsStorage );

	// --------------------------------------------------------------
	//	Map the buffer persistently; it stays mapped for the
	//	duration of the run.
	// --------------------------------------------------------------
	m_sVBO.nCurrentGPUSubBuffer				= 0;
	m_sVBO.sGLBuffers.glsziGeometryBufferSizeInBytes	= 0;
	m_sVBO.sGLBuffers.pvDrawGeometry			= (GLDrawGeometry *)glMapBufferRange( GL_ARRAY_BUFFER, 0, glsziBufferSize, glbFlagsMapRange );

	if( m_sVBO.sGLBuffers.pvDrawGeometry == NULL )
	{
		return( E_OUTOFMEMORY );
	}

	m_sVBO.sGLBuffers.glsziGeometryBufferSizeInBytes = glsziBufferSize;
}

I appreciate any insight or advice on what could be causing this performance slowdown on the Quadro K4200 card on Windows 10 x64, and on why OpenGL 1.1 is faster than OpenGL 4.4 on both the Quadro and the GeForce cards.

Thanks,

Elaine Acree

Never mind. Fixed it: cast the data to float just prior to calling glDrawArrays.
OpenGL 1.1 was taking the same data and apparently converting it to float far more efficiently than OpenGL 4.4. And far from a 50% performance hit, it was more like a 5-10x performance hit, depending on the card and driver combination. It's now doing wheelies in dynamics :-).
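
For anyone who hits the same wall, the change boils down to filling the mapped buffer with a float layout instead of the double one, roughly like this (a sketch; the struct and function names are illustrative variations on the snippet above, and memcpy comes from <cstring>):

typedef struct
{
	GLfloat	glfVertex3D[3];
	GLfloat	glfNormal3D[3];
	GLfloat	glfTexCoord2D[2];
	GLubyte	glubColorRGBA[4];
} GLDrawGeometryFloat;

// Convert during the fill pass, just before the glDrawArrays call;
// the attribute pointers then use GL_FLOAT instead of GL_DOUBLE.
void ConvertToFloat( const GLDrawGeometry *pSrc, GLDrawGeometryFloat *pDst, size_t nVertices )
{
	for( size_t i = 0; i < nVertices; i++ )
	{
		for( int j = 0; j < 3; j++ )
		{
			pDst[i].glfVertex3D[j] = (GLfloat)pSrc[i].gldVertex3D[j];
			pDst[i].glfNormal3D[j] = (GLfloat)pSrc[i].gldNormal3D[j];
		}
		pDst[i].glfTexCoord2D[0] = (GLfloat)pSrc[i].gldTexCoord2d[0];
		pDst[i].glfTexCoord2D[1] = (GLfloat)pSrc[i].gldTexCoord2d[1];
		memcpy( pDst[i].glubColorRGBA, pSrc[i].glubColorRGBA, 4 );
	}
}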