OpenGL: Default Framebuffer is slower than off-screen Framebuffer

I need to display an OpenGL scene (in fullscreen) as fast as possible.

On a 4K monitor, displaying (rendering onto the Default Framebuffer) takes 4.8 ms.
However, rendering onto an off-screen Framebuffer takes only 1.1 ms.

I decided to ask here and not on an OpenGL forum because on my PC (with a GeForce) the two timings are the same.
Which is what I’d expect, since the Default Framebuffer is still a Framebuffer (only out of reach).
So it seems to be a problem specific to the Jetson TK1.

Is the slowdown of the Default Framebuffer expected on the Jetson TK1?
Or is there “a switch” to make it faster? Where can I read about it?

A couple of thoughts:

  • There is a switch (the env variable __GL_SYNC_TO_VBLANK) to enable VSync, so maybe there is a similar switch that is responsible for the slowdown (see the sketch after this list).
  • X11 could cause the slowdown, but it’s unlikely because it does not affect the speed on the PC.
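
For the VSync switch specifically, here is a minimal sketch of what I mean. It assumes (my assumption, not something I have verified) that the NVIDIA driver reads __GL_SYNC_TO_VBLANK at context-creation time, so setting it from inside the process before glutInit() should be equivalent to exporting it in the shell:

#include <cstdlib>
#include <GL/freeglut.h>

int main(int argc, char *argv[]) {
  // Assumption: the driver picks up __GL_SYNC_TO_VBLANK when the GL context
  // is created, so this has the same effect as `__GL_SYNC_TO_VBLANK=0 ./a.out`.
  setenv("__GL_SYNC_TO_VBLANK", "0", /*overwrite=*/1); // 0 = VSync off.
  glutInit(&argc, argv);
  glutCreateWindow("VSync off");
  // ... the rest of the setup is the same as in the full program below.
  return 0;
}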

Here is the code:

#include <stdio.h>
#include <assert.h>
#define GL_GLEXT_PROTOTYPES
#include <GL/freeglut.h>
#include <Stopwatch.hpp>

//#define USE_FRAMEBUFFER

void Init() {
  int width = glutGet(GLUT_SCREEN_WIDTH), height = glutGet(GLUT_SCREEN_HEIGHT);
  glViewport(0, 0, width, height);

#ifdef USE_FRAMEBUFFER
  GLuint render_buffer;
  glGenRenderbuffers(1, &render_buffer);
  glBindRenderbuffer(GL_RENDERBUFFER, render_buffer);
  glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, width, height);
  assert(glGetError()==GL_NO_ERROR);

  GLuint frame_buffer;
  glGenFramebuffers(1, &frame_buffer);
  glBindFramebuffer(GL_FRAMEBUFFER, frame_buffer);
  glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, render_buffer);
  assert(glGetError()==GL_NO_ERROR);
  assert(glCheckFramebufferStatus(GL_FRAMEBUFFER)==GL_FRAMEBUFFER_COMPLETE);
#endif

  static GLfloat vertex[] = {0, 0,   -1, -1,   1, -1};
  glVertexPointer(2, GL_FLOAT, 0, vertex);
  static GLfloat color[] = {0, 0, 1,   0, 1, 0,   1, 0, 0};
  glColorPointer(3, GL_FLOAT, 0, color);

  glEnableClientState(GL_VERTEX_ARRAY);
  glEnableClientState(GL_COLOR_ARRAY);

  glClearColor(.2f, .2f, .2f, 1); // Background color.
}

void Display() {
  glClear(GL_COLOR_BUFFER_BIT);
  glDrawArrays(GL_TRIANGLES, 0, 3);
  glFinish(); // Wait for the GPU to finish, so the CPU-side timer measures the full frame.
}


int main(int argc, char *argv[]) {
  glutInit(&argc, argv);
#ifdef USE_FRAMEBUFFER
  glutInitWindowSize(640, 480); // Does not matter, since we render to off-screen buffer.
  glutCreateWindow("Demo");
#else
  glutEnterGameMode();
#endif

  Init();

  const int count = 2000;
  Stopwatch sw;
  for (int i = 0; i<count; ++i) Display();
  printf("Average display time is %.2lf ms.\n", sw.Time()/count*1000);
  /* On 4K monitor, with (without) `USE_FRAMEBUFFER` defined: 1.1 ms (4.8 ms).
  If `glFlush()` is used instead of `glFinish()`, then the time is 0.4 ms (2.0 ms). */

  return 0;
}

Compile with the command g++ main.cpp -lglut -lX11 -lGL -lGLU -ldrm
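
In case it matters: the Stopwatch + glFinish() approach measures the whole submit-and-wait round trip on the CPU. A GPU-only number could be taken with timer queries instead. Here is a rough sketch of how Display() could be instrumented, assuming GL_ARB_timer_query (GL 3.3 timer queries) is exposed by the TK1 driver and using the same includes as main.cpp:

void DisplayWithGpuTimer() {
  GLuint query;
  glGenQueries(1, &query);
  glBeginQuery(GL_TIME_ELAPSED, query); // Start timing GPU work.

  glClear(GL_COLOR_BUFFER_BIT);
  glDrawArrays(GL_TRIANGLES, 0, 3);

  glEndQuery(GL_TIME_ELAPSED);
  GLuint64 gpu_ns = 0;                                    // Result in nanoseconds.
  glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpu_ns); // Blocks until the result is ready.
  printf("GPU time: %.3f ms\n", gpu_ns / 1e6);
  glDeleteQueries(1, &query);
}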

Not an answer, but perhaps information you will find useful. I don’t think the video hardware on a JTK1 is able to render to 4k via GPU. I do know 1920x1080 is supported (2k).

Desktop graphics cards have their own memory and transfer over the PCIe bus. On Jetson, the graphics hardware does not have its own memory; it has to use main memory. The GPU is connected to that memory via the memory controller and does not use PCIe. There are quite a few differences between transfers over PCIe and pinned memory directly on the memory HUB. I couldn’t tell you what the optimizations are, but generally speaking, if we were talking about CUDA (which has some similarities with graphics), there are several ways to access the memory, most of which have drastically different performance.

@linuxdev, thanks for the fast reply!

I have to admit I have a <4K monitor, and use xrandr to emulate a 4K Default Framebuffer.
The off-screen Framebuffer, however, works without any tricks.
In both cases the “screenshots” (taken with glReadPixels(); see the sketch below) are 4K.
However, the slowdown is observed independently of the resolution.
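
For completeness, the “screenshot” check mentioned above is roughly the following (a sketch only; width/height are the same values passed to glViewport() in Init(), and it additionally needs #include <vector>):

std::vector<unsigned char> pixels(size_t(width) * height * 4);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
assert(glGetError() == GL_NO_ERROR);
// pixels now holds a width x height RGBA image (3840x2160 in both the
// Default Framebuffer case and the off-screen Framebuffer case).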

Regarding the external memory controller (EMC): yes, that could be the reason.
After sudo bash jetson_clocks_max.sh, sudo ~/tegrastats shows 33% EMC load for the off-screen Framebuffer and 74% (but not 100%, hmm) for the Default Framebuffer.
So it seems that the Default Framebuffer requires some additional memory transfers, which interfere with the other memory traffic and create the slowdown.