How to improve the efficiency of readpixels on Tegra?

I used a PBO in OpenGL to improve the efficiency of reading back the framebuffer.

glReadBuffer(GL_FRONT);
    if(pboUsed) // with PBO
    {
        // read framebuffer ///////////////////////////////
        t1.start();
        // copy pixels from framebuffer to PBO
        // Use an offset instead of a pointer.
        // OpenGL should perform an asynchronous DMA transfer, so glReadPixels() will return immediately.
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[index]);
        glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, 0);
        // measure the time reading framebuffer
        t1.stop();
        readTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
        // process pixel data /////////////////////////////
        t1.start();
        // map the PBO that contains the framebuffer pixels before processing it
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[nextIndex]);
        GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
        if(src)
        {
            // change brightness
            add(src, SCREEN_WIDTH, SCREEN_HEIGHT, shift, colorBuffer);
            glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);     // release pointer to the mapped buffer
        }
        // measure the time spent processing the pixels
        t1.stop();
        processTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
    }

    else        // without PBO
    {
        // read framebuffer ///////////////////////////////
        t1.start();

        glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, colorBuffer);

        // measure the time reading framebuffer
        t1.stop();
        readTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////

        // process pixel data /////////////////////////////
        t1.start();

        // change brightness
        add(colorBuffer, SCREEN_WIDTH, SCREEN_HEIGHT, shift, colorBuffer);

        // measure the time spent processing the pixels
        t1.stop();
        processTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
    }
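
The snippet above relies on index and nextIndex alternating each frame, but that update is not shown in the excerpt. A minimal sketch of how it might look, assuming exactly two PBOs; the PboPingPong name and its advance() method are hypothetical and not taken from the attached source:

```cpp
#include <cstddef>

// Hypothetical helper (not from the posted source): tracks which of two PBOs
// receives this frame's glReadPixels() and which one is mapped for CPU work.
struct PboPingPong {
    static constexpr std::size_t kCount = 2;  // assumption: exactly two PBOs

    std::size_t index = 0;      // PBO that glReadPixels() packs into this frame
    std::size_t nextIndex = 1;  // PBO filled on the previous frame, mapped now

    // Call once per frame, before binding, to swap the two roles.
    void advance() {
        index = (index + 1) % kCount;
        nextIndex = (index + 1) % kCount;
    }
};
```

With this arrangement the CPU always maps the buffer that was filled one frame earlier, so the processed image lags the display by one frame. If the two indices were ever equal, glMapBufferARB would have to wait for the read that was just issued.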

The source code is uploaded in the compressed package below.

PBO ON:
x86 (GeForce GTX 750 Ti): 0.079 ms to readpixels, 1.9 ms to process, 59 fps
Tegra (TX1): 0.125 ms to readpixels, 30.063 ms to process, 30 fps

PBO OFF:
x86 (GeForce GTX 750 Ti): 15.346 ms to readpixels, 0.565 ms to process, 59 fps
Tegra (TX1): 0.365 ms to readpixels, 2.102 ms to process, 245 fps

On Tegra it is slower when I use the PBO to transfer the data.
How to improve the efficiency of readpixels on Tegra?

pboPack.zip (610 KB)

Hi MagicalFingers,

When you say it is slower with the PBO on Tegra, are you referring to the fps?

The PBO usage seems correct.

It looks like the bottleneck is in “process”, which runs on the CPU. Did you raise the CPU clocks with jetson_clock.sh?

Hi WayneWWW,
Yes, I raised the CPU clocks with jetson_clock.sh.
The tegrastats output is as follows:

RAM 1317/3995MB (lfb 421x4MB) cpu [0%,0%,0%,0%]@1734 EMC 14%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 19%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [23%,24%,12%,77%]@1734 EMC 14%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 29%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [8%,29%,8%,95%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 29%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [4%,33%,6%,94%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 29%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [0%,3%,30%,96%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 25%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [24%,9%,5%,95%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 21%@998 EDP limit 1734

but Tegra still spends 30.063 ms in “process”.
When I replace the body of the function “add” with “memcpy(dst, src, width*height*4)”,
using the PBO still gives a lower fps than not using it.
I have to do this memcpy in my project. How can I improve the fps?

Could you share the time elapsed for “add” or “memcpy” alone? Let’s time the bind and map calls separately.

I wonder whether the time spent on data sync between the PBOs is causing the problem.
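
One way to get that breakdown is to wrap each call in its own timer instead of timing the whole “process” block. A minimal sketch using std::chrono; elapsedMs and PboTimings are hypothetical names for illustration and are not part of the attached source:

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the wall-clock time it took,
// in milliseconds.
template <typename F>
double elapsedMs(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical record of one frame's breakdown of the "process" step.
struct PboTimings {
    double bindMs = 0.0;     // glBindBufferARB
    double mapMs = 0.0;      // glMapBufferARB
    double processMs = 0.0;  // add() or memcpy() only
    double unmapMs = 0.0;    // glUnmapBufferARB

    double totalMs() const { return bindMs + mapMs + processMs + unmapMs; }
};
```

In the posted code this could be used as, for example, t.mapMs = elapsedMs([&]{ src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB); }); and likewise for glBindBufferARB, add, and glUnmapBufferARB. Since glReadPixels into a PBO only queues the transfer, the map call may have to wait for it to finish. A large mapMs would point at synchronization, while a large processMs would point at slow CPU reads from the mapped memory.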