How to improve the efficiency of readpixels on Tegra?

I used a PBO in OpenGL to improve readback efficiency.

glReadBuffer(GL_FRONT);
    if(pboUsed) // with PBO
    {
        // read framebuffer ///////////////////////////////
        t1.start();
        // copy pixels from framebuffer to PBO
        // Use an offset instead of a pointer.
        // OpenGL should perform an asynchronous DMA transfer, so glReadPixels() will return immediately.
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[index]);
        glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, 0);
        // measure the time reading framebuffer
        t1.stop();
        readTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
        // process pixel data /////////////////////////////
        t1.start();
        // map the PBO that contains framebuffer pixels before processing it
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[nextIndex]);
        GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
        if(src)
        {
            // change brightness
            add(src, SCREEN_WIDTH, SCREEN_HEIGHT, shift, colorBuffer);
            glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);     // release pointer to the mapped buffer
        }
        // measure the time processing pixel data
        t1.stop();
        processTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
    }

    else        // without PBO
    {
        // read framebuffer ///////////////////////////////
        t1.start();

        glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, colorBuffer);

        // measure the time reading framebuffer
        t1.stop();
        readTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////

        // process pixel data /////////////////////////////
        t1.start();

        // change brightness
        add(colorBuffer, SCREEN_WIDTH, SCREEN_HEIGHT, shift, colorBuffer);

        // measure the time processing pixel data
        t1.stop();
        processTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
    }
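The snippet above uses index and nextIndex without showing how they are updated. A minimal sketch of one common arrangement, assuming two PBOs that swap roles every frame (the PboRing name and its members are hypothetical and not taken from the attached source):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: tracks which PBO receives this frame's glReadPixels()
// and which PBO (filled on an earlier frame) is mapped and processed.
struct PboRing {
    std::size_t count;      // number of PBOs, 2 for double buffering
    std::size_t index = 0;  // PBO written by glReadPixels() this frame
    std::size_t nextIndex;  // PBO mapped and processed this frame

    explicit PboRing(std::size_t n) : count(n), nextIndex(1 % n) {
        assert(n >= 1);
    }

    // Call once per frame, before the read/process block.
    void advance() {
        index = (index + 1) % count;
        nextIndex = (index + 1) % count;
    }
};
```

With two PBOs, the buffer filled in one frame becomes the mapped buffer in the next, so the CPU always processes pixels that are one frame old.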

The source code is uploaded in the compressed package below.

PBO ON:
x86 (GeForce GTX750 Ti): 0.079 ms to read pixels, 1.9 ms to process, 59 fps
Tegra (TX1): 0.125 ms to read pixels, 30.063 ms to process, 30 fps

PBO OFF:
x86 (GeForce GTX750 Ti): 15.346 ms to read pixels, 0.565 ms to process, 59 fps
Tegra (TX1): 0.365 ms to read pixels, 2.102 ms to process, 245 fps

On Tegra, it is slower when I use a PBO to transfer the data.
How can I improve the efficiency of glReadPixels on Tegra?

pboPack.zip (610 KB)

Hi MagicalFingers,

Are you referring to the fps when you say it is slower with the PBO on Tegra?

The PBO usage seems correct.

It looks like the bottleneck is in “process”, which runs on the CPU. Did you maximize the CPU clocks with jetson_clock.sh?

Hi WayneWWW,
Yes, I maximized the CPU clocks with jetson_clock.sh.
The tegrastats output is as follows:

RAM 1317/3995MB (lfb 421x4MB) cpu [0%,0%,0%,0%]@1734 EMC 14%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 19%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [23%,24%,12%,77%]@1734 EMC 14%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 29%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [8%,29%,8%,95%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 29%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [4%,33%,6%,94%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 29%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [0%,3%,30%,96%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 25%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [24%,9%,5%,95%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 21%@998 EDP limit 1734

But Tegra still spends 30.063 ms in “process”.
When I replace the body of the function “add” with “memcpy(dst, src, width*height*4)”,
using the PBO still gives a lower fps than not using it.
I need this memcpy in my project. How can I improve the fps?

Could you share the time elapsed for “add” or “memcpy” alone? Let’s measure the bind and map calls separately.

I wonder whether the data synchronization time between the PBOs is causing the problem.

Hi WayneWWW,

We encountered a similar issue on the tx2-nx. We did not see it on the xavier-nx, where operations on mapped addresses and on normal addresses take a similar time. We ran two tests:

  1. We used mmap to map the kernel address of the camera to the user address, like ptr=mmap(NULL, len , PROT_READ|PROT_WRITE, MAP_SHARED , fd , 0), then conducted a memcpy test on 7 sets of 1780x720x2 yuyv data. The results are as follows:

mmap user address1 → user address2: 30.3ms
Normal user address1 → user address2: 5.6ms

  2. We used OpenGL’s glMapBuffer to map the GPU address to the CPU address for 1 set of 1920x1080x4 data, like GLubyte *ptr = (GLubyte *)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY), then conducted a memcpy test. The results are as follows:

glMapBuffer CPU address1 → CPU address2: 16.3ms
Normal CPU address1 → CPU address2: 3.9ms

Memcpy should not be so slow, or maybe there is something wrong with my usage. How can I optimize it?

According to our test, the bind and map calls are almost instantaneous.