How to improve the efficiency of readpixels on Tegra?

MagicalFingers · March 7, 2018, 2:50pm

I used a PBO (pixel buffer object) in OpenGL to improve readback efficiency.

glReadBuffer(GL_FRONT);
    if(pboUsed) // with PBO
    {
        // read framebuffer ///////////////////////////////
        t1.start();
        // copy pixels from framebuffer to PBO
        // Use an offset instead of a pointer.
        // OpenGL should perform an asynchronous DMA transfer, so glReadPixels() should return immediately.
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[index]);
        glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, 0);
        // measure the time reading framebuffer
        t1.stop();
        readTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
        // process pixel data /////////////////////////////
        t1.start();
        // map the PBO that contains the framebuffer pixels before processing it
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[nextIndex]);
        GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
        if(src)
        {
            // change brightness
            add(src, SCREEN_WIDTH, SCREEN_HEIGHT, shift, colorBuffer);
            glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);     // release pointer to the mapped buffer
        }
        // measure the time spent processing the pixels
        t1.stop();
        processTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
    }

    else        // without PBO
    {
        // read framebuffer ///////////////////////////////
        t1.start();

        glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, colorBuffer);

        // measure the time reading framebuffer
        t1.stop();
        readTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////

        // process pixel data /////////////////////////////
        t1.start();

        // change brightness
        add(colorBuffer, SCREEN_WIDTH, SCREEN_HEIGHT, shift, colorBuffer);

        // measure the time spent processing the pixels
        t1.stop();
        processTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
    }

The source code is uploaded in the compressed package below.
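
The snippet above uses index and nextIndex without showing how they are updated. A minimal sketch of one common scheme, assuming the PBOs are used round-robin and swap roles every frame (the names PboIndices and advance are hypothetical and are not taken from the attached package):

```cpp
#include <cassert>

// Hypothetical helper: tracks which PBO receives glReadPixels() this frame
// and which one (filled on an earlier frame) is mapped for CPU processing.
struct PboIndices {
    int index = 0;      // PBO written by glReadPixels() this frame
    int nextIndex = 1;  // PBO mapped and processed this frame
};

// Advance to the next frame, assuming pboCount PBOs used round-robin.
inline PboIndices advance(PboIndices cur, int pboCount = 2) {
    assert(pboCount >= 2);
    PboIndices next;
    next.index = (cur.index + 1) % pboCount;
    next.nextIndex = (next.index + 1) % pboCount;
    return next;
}
```

With two PBOs, the buffer being mapped is always the one that glReadPixels() filled on the previous frame, so the map ideally does not have to wait for the transfer started in the current frame.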

PBO ON:
x86 (GeForce GTX 750 Ti): 0.079 ms to read pixels, 1.9 ms to process, 59 fps
Tegra (TX1): 0.125 ms to read pixels, 30.063 ms to process, 30 fps

PBO OFF:
x86 (GeForce GTX 750 Ti): 15.346 ms to read pixels, 0.565 ms to process, 59 fps
Tegra (TX1): 0.365 ms to read pixels, 2.102 ms to process, 245 fps

On Tegra, it is slower when I use the PBO to transfer the data.
How can I improve the efficiency of readpixels on Tegra?
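
To relate the stage timings to the frame rates, the two measured stages can be added up and converted to an upper bound on fps. This is only a rough sketch: maxFps is a hypothetical helper, and it assumes nothing else in the frame costs time.

```cpp
#include <cassert>

// Upper bound on frames per second if each frame costs at least
// readMs + processMs of measured time.
inline double maxFps(double readMs, double processMs) {
    const double frameMs = readMs + processMs;
    assert(frameMs > 0.0);
    return 1000.0 / frameMs;
}
```

For Tegra with the PBO on, maxFps(0.125, 30.063) is roughly 33 fps, close to the 30 fps observed, so the process stage accounts for almost the whole frame. With the PBO off, maxFps(0.365, 2.102) is roughly 405 fps, well above the 245 fps observed, so other per-frame work matters there.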

pboPack.zip (610 KB)

WayneWWW · March 8, 2018, 3:10am

Hi MagicalFingers,

When you say it is slower with the PBO on Tegra, are you referring to the fps?

The PBO usage seems correct.

It looks like the bottleneck is in the “process” step, which runs on the CPU. Did you raise the CPU clock with jetson_clock.sh?

MagicalFingers · March 8, 2018, 11:34am

Hi WayneWWW,
Yes, I raised the CPU clock with jetson_clock.sh.
The output of tegrastats is as follows:

RAM 1317/3995MB (lfb 421x4MB) cpu [0%,0%,0%,0%]@1734 EMC 14%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 19%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [23%,24%,12%,77%]@1734 EMC 14%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 29%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [8%,29%,8%,95%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 29%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [4%,33%,6%,94%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 29%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [0%,3%,30%,96%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 25%@998 EDP limit 1734
RAM 1317/3995MB (lfb 421x4MB) cpu [24%,9%,5%,95%]@1734 EMC 13%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 21%@998 EDP limit 1734

However, the Tegra still spends 30.063 ms in the process step.
When I replace the body of the “add” function with “memcpy(dst, src, width*height*4)”, the fps with the PBO is still lower than without it.
I need this memcpy in my project. How can I improve the fps?

WayneWWW · March 9, 2018, 2:43am

Could you share the time elapsed for “add” or “memcpy” only? Let’s separate the Bind and Mapping function time.

I wonder whether the time spent on data synchronization between the PBOs is causing the problem.
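
One way to separate these times is to accumulate the elapsed time per stage. The following is a minimal sketch using std::chrono; StageTimer is a hypothetical helper, not part of the posted code, and it only measures CPU-side wall-clock time.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <string>

// Hypothetical helper: accumulates wall-clock time per named stage.
class StageTimer {
public:
    // Run fn and add its elapsed time in milliseconds to the stage's total.
    void time(const std::string& stage, const std::function<void()>& fn) {
        assert(fn && "stage callback must be callable");
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        totalsMs_[stage] +=
            std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    // Total time recorded for a stage so far (0 if the stage never ran).
    double totalMs(const std::string& stage) const {
        const auto it = totalsMs_.find(stage);
        return it == totalsMs_.end() ? 0.0 : it->second;
    }

private:
    std::map<std::string, double> totalsMs_;
};
```

In the PBO branch, each of glBindBufferARB, glMapBufferARB, add or memcpy, and glUnmapBufferARB could be wrapped in its own time(...) call. If the map call has to wait for a pending glReadPixels transfer, that wait would then show up under the map stage rather than under the copy.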

869422100 · August 16, 2023, 2:09am

Hi WayneWWW,

We encountered a similar issue on the TX2 NX. We did not see it on the Xavier NX, where operations on mapped addresses and on normal addresses take about the same time. We conducted two tests:

We used mmap to map the kernel address of the camera to the user address, like ptr=mmap(NULL, len , PROT_READ|PROT_WRITE, MAP_SHARED , fd , 0), then conducted a memcpy test on 7 sets of 1780x720x2 yuyv data. The results are as follows:

mmap user address1 → user address2: 30.3ms
Normal user address1 → user address2: 5.6ms

We used OpenGL’s glMapBuffer to map the GPU address to the CPU address for 1 set of 1920x1080x4 data, like GLubyte *ptr = (GLubyte *)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY), then conducted a memcpy test. The results are as follows:

glMapBuffer CPU address1 → CPU address2: 16.3ms
Normal CPU address1 → CPU address2: 3.9ms

Memcpy should not be so slow, or maybe there is something wrong with my usage. How can I optimize it?
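
To make the two tests comparable, the timings can be converted into an effective copy bandwidth. A small sketch, where copyMBps is a hypothetical helper and 1 MB is taken as 10^6 bytes:

```cpp
#include <cassert>
#include <cstddef>

// Effective copy bandwidth in MB/s (1 MB = 1e6 bytes)
// for `bytes` copied in `ms` milliseconds.
inline double copyMBps(std::size_t bytes, double ms) {
    assert(ms > 0.0);
    return (static_cast<double>(bytes) / 1e6) / (ms / 1e3);
}
```

For the camera test, 7 x 1780 x 720 x 2 = 17,942,400 bytes, which gives roughly 592 MB/s from the mmap address versus roughly 3204 MB/s from a normal address. For the glMapBuffer test, 1920 x 1080 x 4 = 8,294,400 bytes, which gives roughly 509 MB/s from the mapped address versus roughly 2127 MB/s from a normal address. In both cases reading from the mapped address is about 4 to 5 times slower.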

869422100 · August 16, 2023, 2:32am

According to our tests, binding and mapping are almost instantaneous.
