How to improve the efficiency of readpixels on Tegra?

I used pbo to improve efficiency.

glReadBuffer(GL_FRONT);
    if(pboUsed) // with PBO
    {
        // read framebuffer ///////////////////////////////
        t1.start();
        // copy pixels from framebuffer to PBO
        // Use offset instead of ponter.
        // OpenGL should perform asynch DMA transfer, so glReadPixels() will return immediately.
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[index]);
        glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, 0);
        // measure the time reading framebuffer
        t1.stop();
        readTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
        // process pixel data /////////////////////////////
        t1.start();
        // map the PBO that contain framebuffer pixels before processing it
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[nextIndex]);
        GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
        if(src)
        {
            // change brightness
            add(src, SCREEN_WIDTH, SCREEN_HEIGHT, shift, colorBuffer);
            glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);     // release pointer to the mapped buffer
        }
        // measure the time reading framebuffer
        t1.stop();
        processTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
    }

    else        // without PBO
    {
        // read framebuffer ///////////////////////////////
        t1.start();

        glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, colorBuffer);

        // measure the time reading framebuffer
        t1.stop();
        readTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////

        // covert to greyscale ////////////////////////////
        t1.start();

        // change brightness
        add(colorBuffer, SCREEN_WIDTH, SCREEN_HEIGHT, shift, colorBuffer);

        // measure the time reading framebuffer
        t1.stop();
        processTime = t1.getElapsedTimeInMilliSec();
        ///////////////////////////////////////////////////
    }

The source code is uploaded in the compressed package below.

PBO ON:
x86:0.079ms to readpixels , 1.9ms to process.
59fps,(GeForce GTX750 Ti)

Tegra:0.125ms to readpixels , 30.063ms to process.
30fps,(tx1)

PBO OFF:
x86:15.346ms to readpixels , 0.565ms to process.
59fps,(GeForce GTX750 Ti)

Tegra:0.365ms to readpixels , 2.102ms to process.
245fps,(tx1)

It is slower when I use fpo to send datas on Tegra.
How to improve the efficiency of readpixels on Tegra?

pboPack.zip (610 KB)