Per-pixel rendering (post process OpenGL)


I have a task of per-pixel rendering of some physical effects with CUDA, something like a raytracer…

As a starting point I used the postProcessGL example by Simon (thanks a lot).

I have to say that Simon’s example runs slowly…

Every frame in this example we read from pbo_source, modify the data, write it to pbo_dest, and then swap the buffers.

I modified it so that, without pbo_source, I render an image into pbo_dest in real time. It is just simple sin/cos processing of the pixel colors with 1000 per-pixel iterations depending on x, y, and time.

But when frames change, the image twitches. I get about 10 FPS at 1024x768 threads with a 16x16 block size.

In the Cg + OpenGL version there is no frame twitching.

Maybe I have this issue because of the per-frame kernel calls with 1024x768 threads and per-frame device-to-host PBO copies? I am not experienced with OpenGL, but I thought PBOs reside in device memory on the G80, so there should be no large transfers between host and device.

How can I write directly into the device’s back framebuffer from a CUDA kernel?

Full test code:

__global__ void cudaProcess(int *g_odata, int imgw, int imgh, float time)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    float red   = 0;
    float green = 0;
    float blue  = 0;

    // calculate uv coordinates
    float u = x / float(imgw);
    float v = y / float(imgh);
    u = u * 2 * __sinf(__log2f(4 + __sinf(time)) * __cosf(time) * (1 - u) * __cosf(__sinf(time) * u)) - 1;
    v = v * 2 * __cosf(__log2f(u * u * v) * (1 - v));

    // calculate simple sine wave pattern
    float freq = 2.0f;
    for (int i = 0; i < 1000; i++)
    {
        time  += __cosf(__expf(u * u * v)) * __sinf(__cosf(v * i) / (__expf(v + u + i * i) * u));
        red   += fabsf(__sinf(u * freq + time) * 255.0f);   // fabsf, not abs, for floats in device code
        green += fabsf(__cosf(v * freq + time) * 255.0f);
        blue  += int(time) % 255;
    }

    // rgbToInt() packs the channels into one pixel, as in the postProcessGL sample
    g_odata[y * imgw + x] = rgbToInt(red, green, blue);
}


And where are the >300 GFlops? I estimate I am getting something like 10-30 GFlops… (10 FPS).

The rest of the code is like in Simon’s example, without reading into pbo_source.

There is no way to directly write to the framebuffer from CUDA; you need to write to mapped PBO memory like the postProcessGL sample does.
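For reference, the per-frame map/launch/unmap/draw cycle looks roughly like this. This is a sketch using the cudaGL interop API of that era (since deprecated in favor of `cudaGraphicsMapResources`); `pbo_dest`, `image_width`, `image_height`, and `time` are assumed to come from the surrounding host code, and the image dimensions are assumed to be multiples of the block size:

```cuda
// One frame: map the destination PBO into CUDA's address space,
// fill it from the kernel, unmap, then let OpenGL draw it.
int *out_data;
cudaGLMapBufferObject((void **)&out_data, pbo_dest);   // PBO was registered once at
                                                       // startup with cudaGLRegisterBufferObject()
dim3 block(16, 16, 1);
dim3 grid(image_width / block.x, image_height / block.y, 1);
cudaProcess<<<grid, block>>>(out_data, image_width, image_height, time);

cudaGLUnmapBufferObject(pbo_dest);                     // give the buffer back to OpenGL

// Draw the PBO contents to the screen (the sample uploads to a texture;
// glDrawPixels from a bound unpack PBO is a simpler alternative)
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo_dest);
glDrawPixels(image_width, image_height, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
```

The key point is that the buffer lives in device memory the whole time; mapping only translates the GL buffer object into a CUDA device pointer, so no host round-trip is required.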

How are you calculating the GFlops number? It looks like you’re doing about 15 flops in the inner loop.

15 * 1000 * 1024 * 768 * 10fps / 1e9 ≈ 118 GFlops.

The loop overhead in CUDA can be quite high; you would likely get much higher performance if you unroll the inner loop.
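In CUDA C the low-effort way to try this is nvcc’s `#pragma unroll` hint, sketched below on the loop from the kernel above (whether it actually helps depends on register pressure, so it is worth measuring):

```cuda
float freq = 2.0f;
#pragma unroll 8   // ask nvcc to replicate the loop body 8x, amortizing the
                   // per-iteration compare/increment/branch overhead
for (int i = 0; i < 1000; i++)
{
    time  += __cosf(__expf(u * u * v)) * __sinf(__cosf(v * i) / (__expf(v + u + i * i) * u));
    red   += fabsf(__sinf(u * freq + time) * 255.0f);
    green += fabsf(__cosf(v * freq + time) * 255.0f);
    blue  += int(time) % 255;
}
```

Because the trip count (1000) is a compile-time constant, the compiler can also unroll fully or partially on its own; the pragma just makes the request explicit.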

I don’t see any of the “twitching” you describe. Can you file a bug, attaching the code?

BTW, I like your pretty sine wave pattern :)

Well, I can’t imagine how I calculated the performance the other day :). Maybe I was a bit tired.
And there are 17 operations in the inner loop. Don’t forget i < 1000 and i++ :)

I can’t capture the twitching with a screenshot. But I’ll try to describe it in detail:
The image is smooth. But from about frame 550 the following happens: a horizontal border discontinuously appears and moves from top to bottom and from bottom to top. It looks like a border between frame i and frame i+1, where frame i lies below the border and frame i+1 above it. The border has no thickness.

When the pbo_source read operations are ON (as in your sample), there is no twitching. But when I turn them off, the twitching starts again.

The Cg + OpenGL version of the same program performs about the same, but there is no twitching there.

I would still like an answer about the other ~200 GFlops :). Are they lost to the inner loop and the pipeline length of the multiprocessors? The loop bound is constant…

Or maybe those 200 GFlops are lost to global memory operations (g_odata[y * imgw + x] = …)?

Thank you for support!
testing_sinwave.rar (789 KB)

Hi, do you have vsync disabled, by any chance? If so, you will see what’s known as “tearing” if your app frame rate is greater than the refresh rate of your monitor.

Enabling vsync will fix this, but it will cause the frame rate to be artificially limited to the monitor refresh rate.
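On Windows, vsync can also be toggled per application through the WGL_EXT_swap_control extension rather than the driver control panel. A minimal sketch, assuming the extension is available:

```c
/* Enable vsync via WGL_EXT_swap_control (Windows / OpenGL). */
typedef BOOL (APIENTRY *PFNWGLSWAPINTERVALEXTPROC)(int interval);

PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
    (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
if (wglSwapIntervalEXT)
    wglSwapIntervalEXT(1);   /* 1 = sync to vertical refresh, 0 = disable */
```

Note that the driver control panel setting can override whatever the application requests here.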


I have tested with vsync enabled and disabled, both globally and for my application only. The result is the same.

Only this trick in the OpenGL display function guarantees a smooth frame exchange:

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo_source);
glReadPixels(0, 0, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, NULL);

I don’t know why :). Without this trick I get the twitching in some cases. With the trick I have never noticed it.

Do you happen to be writing to a buffer while you are texturing from it?