Ray tracing: GLSL vs CUDA/OpenCL

Hello everybody!

I have been asked to implement an efficient ray tracing algorithm in my app.

After some surfing on the web and some thinking, I have identified two hardware-accelerated ways to do this:

  1. Parallel implementation using CUDA/OpenCL
  2. Implementation based on the OpenGL Shading Language (GLSL). An example can be found here: http://www.clockworkcoders.com/oglsl/rt/index.html

Alternative #1 is truly parallel, so I believe it would let me make the most of modern GPUs.
As for alternative #2, I am not sure whether the graphics driver parallelizes it internally. I guess it might, since the calculations are independent for each pixel…

Does anybody know whether shaders are executed in parallel on several pieces of input data?
I am sorry if this is a stupid question, but I need to know this in order to make a well-grounded decision.

Thanks a lot in advance for answering! :)

Pixel shaders are executed in parallel, but it’s much easier to write an efficient ray tracer in CUDA.
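Both approaches parallelize because the primary pass of a ray tracer is embarrassingly parallel: each pixel's colour depends only on that pixel's own coordinates. Below is a minimal C++ sketch of that structure, assuming a pinhole camera at the origin looking down -z (90-degree vertical field of view) and a single unit sphere at z = -3. All names (Vec3, primaryRay, hitSphere, shadePixel, render) are hypothetical. shadePixel plays the role of a fragment shader body or of one CUDA thread, and the serial loop in render stands in for the GPU launch.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal 3-component vector for this sketch.
struct Vec3 {
    double x, y, z;
};

Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 normalize(Vec3 v) {
    double len = std::sqrt(dot(v, v));
    return {v.x / len, v.y / len, v.z / len};
}

// Direction of the primary ray through the centre of pixel (px, py) for an
// assumed pinhole camera at the origin looking down -z, 90-degree vertical FOV.
Vec3 primaryRay(int px, int py, int width, int height) {
    assert(width > 0 && height > 0);
    double aspect = double(width) / height;
    double u = (2.0 * (px + 0.5) / width - 1.0) * aspect;
    double v = 1.0 - 2.0 * (py + 0.5) / height;
    return normalize({u, v, -1.0});
}

// Solves |origin + t*dir - center|^2 = radius^2 for a normalized dir.
// Returns the near root if it lies in front of the origin, otherwise -1.
double hitSphere(Vec3 origin, Vec3 dir, Vec3 center, double radius) {
    Vec3 oc = sub(origin, center);
    double b = dot(oc, dir);
    double c = dot(oc, oc) - radius * radius;
    double disc = b * b - c;
    if (disc < 0.0) return -1.0;
    double t = -b - std::sqrt(disc);
    return t > 0.0 ? t : -1.0;
}

// The per-pixel "kernel": uses only its own coordinates and never reads
// another pixel's result, so pixels can be evaluated in any order or at once.
float shadePixel(int px, int py, int width, int height) {
    Vec3 dir = primaryRay(px, py, width, height);
    double t = hitSphere({0, 0, 0}, dir, {0, 0, -3}, 1.0);
    return t > 0.0 ? 1.0f : 0.0f;  // white on hit, black on miss
}

// Serial driver standing in for the GPU launch (one shadePixel per pixel).
std::vector<float> render(int width, int height) {
    std::vector<float> image(width * height);
    for (int py = 0; py < height; ++py)
        for (int px = 0; px < width; ++px)
            image[py * width + px] = shadePixel(px, py, width, height);
    return image;
}
```

Because no call to shadePixel depends on another pixel, the graphics driver (for GLSL) or the CUDA scheduler is free to run them concurrently; render(64, 64) produces the same image regardless of evaluation order.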

Try this:

Thanks a lot!
This is more or less how I explained it to myself.

Is it correct to say that the number of CUDA cores in a given GPU is the same as the number of shader processors?
Is the amount of work done in parallel approximately the same for both approaches (GLSL and CUDA/OpenCL)?

Both shaders and CUDA use almost entirely the same hardware on the chip. Shader languages provide an abstraction that makes your code portable to any 3D device, but by replacing the abstraction with one that brings you closer to the hardware, you can sometimes write more efficient code. This is especially true if you need a feature that CUDA directly exposes to the developer, like shared memory.

No, the memory traffic can be much higher. I compared my blob-analysis code (connected-component labeling is its main subroutine) against several other implementations. One of them was written in GLSL by a master’s student at LSU, and I believe its performance was more than 100x slower.
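For readers unfamiliar with the multi-pass style discussed here, this is a rough C++ sketch of connected-component labeling by iterative min-label propagation (4-connectivity). It is a generic illustration of the technique, not the student's code, which isn't shown in this thread; labelByPropagation and Labeling are hypothetical names. Each loop iteration reads the entire previous label image and writes a complete new one, which is what a fragment shader has to do when it ping-pongs between two textures.

```cpp
#include <cassert>
#include <vector>

// One label per pixel (-1 for background) plus the number of full-image passes.
struct Labeling {
    std::vector<int> labels;
    int passes;
};

// Multi-pass labeling: every foreground pixel repeatedly takes the smallest
// label among itself and its 4 neighbours, using only the previous pass's data.
Labeling labelByPropagation(const std::vector<int>& mask, int width, int height) {
    assert(static_cast<int>(mask.size()) == width * height);
    const int dx[4] = {1, -1, 0, 0};
    const int dy[4] = {0, 0, 1, -1};

    // Initial label: each foreground pixel gets its own linear index.
    std::vector<int> cur(mask.size());
    for (int i = 0; i < width * height; ++i)
        cur[i] = mask[i] ? i : -1;

    int passes = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        std::vector<int> next = cur;  // the "output texture" of this pass
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                int i = y * width + x;
                if (cur[i] < 0) continue;  // background pixel
                int best = cur[i];
                for (int k = 0; k < 4; ++k) {
                    int nx = x + dx[k], ny = y + dy[k];
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                    int n = cur[ny * width + nx];
                    if (n >= 0 && n < best) best = n;
                }
                if (best != cur[i]) {
                    next[i] = best;
                    changed = true;
                }
            }
        }
        cur.swap(next);  // next pass reads what this pass wrote
        ++passes;
    }
    return {cur, passes};
}
```

On the 5x1 row {1, 1, 1, 0, 1} it takes three passes (the last one only confirms that nothing changed). The pass count grows with how far the smallest label has to travel through a component, so long, winding blobs mean many full reads and writes of the image.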

The main problem seemed to be that he had to make multiple passes over the input, writing each stage’s output back to memory and reading it again in the next stage, instead of doing as much as possible in one function before writing the result out. He even had to run a shader just to generate the x and y coordinates of each pixel, while CUDA provides them directly through threadIdx and blockIdx.
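To illustrate the indexing point: in a CUDA kernel each thread computes its pixel coordinates from the built-in variables blockIdx, blockDim and threadIdx, so no separate pass is needed to produce them. The C++ sketch below emulates a 2D launch serially to check that the usual formula, with the grid rounded up and a bounds check, visits every pixel exactly once. Dim2 and coverage are hypothetical names, and the 16x16 block size is an arbitrary choice.

```cpp
#include <cassert>
#include <vector>

// Stand-in for CUDA's built-in dim3 type in this CPU emulation.
struct Dim2 {
    int x, y;
};

// Per thread, a CUDA kernel would compute
//   x = blockIdx.x * blockDim.x + threadIdx.x
//   y = blockIdx.y * blockDim.y + threadIdx.y
// Here the launch is emulated with loops, counting visits per pixel.
std::vector<int> coverage(int width, int height, Dim2 blockDim) {
    assert(blockDim.x > 0 && blockDim.y > 0);
    // Round the grid up so partial blocks cover the right and bottom edges.
    Dim2 gridDim = {(width + blockDim.x - 1) / blockDim.x,
                    (height + blockDim.y - 1) / blockDim.y};
    std::vector<int> visits(width * height, 0);
    for (int by = 0; by < gridDim.y; ++by)
        for (int bx = 0; bx < gridDim.x; ++bx)
            for (int ty = 0; ty < blockDim.y; ++ty)
                for (int tx = 0; tx < blockDim.x; ++tx) {
                    int x = bx * blockDim.x + tx;
                    int y = by * blockDim.y + ty;
                    // Threads past the image edge do nothing (the kernel's bounds check).
                    if (x >= width || y >= height) continue;
                    ++visits[y * width + x];
                }
    return visits;
}
```

coverage(100, 70, {16, 16}) emulates a 7x5 grid of blocks; the threads that fall past the right and bottom edges skip the work, and every entry of the returned vector is 1.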

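This simply means that he doesn’t know the shading language at all. For example, a GLSL fragment shader has the built-in variable gl_TexCoord[0].xy, which gives the coordinates of the current pixel as interpolated texture coordinates, and gl_FragCoord.xy gives its window-space pixel coordinates directly.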
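In legacy GLSL, gl_TexCoord[0] carries whatever texture coordinates the vertex stage passed down. Assuming the common setup of a full-screen quad textured from 0 to 1, a fragment at pixel (x, y) sees ((x + 0.5) / width, (y + 0.5) / height), because fragments are evaluated at pixel centres, so recovering integer pixel indices is just a multiply and a floor. A small C++ sketch of that mapping (TexCoord, Pixel and both functions are hypothetical helpers):

```cpp
#include <cassert>
#include <cmath>

struct Pixel {
    int x, y;
};

struct TexCoord {
    double s, t;
};

// Normalised coordinate at the centre of pixel p on a width x height target,
// i.e. what a 0..1 texture coordinate interpolates to there (assumed setup).
TexCoord pixelToTexCoord(Pixel p, int width, int height) {
    return {(p.x + 0.5) / width, (p.y + 0.5) / height};
}

// Inverse mapping: scale back to pixel units and drop the +0.5 offset via floor.
Pixel texCoordToPixel(TexCoord tc, int width, int height) {
    assert(width > 0 && height > 0);
    return {static_cast<int>(std::floor(tc.s * width)),
            static_cast<int>(std::floor(tc.t * height))};
}
```

The round trip recovers the original indices for every pixel of, say, a 640x480 target.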

Here’s the link: GPU connected component labeling

Like you said, the author isn’t a GLSL expert, since on a 9800GTX his code only barely matches the speed of OpenCV on a dual-core CPU. But there are other inefficiencies: GLSL doesn’t let you use shared memory or synchronize threads, which probably means more global memory traffic.
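A back-of-the-envelope way to see why the extra passes hurt: if every pass reads the whole image once and writes it once, global memory traffic grows linearly with the number of passes. The C++ sketch below encodes that simplified model (it ignores caches and any data reuse); bytesMoved is a hypothetical helper, and the 1024x1024, 4-byte, 10-pass figures in the note are illustrative, not measurements from this thread.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model: each pass reads the full image once and writes it once.
// Caches, texture fetch patterns and on-chip reuse are deliberately ignored.
std::uint64_t bytesMoved(std::uint64_t width, std::uint64_t height,
                         std::uint64_t bytesPerPixel, std::uint64_t passes) {
    const std::uint64_t imageBytes = width * height * bytesPerPixel;
    return passes * 2 * imageBytes;  // 1 read + 1 write per pass
}
```

For a 1024x1024 image with 4 bytes per pixel, one fused pass moves 8,388,608 bytes, while ten separate passes move 83,886,080 bytes. Keeping intermediate results in CUDA shared memory between steps is one way to avoid some of those round trips.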