OpenGL/DirectX vs. CUDA vs. a hybrid: which one takes the best advantage of the GPU hardware?

A question that has always confused me: which one can take the best advantage of the GPU hardware, OpenGL/DirectX, CUDA, or a hybrid of the two? (Let’s temporarily set aside which is the easiest to program.)

For example, one huge advantage of CUDA is that it gives full programmatic control over the shared memory in each multiprocessor. However, I am not sure whether CUDA can make as good use of the rest of the GPU’s hardware features as OpenGL/DirectX can. That is, is a significant amount of chip area dedicated to a hardwired 3D graphics rendering engine that is not accessible from CUDA, or at least not as efficiently as from OpenGL/DirectX?

I am picturing the whole chip area of a GPU as divided into three parts. Part A can be used by both CUDA and OpenGL/DirectX with similar efficiency. Part B can be used much more efficiently by OpenGL/DirectX. Part C can be used much more efficiently by CUDA. Is this understanding correct?

In addition, is there any architectural difference between a conventional CUDA-enabled GPU and a purpose-built GPGPU product such as the Tesla series?

Thanks a lot,


It probably depends on your application. CUDA can simply do more, so even where it is slower for elementwise kernels, that may not matter. Shaders give you no easy scatter ops and no __syncthreads(), so it would be difficult to take advantage of shared memory from a graphics API.
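
To make the shared memory and __syncthreads() point concrete, here is a minimal sketch of a per-block sum. The names blockSum, sumOnGpu, and kBlock are made up for illustration, the block size of 256 is an arbitrary power of two, and error handling is reduced to a single assert, so treat it as a rough outline rather than a tuned reduction.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 256;  // illustrative block size; must be a power of two

// Each block sums up to kBlock consecutive inputs through shared memory.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float tile[kBlock];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // every thread's load must land before anyone reads a neighbor

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();  // barrier between reduction steps
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical host helper: launches the kernel, then adds the
// per-block partial sums on the CPU.
float sumOnGpu(const std::vector<float>& host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kBlock - 1) / kBlock;

    float* dIn = nullptr;
    float* dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kBlock>>>(dIn, dPartial, n);
    assert(cudaGetLastError() == cudaSuccess);  // minimal launch check only

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The threads of a block cooperate through the tile array, and the barrier keeps them in step. That cooperation is the part with no straightforward equivalent in a pixel shader.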

The chip is “dual personality”; the same hardware runs CUDA and DirectX code. CUDA also has access to a lot of features like the texture units. I’m not sure about the rasterization though… I haven’t had to do anything like it, but I don’t recall any way to do this from CUDA.


There is no architectural difference in the GPU itself, only adjustments in the total memory, clock rate, and the presence or absence of video connectors. The original Tesla used the same GPU as the 8800 GTX, and the current Tesla uses the same GPU as the GTX 280 or 285. (The difference between the 280 and 285 is the fabrication process, 65 nm versus 55 nm. I’m not sure which one the Tesla C1060 uses.)

All 10-series Teslas are 55 nm.

The majority of the chip area these days is taken up by the shader units (multiprocessors), so you’re not missing that much from CUDA, mainly just rasterization. The compute mode of the chip does expose hardware such as shared memory and global load/store that is not directly available from the graphics APIs.

That said, there are still some applications that perform better using OpenGL / Direct3D; we don’t pretend CUDA is ideal for everything!
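
As a rough illustration of the global load/store mentioned above, here is a minimal scatter sketch in which each thread writes to an address it computes at run time. The names scatter and scatterOnGpu are hypothetical, and the code assumes dest is a permutation of 0..n-1, so there are no write collisions and no out-of-range indices.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Scatter: thread i stores in[i] at out[dest[i]], an arbitrary global address.
__global__ void scatter(const float* in, const int* dest, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[dest[i]] = in[i];
}

// Hypothetical host helper. Assumes dest is a permutation of 0..n-1
// and has the same length as in.
std::vector<float> scatterOnGpu(const std::vector<float>& in,
                                const std::vector<int>& dest) {
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    float* dIn = nullptr;
    float* dOut = nullptr;
    int* dDest = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dDest, n * sizeof(int));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dDest, dest.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 128;  // illustrative block size
    scatter<<<(n + threads - 1) / threads, threads>>>(dIn, dDest, dOut, n);
    assert(cudaGetLastError() == cudaSuccess);  // minimal launch check only

    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dDest);
    return out;
}
```

For example, scattering {10, 20, 30, 40} with dest = {3, 0, 2, 1} gives {20, 40, 30, 10}. In a classic fragment shader the output location is fixed by rasterization, so the same pattern needs workarounds.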

I understand this much better after all your replies. Thanks, everyone.