One of the main issues that narrows the range of applications in CUDA is the lack of atomic floating-point support. Even on the GTX 280, there is no simple atomic floating-point add. Yet for people working in traditional GPGPU, most problems that need atomic floating-point operations can be solved efficiently with blending, and even old cards have a blending engine. The question is why we have to pay a lot of money for an expensive card and still cannot use such a simple function.
CPUs have an FPU to support floating-point math, and even SSE to support vector calculations, so why can't GPUs expose the blending unit to support atomic floating-point operations?
I was very upset when, after spending a lot of time optimizing my CUDA code, which includes a very fast sorting routine and a segmented-sum function (which I believe is the fastest available out there), it turned out to be four times slower than very simple, straightforward DirectX code, and even slower when visualization is involved, due to the inefficient OpenGL/DirectX interoperation. I have lost sight of why we need CUDA, why we spend more time debugging and optimizing code when a much better and simpler solution is out there.