Why can't we have a blending engine?

One of the main issues that narrows down the available applications of CUDA is the lack of atomic floating-point support. Even on the GTX 280, we don't have a simple atomic floating-point add. Yet for people working with traditional GPGPU, most problems that need atomic floating point can be solved optimally with the blending function, and even old cards have a blending engine. The question is why we have to pay a lot of money for an expensive card while we cannot use even such a simple function.

CPUs have an FPU to support floating-point operations, and even SSE to support vector calculation, so why can't we have a blending unit to support atomic floating-point operations?

I was very upset when I spent a lot of time optimizing my CUDA code, which includes a very fast sorting and segmented-sum function (which I believe is the fastest available out there), only to find that it is 4 times slower than very simple, straightforward DirectX code, and even slower when visualization is involved, due to the inefficient OpenGL/DirectX interoperation. I've lost sight of why we need CUDA, and why we spend more time debugging and optimizing code while a much better and simpler solution is out there.

There are ways to do atomic floats in shared memory:
http://forums.nvidia.com/index.php?showtopic=72925
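The trick discussed in that thread is to emulate a float `atomicAdd` with integer `atomicCAS`. A minimal sketch (the function name is made up; assumes a card with atomicCAS support, i.e. compute capability 1.1+ for global memory):

```cuda
// Emulated atomic float add via compare-and-swap: reinterpret the
// float's bits as an int, and keep retrying until no other thread
// modified the word between our read and our swap.
__device__ float atomicFloatAdd(float *addr, float val)
{
    int *iaddr = (int *)addr;
    int old = *iaddr, assumed;
    do {
        assumed = old;
        float sum = __int_as_float(assumed) + val;
        // Swap in the new bits only if *addr still holds 'assumed'.
        old = atomicCAS(iaddr, assumed, __float_as_int(sum));
    } while (assumed != old);
    return __int_as_float(old);  // value before the add, like atomicAdd
}
```

Note the loop spins under contention, which is exactly why this is much slower than a hardware blending unit when many threads hit the same address.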

CUDA can indeed be frustrating, because you have to change your whole design perspective. In most cases where your bottleneck is atomics, you have to rearrange things to be a reduction or prefix-sum operation instead.

You're also right that we'd all love access to the blending ALU. It's possible to reach it now by sampling from a texture with interpolation, but that is a rather roundabout way to use it.

Thank you. The code looks great, but I'm pretty sure it is far from optimal blending. Have you ever tried it and compared it with the segmented-prefix-and-sort approach? Or compared the performance of your atomic float function with the blending engine?

If you have a lot of data, a log2 reduction is almost always the right way to handle it. Atomics are much less efficient when many threads have data: they take time proportional to the number of threads with data.

Where atomics are useful is when just a few threads may have information and you just need to make sure they get serviced without collision.

Can you tell me more about the log2 reduction technique? Is that the technique mentioned in the reduction example? Most of the time we don't know the exact number of elements; can we use something similar to perform splatting or general histogram calculation?

Right, the reduction example is pretty well documented. (The prefix-sum one is also useful to understand.)
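For reference, the prefix-sum pattern in its simplest single-block form looks something like this (a Hillis–Steele sketch, not the work-efficient version in the SDK; `N` and the kernel name are made up):

```cuda
#define N 256  // one block of N threads, N a power of two, for simplicity

// Inclusive scan: g_out[i] = g_in[0] + ... + g_in[i], in log2(N) steps.
// Launch as scanInclusive<<<1, N>>>(d_in, d_out).
__global__ void scanInclusive(const float *g_in, float *g_out)
{
    __shared__ float temp[N];
    int tid = threadIdx.x;
    temp[tid] = g_in[tid];
    __syncthreads();

    for (int offset = 1; offset < N; offset *= 2) {
        // Read the partner value before anyone overwrites it.
        float t = (tid >= offset) ? temp[tid - offset] : 0.0f;
        __syncthreads();
        temp[tid] += t;
        __syncthreads();
    }
    g_out[tid] = temp[tid];
}
```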

The idea is that if you're boiling down an entire array of data into a single value (by adding the elements, or finding the max, or whatever), you can do it in log2 steps… each step combines two elements at a time, the next iteration combines two of THOSE results, and so on until the whole dataset has been condensed into your single-value answer.

Reduction can work with a variable number of elements too.
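The steps above can be sketched as a kernel (simplified from the SDK reduction example; `BLOCK` and the kernel name are assumptions):

```cuda
#define BLOCK 256

// Tree reduction: each block sums BLOCK elements of g_in and writes one
// partial sum to g_out. Run it again on g_out (or finish the last few
// partials on the CPU) until a single value remains. Handles n that is
// not a multiple of BLOCK by padding with 0.
__global__ void reduceSum(const float *g_in, float *g_out, int n)
{
    __shared__ float sdata[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * BLOCK + tid;

    sdata[tid] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    // Combine pairs: BLOCK -> BLOCK/2 -> ... -> 1, log2(BLOCK) steps.
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        g_out[blockIdx.x] = sdata[0];
}
```

Launch with one block per BLOCK elements (`gridDim.x = ceil(n / BLOCK)`); the variable element count is handled by the `i < n` guard.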

For histogram, you may look at the histogram CUDA project example. There’s also a thread here somewhere about increasing the parallelism of a 256 wide image histogram.

For splatting… I’m not sure what you mean, a sparse histogram that has to deal with duplicates, or 3D rendering splatting.

Someone mentioned that the atomic operations during creation of the histogram cause slowdowns if there are many conflicts (say, lots of pixels with identical values). This translates to lots of atomic writes to the same memory address.

I propose having 8 partial histograms, where threadIdx.x % 8 determines the array to write to. Since 8 threads always execute in parallel on the chip, this eliminates those conflicts by giving each thread a different location to write to. In the end you would have to reduce the 8 arrays into one. That could be done either on the CPU or in a second kernel call.
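A variant of this partial-histogram idea that sidesteps the question of how many threads run at once: give each block its own sub-histogram in shared memory, and fold it into the global histogram at the end of the same kernel. A sketch (assumes 8-bit data, shared-memory atomics, i.e. compute capability 1.2+; names are made up):

```cuda
#define NUM_BINS 256

// Pass 1 and the reduction in one kernel: each block counts into its
// own shared-memory sub-histogram, so atomic conflicts are confined to
// one block, then adds its counts to the global histogram exactly once.
__global__ void blockHist(const unsigned char *data, int n,
                          unsigned int *g_hist)
{
    __shared__ unsigned int s_hist[NUM_BINS];
    int tid = threadIdx.x;

    for (int b = tid; b < NUM_BINS; b += blockDim.x)
        s_hist[b] = 0;
    __syncthreads();

    // Grid-stride loop over the input; shared-memory atomics here.
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&s_hist[data[i]], 1u);
    __syncthreads();

    // Fold this block's sub-histogram into the global result.
    for (int b = tid; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&g_hist[b], s_hist[b]);
}
```

Because the sub-histograms live and die within one kernel launch, the shared-memory persistence problem raised below never arises.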

Christian

uhhhh… you are way, way off with the number of threads active on the chip at once, unless you’re talking about shared memory (even then you should be more concerned about a half warp (16 threads) instead of eight threads). But then, no, that can’t be done with a second kernel call because shared memory is not persistent across kernels.

I meant just the active threads per SM per clock cycle… You're probably right about the half warp where writes to global memory are concerned. Anyway, back to the drawing board for me.

Can I somehow determine at run time on which of the SMs a block is executing?

Nope, you cannot.

I can see the blending engine as a very effective way to solve the general histogram problem. I saw a paper about general histogram calculation that extends the 256-bin example in the CUDA SDK; it requires multiple passes over the range, which of course is a hack and far from optimal. 256 bins are not enough for many applications, for example editing a transfer function, and we cannot use the same strategy for a 2D histogram.

I cannot come up with any idea that effectively solves the histogram problem in CUDA, even though it is a very basic image-processing function, while most image-processing functions can be solved effectively and quite straightforwardly with traditional GPGPU.