Write masks and combiners?

I am wondering if there is any fast way (intrinsics, exposed HW support) for write masking or write combining in the context of blending.

x86 has the SSE instruction MASKMOVQ, which writes 8 bytes under a byte mask (its SSE2 counterpart, MASKMOVDQU, handles 16 bytes). Such a feature would be useful on the GPU too. In fact, given the nature of the GPU, a write that combines with the destination by blending would be ideal. Manually reading the destination buffer and writing back a combined result, or evaluating a condition on every write, is undesirable. It is also often desirable to treat a destination buffer as write-only and uncached.

You may be thinking too hard in an “SSE intrinsics” mindset. In CUDA, it is extremely efficient to do something very straightforward like if (mymask) destination[index]=x;

You say “Oh no, but that’s just a single op in SSE!” but in CUDA this amounts to the same thing, since CUDA’s predicates act as a mask with tiny, or even zero, overhead. (Predicates are exposed at a lower level than CUDA itself, in PTX assembly, but that’s what the simple CUDA if() test will end up using.)
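To make the if() form concrete, here is a minimal sketch of a masked write. The kernel name masked_write, the host helper run_masked_write, and the one-byte-per-element mask layout are my own assumptions for illustration, not anything from the CUDA toolkit. Error checking is left out to keep it short. Note that dst is never read inside the kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread stores to dst only where its mask entry is nonzero.
// dst is never read, so it can be treated as write-only.
__global__ void masked_write(float* dst, const float* src,
                             const unsigned char* mask, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && mask[i]) {
        dst[i] = src[i];  // the compiler will typically predicate this store
    }
}

// Hypothetical host helper: uploads the inputs, runs the kernel, returns dst.
// CUDA error checking is omitted for brevity.
std::vector<float> run_masked_write(std::vector<float> dst,
                                    const std::vector<float>& src,
                                    const std::vector<unsigned char>& mask)
{
    int n = static_cast<int>(dst.size());
    if (n == 0) return dst;

    float* d_dst = nullptr;
    float* d_src = nullptr;
    unsigned char* d_mask = nullptr;
    cudaMalloc(&d_dst, n * sizeof(float));
    cudaMalloc(&d_src, n * sizeof(float));
    cudaMalloc(&d_mask, n * sizeof(unsigned char));

    cudaMemcpy(d_dst, dst.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_src, src.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_mask, mask.data(), n * sizeof(unsigned char),
               cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    masked_write<<<blocks, threads>>>(d_dst, d_src, d_mask, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(dst.data(), d_dst, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_dst);
    cudaFree(d_src);
    cudaFree(d_mask);
    return dst;
}
```

Calling run_masked_write with a zeroed dst, some src values, and a mask that alternates 1 and 0 should leave every other element of dst untouched. Whether the store really ends up predicated rather than branched is the compiler's decision, so inspect the generated PTX if it matters to you.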

So CUDA doesn’t usually need the sprawling collection of SSE instruction variants… you can just write code in human-readable if() form (or x>y ? a : b style), and CUDA will do the right thing, efficiently. Check the op throughput test program thread for examples which actually measure how cheap these easy-to-use implicit masks are.
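Applied to the byte-mask case from the original question, the same plain if() style might look like the sketch below. It assumes 4-byte pixels and a per-pixel mask whose low 4 bits select which channels to write. That layout, and the names channel_masked_write and run_channel_masked_write, are hypothetical. Byte-sized stores may well be slower than full-word stores, and I have not measured that here.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per 4-byte pixel. Bit c of chan_mask[i] enables the write of
// channel c. The destination is only written, never read.
__global__ void channel_masked_write(unsigned char* dst,
                                     const unsigned char* src,
                                     const unsigned char* chan_mask,
                                     int n_pixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pixels) return;

    unsigned char m = chan_mask[i];
    for (int c = 0; c < 4; ++c) {
        if (m & (1u << c)) {
            dst[4 * i + c] = src[4 * i + c];
        }
    }
}

// Hypothetical host helper; dst and src hold 4 bytes per pixel and
// chan_mask holds 1 byte per pixel. Error checking is omitted for brevity.
std::vector<unsigned char> run_channel_masked_write(
    std::vector<unsigned char> dst,
    const std::vector<unsigned char>& src,
    const std::vector<unsigned char>& chan_mask)
{
    int n_pixels = static_cast<int>(chan_mask.size());
    if (n_pixels == 0) return dst;
    size_t bytes = static_cast<size_t>(n_pixels) * 4;

    unsigned char* d_dst = nullptr;
    unsigned char* d_src = nullptr;
    unsigned char* d_mask = nullptr;
    cudaMalloc(&d_dst, bytes);
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_mask, n_pixels);

    cudaMemcpy(d_dst, dst.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_src, src.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_mask, chan_mask.data(), n_pixels, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n_pixels + threads - 1) / threads;
    channel_masked_write<<<blocks, threads>>>(d_dst, d_src, d_mask, n_pixels);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(dst.data(), d_dst, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_dst);
    cudaFree(d_src);
    cudaFree(d_mask);
    return dst;
}
```

With a mask of 0x5 for a pixel, only channels 0 and 2 of that pixel are overwritten and the other two bytes keep whatever the destination already held.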

What you are saying makes sense. I don’t have profiling figures to tell me whether my write-only and read-write buffers are bottlenecks anyway. I still wish I had a bunch of spare threads to assign to local vector operations inside (what is currently) a single thread.