I am wondering whether there is any fast way (intrinsics, exposed hardware support) to do write masking or write combining in the context of blending.
Intel x86 has an SSE instruction, MASKMOVQ, which writes 8 bytes under a byte mask; its SSE2 counterpart, MASKMOVDQU, does the same for 16 bytes. Such a feature would also be useful on the GPU. In fact, given the nature of the GPU, a write that performs the blend combination itself would be ideal. Manually reading the destination buffer and writing back a combined result, or evaluating a condition per write, is undesirable. It is also often desirable to treat the destination buffer as write-only and uncached.
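To make the CPU-side behaviour concrete, here is a minimal sketch of the kind of masked write I mean, assuming an x86 target with SSE2. It uses the `_mm_maskmoveu_si128` intrinsic (the 16-byte MASKMOVDQU form); the helper names `masked_store16` and `manual_masked_store16` are made up for illustration. The second function is the manual read-and-combine path I would like to avoid.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_maskmoveu_si128 */
#include <stdint.h>

/* Hypothetical helper: stores src[i] to dst[i] only where the high bit of
 * mask[i] is set. The destination is never read by this code. */
static void masked_store16(uint8_t *dst, const uint8_t *src, const uint8_t *mask)
{
    assert(dst && src && mask);
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    __m128i m = _mm_loadu_si128((const __m128i *)mask);
    _mm_maskmoveu_si128(v, m, (char *)dst);  /* byte-masked, non-temporal store */
    _mm_sfence();  /* order the non-temporal store before later accesses */
}

/* The manual alternative: a per-byte condition that reads the destination
 * and writes back a combined result. */
static void manual_masked_store16(uint8_t *dst, const uint8_t *src, const uint8_t *mask)
{
    assert(dst && src && mask);
    for (int i = 0; i < 16; i++) {
        dst[i] = (mask[i] & 0x80) ? src[i] : dst[i];
    }
}
```

Both functions should leave the same bytes in `dst`; the difference is that the first never loads the destination, which is the property I am hoping to get on the GPU.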