The ternary majority-of-3 operation ⟨xyz⟩, also known as the median-of-3 operation, returns the bit value of the majority of its inputs, that is
x y z | res
--------+----
0 0 0 | 0
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 1
1 1 0 | 1
1 1 1 | 1
As Knuth points out in TAOCP Vol. 4a, since x ∧ y = ⟨x0y⟩ and x ∨ y = ⟨x1y⟩, one can compose any monotone Boolean function from the majority-of-3 operation and the constants 0 and 1. If one adds NOT this extends to fully generalized Boolean functions. This is of more than theoretical significance. For example, it turns out that the majority-gate is a useful building block in quantum-dot cellular automata, which are being researched as a potential follow-on to currently used CMOS technology.
In particular, one can implement majority-of-N functionality via a majority-of-3 building block. How to do that in an optimal fashion, i.e. while using the minimal number of majority-of-3 building blocks, is only known with certainty for majority-of-5 and majority-of-7 operations, and an open research problem for larger N. While a quite efficient construction of the majority-of-9 operation has been demonstrated that uses thirteen majority-of-3 building blocks, it is currently not known whether this is optimal.
A recent question of Stackoverflow inquired how to combine 3, 5, 7, or 9 source bitstreams of considerable length into a single bitstream that records the bitwise majority across the input streams. This task seems well suited to modern GPUs (compute capability >= 5.0) that support the LOP3
instruction, as each majority-of-3 operation can be mapped to a single LOP3
instruction.
As it turns out, using regular logical operations (~
, &
, |
) to specify the majority-of-3 operation leads to quite unsatisfactory results with the CUDA compiler, in that it generates too many LOP3
instructions. For an efficient implementation it is therefore currently necessary to directly code the majority-of-3 operation using inline assembly when using it to implement the majority-of-N functionality, as demonstrated below.
[ code below updated 12/21/2023 ]
/*
Copyright (c) 2023, Norbert Juffa
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY ERFCRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/* bitwise majority of 3 */
__device__ uint32_t maj3 (uint32_t a, uint32_t b, uint32_t c)
{
#if (__CUDA_ARCH__ >= 500)
uint32_t r;
asm ("lop3.b32 %0,%1,%2,%3,0xe8;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
return r;
#else // __CUDA_ARCH__
return (((b & c) | a) & (b | c));
#endif // __CUDA_ARCH__
}
/* bitwise majority of 5 */
__device__ uint32_t maj5 (uint32_t a, uint32_t b, uint32_t c, uint32_t d,
uint32_t e)
{
/* Knuth, TAOCP Vol. 4a, p. 64 */
return maj3 (a, maj3 (c, d, e), maj3 (b, c, maj3 (b, d, e)));
}
/* bitwise majority of 7 */
__device__ uint32_t maj7 (uint32_t a, uint32_t b, uint32_t c, uint32_t d,
uint32_t e, uint32_t f, uint32_t g)
{
/* Eleonora Testa, et al., "Mapping Monotone Boolean Functions into Majority",
IEEE Transactions on Computers, Vol. 68, No. 5, May 2019, pp. 791-797.
*/
uint32_t s = maj3 (a, c, d);
uint32_t t = maj3 (e, f, g);
return maj3 (b, maj3 (e, s, maj3 (f, g, s)), maj3 (d, t, maj3 (a, c, t)));
}
/* bitwise majority of 9 */
__device__ uint32_t maj9 (uint32_t a, uint32_t b, uint32_t c, uint32_t d,
uint32_t e, uint32_t f, uint32_t g, uint32_t h,
uint32_t i)
{
/*
Thomas Häner, Damian S. Steiger, Helmut G. Katzgraber, "Parallel Tempering
for Logic Synthesis." arXiv.2311.12394, Nov. 21, 2023
*/
uint32_t r = maj3 (g, d, c);
uint32_t s = maj3 (g, e, b);
uint32_t t = maj3 (i, f, a);
uint32_t u = maj3 (r, s, h);
uint32_t v = maj3 (d, h, t);
uint32_t w = maj3 (c, d, h);
uint32_t x = maj3 (i, a, u);
uint32_t y = maj3 (c, v, t);
uint32_t z = maj3 (y, e, g);
return maj3 (maj3 (x, u, f), maj3 (z, b, y), maj3 (s, w, t));
}