Efficient implementation of bitwise majority-of-N operations for N in {3, 5, 7, 9}

The ternary majority-of-3 operation ⟨xyz⟩, also known as the median-of-3 operation, returns the bit value of the majority of its inputs, that is

x  y  z | res
--------+----
0  0  0 |  0
0  0  1 |  0
0  1  0 |  0
0  1  1 |  1
1  0  0 |  0
1  0  1 |  1
1  1  0 |  1
1  1  1 |  1

As Knuth points out in TAOCP Vol. 4a, since x ∧ y = ⟨x0y⟩ and x ∨ y = ⟨x1y⟩, one can compose any monotone Boolean function from the majority-of-3 operation and the constants 0 and 1. If one adds NOT this extends to fully generalized Boolean functions. This is of more than theoretical significance. For example, it turns out that the majority-gate is a useful building block in quantum-dot cellular automata, which are being researched as a potential follow-on to currently used CMOS technology.

In particular, one can implement majority-of-N functionality via a majority-of-3 building block. How to do that in an optimal fashion, i.e. while using the minimal number of majority-of-3 building blocks, is only known with certainty for majority-of-5 and majority-of-7 operations, and an open research problem for larger N. While a quite efficient construction of the majority-of-9 operation has been demonstrated that uses thirteen majority-of-3 building blocks, it is currently not known whether this is optimal.

A recent question of Stackoverflow inquired how to combine 3, 5, 7, or 9 source bitstreams of considerable length into a single bitstream that records the bitwise majority across the input streams. This task seems well suited to modern GPUs (compute capability >= 5.0) that support the LOP3 instruction, as each majority-of-3 operation can be mapped to a single LOP3 instruction.

As it turns out, using regular logical operations (~, &, |) to specify the majority-of-3 operation leads to quite unsatisfactory results with the CUDA compiler, in that it generates too many LOP3 instructions. For an efficient implementation it is therefore currently necessary to directly code the majority-of-3 operation using inline assembly when using it to implement the majority-of-N functionality, as demonstrated below.

[ code below updated 12/21/2023 ]

/*
  Copyright (c) 2023, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY ERFCRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* bitwise majority of 3 */
__device__ uint32_t maj3 (uint32_t a, uint32_t b, uint32_t c)
{
#if (__CUDA_ARCH__ >= 500)
    uint32_t r;
    asm ("lop3.b32 %0,%1,%2,%3,0xe8;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
#else // __CUDA_ARCH__
    return (((b & c) | a) & (b | c));
#endif // __CUDA_ARCH__
}

/* bitwise majority of 5 */
__device__ uint32_t maj5 (uint32_t a, uint32_t b, uint32_t c, uint32_t d, 
                          uint32_t e)
{
    /* Knuth, TAOCP Vol. 4a, p. 64 */
    return maj3 (a, maj3 (c, d, e), maj3 (b, c, maj3 (b, d, e)));
}

/* bitwise majority of 7 */
__device__ uint32_t maj7 (uint32_t a, uint32_t b, uint32_t c, uint32_t d, 
                          uint32_t e, uint32_t f, uint32_t g) 
{ 
    /* Eleonora Testa, et al., "Mapping Monotone Boolean Functions into Majority",
       IEEE Transactions on Computers, Vol. 68, No. 5, May 2019, pp. 791-797. 
    */
    uint32_t s = maj3 (a, c, d);
    uint32_t t = maj3 (e, f, g);
    return maj3 (b, maj3 (e, s, maj3 (f, g, s)), maj3 (d, t, maj3 (a, c, t)));
}

/* bitwise majority of 9 */
__device__ uint32_t maj9 (uint32_t a, uint32_t b, uint32_t c, uint32_t d, 
                          uint32_t e, uint32_t f, uint32_t g, uint32_t h, 
                          uint32_t i)
{
    /* 
      Thomas Häner, Damian S. Steiger, Helmut G. Katzgraber, "Parallel Tempering
      for Logic Synthesis." arXiv.2311.12394, Nov. 21, 2023
    */
    uint32_t r = maj3 (g, d, c);
    uint32_t s = maj3 (g, e, b);
    uint32_t t = maj3 (i, f, a);
    uint32_t u = maj3 (r, s, h);
    uint32_t v = maj3 (d, h, t);
    uint32_t w = maj3 (c, d, h);
    uint32_t x = maj3 (i, a, u);
    uint32_t y = maj3 (c, v, t);
    uint32_t z = maj3 (y, e, g);
    return maj3 (maj3 (x, u, f), maj3 (z, b, y), maj3 (s, w, t));
}
3 Likes

Do you know why it is hard to proof the minimum number of maj3 operations? I guess there is a combinatorial explosion when trying to brute force it for the value (min-1). Is finding the optimal number an np-hard problem?

Honestly I have not looked into it. NP-hard sounds about right, though. Häner et al. examined majority-of-11 and majority-of-13 as well and from their published numbers the processing time of their solver explodes in exponential fashion. Their paper seems definitely worth a read and I would expect it to be published in a “proper” venue soon, but I basically went straight to the results section to get my coding done.

I would assume the median-of-N decomposition problem is as hard as finding provably optimal solution to sorting networks, where progress has likewise been glacial since TAOCP was first published, although the upper and lower bounds for that problem have been slowly converging for N < 20.

If I recall correctly from my quick perusal of the literature, the currently established lower bound for a majority-of-9 implementation is ten median-of-3 instances while here we have an existing solution using thirteen.

1 Like

An interesting follow-up question is: How much more efficiently could one represent majority-of-N operations when one decomposes them into arbitrary LOP3 instances?

The issue of mapping logic to LUTs has been researched extensively for the past 30 years in the context of mapping logic into FPGAs which are typically based on CLBs (configurable logic blocks), each of which comprises one or several LUTs. I tried to get a quick overview from the literature but reading a bunch of papers only made my head explode.

Given that a LUT-based ternary logic operation also exists in AVX512 (vpternlog{d|q}) and that Intel’s compilers also appear to be struggling to use that in the most efficient manner possible, one can only hope that some of the existing algorithms from the FPGA world get ported to general-purpose compilers soon.

If you want to play with maj5 and maj7, this project may be of interest.

It’s one I’ve had success with lop3 'ing conventional Sboxes. I just did a quick test on 10,000 iterations of maj5 and the best it could do was 6 LOP3 instructions, but perhaps if left for a longer period, it would trim it down.

On maj7 one iteration across 4 cores took about 30 seconds, so I only did a couple and they both used 13 LOP3’s.

This is the code I used to create the required sbox file for maj5:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

void main(void){

        uint i;
        uint32_t a;

        for(i = 0; i < 32; i++){
                a = __builtin_popcount(i);
                a = a > 2 ? 1 : 0;
                printf("%d ", a);
        }
}

This paper may be of interest on another algorithm generating LUT based solutions.

As I said, reading papers on logic synthesis is like drinking from the fire hose, and it makes my head hurt. There is probably a reason EDA companies employ armies of PhDs …

I tried playing with a freely available synthesis tool ABC from Berkeley that can target LUTs, and even for the majority-of-5 gate (also proposed as a basic building block for some nanotechnologies) it produces results nowhere near optimal. That is probably not just the fault of the tool but me supplying {suboptimal | incorrect} commands.

I think I will stop here. I do not have to solve every problem in the universe myself. Given that Intel and NVIDIA (and maybe others, I have not checked) introduced ternary logic ops based on LUTs a decade ago (with Xeon Phi and Maxwell, respectively), I think they should make an effort to allow software to exploit them fully.

I’m definitely with you there, a few orders of magnitude further back… : )

Enjoy the holidays.