Is it possible to use logic gates in warp level primitives?

I have two three-dimensional boolean arrays: the segmentation output of an algorithm (array A) and the gold-standard mask (array G). I pass both to a kernel and would like to compare them; in theory, logic gates could be the fastest solution.
I need to count true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). So, for some coordinate x’:
TP = A[x’]==1 && G[x’]==1 – hence AND gate
TN = A[x’]==0 && G[x’]==0 – hence NAND gate

FP = A[x’]==1 && G[x’]==0 --hence (1 AND A[x’]) AND ( 0 NAND G[x’] )
FN = A[x’]==0 && G[x’]==1 --hence (0 NAND A[x’]) AND ( 1 AND G[x’] )

Hence the output from such a warp should be a vector with 4 entries: [count(TP), count(TN), count(FP), count(FN)].
Is it possible, and does it make sense, to do this with warp-level primitives, or maybe in the kernel itself? There is very little data to exchange between threads, so warp-level synchronization seems better than block-level synchronization.

Later I would use sum-reduction kernel methods, maybe cooperative groups, depending on experimental results.

Thanks for the help!

I am not sure I understand what you are trying to do. Would a bit-slice approach work for your use case in order to avoid warp-level constructs? FWIW, modern GPUs have a LOP3 instruction that can compute any logical function of three inputs.

Thanks, I will investigate those. I was just reading about warp-level primitives like __ballot_sync() and was wondering whether it makes sense to use them in my use case. Also, lower-level instructions are usually faster than abstracted ones, so I am wondering, for example, whether a bitwise AND is the same as an AND gate, etc.

As far as I can tell on the latest GPU architectures the compiler maps pretty much all logical operations to LOP3, with the zero register as the third input if need be. This makes for great efficiency as any logical expression of three inputs, no matter how complicated, can be mapped to one instruction. But it also makes the generated machine code (SASS) really hard to read for humans, as something like LOP3 0xee is pretty meaningless.
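To make a LUT value such as 0xee a bit less opaque, here is a small host-side C++ sketch (not device code). It assumes the truth-table convention documented for the PTX lop3.b32 instruction, where the 8-bit immediate is obtained by evaluating the desired function on the constants 0xF0, 0xCC and 0xAA. Whether a given SASS LOP3 lists its operands in the same order is an assumption on my part, and the helper names lut_of and lop3_emulate are made up for illustration.

```cpp
#include <cstdint>

// Build the 8-bit lookup table for a three-input bitwise function f(a, b, c)
// by evaluating it on the three truth-table columns 0xF0, 0xCC, 0xAA.
template <typename F>
std::uint8_t lut_of(F f) {
    return static_cast<std::uint8_t>(f(0xF0u, 0xCCu, 0xAAu));
}

// Bit-by-bit emulation of a three-input LUT operation: for every bit position,
// the three input bits form an index (a is the high bit, c the low bit) that
// selects one bit of the lookup table.
std::uint32_t lop3_emulate(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                           std::uint8_t lut) {
    std::uint32_t out = 0;
    for (int bit = 0; bit < 32; ++bit) {
        unsigned idx = (((a >> bit) & 1u) << 2) |
                       (((b >> bit) & 1u) << 1) |
                       ((c >> bit) & 1u);
        out |= ((static_cast<std::uint32_t>(lut) >> idx) & 1u) << bit;
    }
    return out;
}
```

Under that assumption, lut_of applied to b | c gives 0xEE, so a LOP3 with LUT 0xee would simply be an OR of its second and third inputs, and a & ~b comes out as 0x30.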

As far as I know, LOP3 has the same throughput as IADD (integer add), so it has the maximum possible throughput. However, it is always a good idea to check such assumptions by (1) consulting documentation and (2) timing benchmark cases. The internal workings of LOP3 are not documented; one might be tempted to search for a relevant patent that describes it.

For starters, you could use something like thrust::transform_reduce. Zip both arrays together, apply a transformation to compute, for example, TP, and perform a sum reduction.
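As a rough host-only illustration of that zip, transform, sum pattern, the sketch below uses std::inner_product from the C++ standard library as a stand-in. It is not Thrust code; it only mirrors the structure one would hand to thrust::transform_reduce. The names Counts, classify and confusion_counts are hypothetical, and the arrays are assumed to be flattened to 1D, stored as one byte per voxel, and of equal length.

```cpp
#include <numeric>
#include <vector>

// Hypothetical accumulator holding the four confusion-matrix counts.
struct Counts {
    long tp = 0, tn = 0, fp = 0, fn = 0;
};

// Element-wise sum of two accumulators (the "reduce" step).
Counts operator+(const Counts& x, const Counts& y) {
    Counts r;
    r.tp = x.tp + y.tp;
    r.tn = x.tn + y.tn;
    r.fp = x.fp + y.fp;
    r.fn = x.fn + y.fn;
    return r;
}

// Classify one (prediction, gold) pair (the "transform" step).
Counts classify(bool a, bool g) {
    Counts c;
    c.tp = (a && g) ? 1 : 0;
    c.tn = (!a && !g) ? 1 : 0;
    c.fp = (a && !g) ? 1 : 0;
    c.fn = (!a && g) ? 1 : 0;
    return c;
}

// Walk both arrays in lockstep, classify each pair, and sum the results.
// Assumes A and G have the same length.
Counts confusion_counts(const std::vector<unsigned char>& A,
                        const std::vector<unsigned char>& G) {
    return std::inner_product(
        A.begin(), A.end(), G.begin(), Counts{},
        [](const Counts& x, const Counts& y) { return x + y; },
        [](unsigned char a, unsigned char g) { return classify(a != 0, g != 0); });
}
```

Returning all four counts from one transform means the data is read only once, rather than once per metric.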

If you can pack the booleans into a bit array, e.g. 32 values in a single 32-bit integer, you could use popcount for efficient processing. Popcount counts the number of bits that are set to 1 in an integer.

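unsigned int packedA = A[i];            // binary 10011 (i indexes the packed 32-bit words)
unsigned int packedG = G[i];            // binary 10101
unsigned int both = packedA & packedG;  // binary 10001: bits set in both A and G
int TP = __popc(both);                  // 2; __popc is CUDA's device-side population count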

(Note that NAND returns true not only for input (0,0) but also for (0,1) and (1,0). I think you should use NOR instead.)
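Putting the packed idea and the NOR correction together, all four counts for one 32-bit word could look roughly like the host-side C++ sketch below. The names PackedCounts, popcount32 and count_word are made up for illustration, and the valid mask is my own addition: it marks which bits of the word hold real data, since otherwise unused padding bits would be counted as true negatives. In device code, __popc would take the place of popcount32, and a word like this is also what __ballot_sync returns when each of the 32 threads in a warp contributes one predicate.

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical result type: the four counts for one packed word.
struct PackedCounts {
    int tp, tn, fp, fn;
};

// Portable population count; on the GPU, __popc would be used instead.
inline int popcount32(std::uint32_t x) {
    return static_cast<int>(std::bitset<32>(x).count());
}

// a: packed algorithm output, g: packed gold standard,
// valid: mask with a 1 for every bit position that holds real data.
PackedCounts count_word(std::uint32_t a, std::uint32_t g, std::uint32_t valid) {
    PackedCounts c;
    c.tp = popcount32(a & g & valid);     // AND: both set
    c.tn = popcount32(~(a | g) & valid);  // NOR: both clear
    c.fp = popcount32(a & ~g & valid);    // set in A only
    c.fn = popcount32(~a & g & valid);    // set in G only
    return c;
}
```

With the five-bit example above (a = 0x13, i.e. binary 10011; g = 0x15, i.e. binary 10101; valid = 0x1F), this gives TP = 2, TN = 1, FP = 1 and FN = 1, which add up to the five valid bits.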

Thank you, extremely interesting. I will investigate it now!