Is it possible to use logic gates in warp level primitives?

jakub.mitura14 · September 1, 2021, 5:18am

I have two three-dimensional boolean arrays, the segmentation output of an algorithm (array A) and a gold standard mask (array G), that I pass into a kernel. I would like to compare them, and in theory logic gates could be the fastest solution.
I need to count true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). So, for some coordinate x’:
TP = A[x’]==1 && G[x’]==1 -- hence AND gate
TN = A[x’]==0 && G[x’]==0 -- hence NAND gate

FP = A[x’]==1 && G[x’]==0 -- hence (1 AND A[x’]) AND (0 NAND G[x’])
FN = A[x’]==0 && G[x’]==1 -- hence (0 NAND A[x’]) AND (1 AND G[x’])

Hence the output from such a warp should be a vector with 4 entries: [count(TP), count(TN), count(FP), count(FN)].
Is it possible, and does it make sense, to apply such things in warp-level primitives, or maybe in the kernel itself? There is very little data to exchange between threads, so warp-level synchronization seems better than block synchronization.

Later I would use sum-reduction kernel methods, maybe cooperative groups, depending on experimental results.

Thanks for the help!

njuffa · September 1, 2021, 6:05am

I am not sure I understand what you are trying to do. Would a bit-slice approach work for your use case in order to avoid warp-level constructs? FWIW, modern GPUs have a LOP3 instruction that can compute any logical function of three inputs.

jakub.mitura14 · September 1, 2021, 6:47am

Thanks, I will investigate those. I was just reading about warp-level primitives like __ballot_sync(), and was wondering whether it makes sense to use them in my use case. Also, lower-level instructions are usually faster than more abstracted ones, so I am wondering, for example, whether a bitwise AND is the same as an AND gate, etc.?

njuffa · September 1, 2021, 6:55am

As far as I can tell on the latest GPU architectures the compiler maps pretty much all logical operations to LOP3, with the zero register as the third input if need be. This makes for great efficiency as any logical expression of three inputs, no matter how complicated, can be mapped to one instruction. But it also makes the generated machine code (SASS) really hard to read for humans, as something like LOP3 0xee is pretty meaningless.
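To make an immediate such as 0xee less opaque: per the PTX documentation for lop3.b32, the 8-bit immediate is a truth table indexed by the three input bits, and it can be derived by evaluating the desired function on the constants 0xF0, 0xCC and 0xAA. Below is a host-side C++ sketch that emulates this bit by bit. `lop3_emulated` and the `kLut...` constants are hypothetical names written for illustration, not a CUDA API, and the sketch says nothing about how the hardware implements the instruction.

```cpp
#include <cstdint>

// Hypothetical host-side emulation of a three-input lookup-table operation.
// For every bit position, the bits of a, b and c form a 3-bit index
// (a is the most significant bit), and the result bit is bit `index` of lut.
inline std::uint32_t lop3_emulated(std::uint32_t a, std::uint32_t b,
                                   std::uint32_t c, std::uint8_t lut) {
    std::uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        const unsigned index = (((a >> bit) & 1u) << 2) |
                               (((b >> bit) & 1u) << 1) |
                               ((c >> bit) & 1u);
        result |= static_cast<std::uint32_t>((lut >> index) & 1u) << bit;
    }
    return result;
}

// The table for a function F(a, b, c) is F evaluated on these constants.
constexpr std::uint8_t kTa = 0xF0, kTb = 0xCC, kTc = 0xAA;

constexpr std::uint8_t kLutOrBC  = kTb | kTc;                              // 0xEE: b | c
constexpr std::uint8_t kLutAndAB = kTa & kTb;                              // 0xC0: a & b
constexpr std::uint8_t kLutNorAB = static_cast<std::uint8_t>(~(kTa | kTb)); // 0x03: ~(a | b)
```

Read this way, 0xee is simply "second input OR third input", and the AND and NOR needed for TP and TN are each a single table value.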

As far as I know, LOP3 has the same throughput as IADD (integer add) so it has maximum throughput possible. However it is always a good idea to check such assumptions by (1) consulting documentation and (2) timing benchmark cases. The internal workings of LOP3 are not documented; one might be tempted to search for a relevant patent that describes it.

striker159 · September 1, 2021, 7:52am

For starters, you could use something like thrust::transform_reduce. Zip both arrays together, apply a transformation to compute, for example, TP, and perform a sum reduction.
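As a host-side sketch of the same zip, transform, reduce pattern, the following uses std::transform_reduce from the C++17 standard library, which has the same overall shape as the Thrust call. The `Counts` struct and `count_confusion` are hypothetical names introduced here, the inputs are assumed to be equally long, and any nonzero value is treated as true. A Thrust version would additionally need device iterators and device-callable functors, which this sketch does not show.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical accumulator for the four confusion-matrix counts.
struct Counts {
    std::size_t tp = 0, tn = 0, fp = 0, fn = 0;
};

// Assumes a and g have the same length; nonzero means "true".
inline Counts count_confusion(const std::vector<std::uint8_t>& a,
                              const std::vector<std::uint8_t>& g) {
    return std::transform_reduce(
        a.begin(), a.end(), g.begin(), Counts{},
        // Reduction step: element-wise sum of two partial results.
        [](const Counts& x, const Counts& y) {
            return Counts{x.tp + y.tp, x.tn + y.tn, x.fp + y.fp, x.fn + y.fn};
        },
        // Transform step: classify one (prediction, ground truth) pair.
        [](std::uint8_t av, std::uint8_t gv) {
            const bool p = av != 0;
            const bool t = gv != 0;
            return Counts{static_cast<std::size_t>(p && t),
                          static_cast<std::size_t>(!p && !t),
                          static_cast<std::size_t>(p && !t),
                          static_cast<std::size_t>(!p && t)};
        });
}
```

Computing all four counts in one pass avoids reading the two arrays four times.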

If you can pack the booleans into a bit array, e.g. 32 values in a single 32-bit integer, you could use popcount for efficient processing. Popcount counts the number of bits that are set to 1 in an integer (in CUDA, the __popc intrinsic does this for 32-bit values).

// i indexes packed words; each word holds 32 boolean values, one per bit
unsigned int packedA = A[i]; // e.g. 0x10011
unsigned int packedG = G[i]; // e.g. 0x10101
unsigned int both = packedA & packedG; // 0x10001
int TP = __popc(both); // 2

(Note that NAND does not return true only for input (0,0); it is also true for (0,1) and (1,0). I think you should use NOR instead.)
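Extending the snippet above to all four counts, here is a host-side C++ sketch, assuming each 32-bit word packs 32 voxels. `popcount32`, `PackedCounts` and `count_packed` are hypothetical names, and `total_bits` is an assumed parameter for handling a partially filled last word; on the device, `__popc` would play the role of `popcount32`. TN uses NOR, as noted, and has to be masked because the unused padding bits of the last word would otherwise be counted as true negatives.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical accumulator for the four confusion-matrix counts.
struct PackedCounts {
    std::size_t tp = 0, tn = 0, fp = 0, fn = 0;
};

// Portable stand-in for a hardware population count.
inline std::size_t popcount32(std::uint32_t x) {
    return std::bitset<32>(x).count();
}

// a and g hold 32 voxels per word (same length assumed);
// total_bits is the number of valid voxels across all words.
inline PackedCounts count_packed(const std::vector<std::uint32_t>& a,
                                 const std::vector<std::uint32_t>& g,
                                 std::size_t total_bits) {
    PackedCounts c;
    for (std::size_t w = 0; w < a.size(); ++w) {
        const std::size_t first = w * 32;
        if (first >= total_bits) break;
        // Mask that keeps only the valid bits of this word.
        const std::size_t remaining = total_bits - first;
        const std::uint32_t mask =
            remaining >= 32 ? 0xFFFFFFFFu : ((1u << remaining) - 1u);
        const std::uint32_t av = a[w] & mask;
        const std::uint32_t gv = g[w] & mask;
        c.tp += popcount32(av & gv);            // AND
        c.fp += popcount32(av & ~gv);           // A and not G
        c.fn += popcount32(~av & gv);           // not A and G
        c.tn += popcount32(~(av | gv) & mask);  // NOR, padding excluded
    }
    return c;
}
```

Since the four counts always sum to the number of valid voxels, one of them could also be derived from the other three instead of being counted.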

jakub.mitura14 · September 6, 2021, 6:43pm

Thank you, extremely interesting. I will investigate it now!
