Is it possible to use the CUDA math API integer intrinsics to find the nth unset bit in a 32-bit int?

Because 13! = 6,227,020,800 and up is too large for a 32-bit integer, I have to use 64-bit integers for the factorial decomposition.

For example, each thread in a block generates a set of 64-bit numbers from which the permutation indices are derived, and part of that process involves 64-bit integer subtraction and comparison. The shared memory is also 64-bit because I need to store the permutation number associated with the optimal test value.

Fortunately there are not too many different 64-bit operations; here is an example:

// 12th factoradic digit: c = floor(tnum / 12!), found by repeated subtraction
c = 12;
tmp = 5748019200LL;                              // 12 * 12!
while (tmp > tnum) { tmp -= 479001600LL; c--; }  // 479001600 = 12!

cc = ~B;                                         // bits of not-yet-used elements
for (idx = 0; idx < c; idx++) {
    cc = cc & (cc - 1);                          // clear the c lowest set bits of cc
}
idx = __ffs(cc) - 1;                             // position of the c-th (0-based) unset bit of B

if ((B & D_depend[idx]) == D_depend[idx]) {      // evaluate: all bits that idx depends on already set in B
    temp_perm_value += d_val[idx];
}
B |= (1U << idx);                                // mark element idx as used
tnum -= tmp;                                     // subtract off c * 12!

with tmp and tnum being 64-bit, and the others being 32-bit.
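
For reference, the same digit extraction could also be written as a general routine using a 64-bit division (just a sketch; the fact[] table of precomputed 64-bit factorials is an illustrative name, not from my actual kernel). Whether the division beats the bounded subtraction loop would have to be measured:

// Sketch only: one factoradic digit via 64-bit division instead of the
// subtraction loop above. fact[] is an illustrative precomputed table of
// 64-bit factorials (fact[12] = 479001600LL, etc.).
__device__ int factoradic_digit(unsigned long long *tnum,
                                const unsigned long long fact[],
                                int pos)
{
    int c = (int)(*tnum / fact[pos]);            // c = floor(tnum / pos!)
    *tnum -= (unsigned long long)c * fact[pos];  // reduce tnum by c * pos!
    return c;                                    // selects the c-th unset bit of B
}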

Sorry if I was unclear

Thanks for the clarification. It makes sense that you had to switch to 64-bit integers, but that should have no impact on the relative performance of GTX 980 vs GTX 980M, or am I still missing something?

I thought that the ratio of 64-bit flops to 32-bit flops was higher on the GTX 980 than on the GTX 980M.

Based on CUDA-Z readings, the 32-bit Gflops for the GTX 980 is 5246 and the 64-bit Gflops is 180 (3.4%), while for the GTX 980M the 32-bit Gflops is 3410 and the 64-bit Gflops is 108 (3.1%).

So there is a slightly better ratio for the GTX 980 desktop, but not as much as I initially thought. I used that information to try to explain the performance difference of the 64-bit integer brute-force application.

I was mostly incorrect in my assessment.

You seem to be looking at the floating-point performance reported by CUDA-Z, while your code seems to use integer computations exclusively. There is only a loose connection between the floating-point and the integer performance on the GPU. The only direct connection is that instructions that convert floating-point data into 64-bit integers and vice versa must be sent through the double-precision units as that is the only 64-bit wide datapath.

The cost of emulated 64-bit integer operations ranges from 2x the equivalent 32-bit integer computation in the case of addition or logical operations, to more like 4x in the case of 64-bit division. The biggest factor in that is the number of 32-bit operations required by the emulation sequences.
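
To illustrate where the 2x for simple operations comes from: a 64-bit add is essentially built from two 32-bit additions plus a carry. A simplified sketch in plain C (not actual compiler output) looks like this:

// simplified sketch, not actual compiler output: a 64-bit add emulated with
// two 32-bit additions plus a carry out of the low half
typedef struct { unsigned int lo, hi; } u64_emul;

u64_emul add64 (u64_emul a, u64_emul b)
{
    u64_emul r;
    r.lo = a.lo + b.lo;                  // low 32 bits
    r.hi = a.hi + b.hi + (r.lo < a.lo);  // high 32 bits plus carry
    return r;
}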

For what it is worth, the double-precision hardware throughput on a Maxwell-class consumer GPU is 1/32 of the single-precision hardware throughput; slight deviations in the actual observed performance ratio could be due to benchmarking artifacts of the CUDA-Z program, for example thread blocks distributing unevenly across the SMs.

A related question: if I know that I need to perform this operation

cc=cc&(cc-1);

“c” times (c >= 0 && c < num_elements), is there a better method to accomplish this without using a loop like this:

for(idx=0;idx<c;idx++){
        cc=cc&(cc-1);
}

I am wondering if there is some combination of bit manipulations that can accomplish the same thing. I have not found anything yet, but would appreciate suggestions. Even a modest improvement would end up making a big difference.

I am writing up a document on my latest single-GPU CUDA implementation of evaluating permutations where the number of elements is >= 13.

At this point, evaluating, reducing, and returning an optimal permutation and the corresponding optimal value takes 2.74 seconds for 13 elements (6,227,020,800 memory arrangements, each evaluated in N steps), including memory copies to/from the host.

As a means of comparison, the same algorithm on an i7 CPU overclocked to 4.5 GHz takes about 126 seconds to generate the same evaluation results for 13 elements (using std::next_permutation(), compiled with full -O3 optimizations, VS 2012).

For 14 elements (87,178,291,200 distinct memory arrangements, each evaluated in N steps) it takes about 42 seconds with CUDA on a single stock GTX 980 (not overclocked).

Given that the 4.5 GHz CPU version for 14 elements will take over 30 minutes, I am going to have to run that test at night…

Definitely will give njuffa a mention in my document, as his suggestion made a significant improvement in running time.

It goes to show how different GPU programming is from CPU programming, in that some seemingly ‘trivial’ optimizations can make a large performance difference. And it is always illuminating to determine how and why such an optimization made a difference…

If I understand correctly, you want to clear the first N set bits in a word, starting the search at the least significant bit. If you look at #13 above, I gave some branch-less code for finding the n-th clear bit, starting at the least significant bit. The basic building block for this is the binary population count algorithm. You should be able to use a very similar approach to clear the first N set bits. However the resulting code will likely have similar complexity to the code in #13, which means it probably won’t beat the simple loop.

Yes, and at this point I only really need to consider the first 16 bits (starting from the least significant).

Thanks, I am going to work with that code in your response #13.

Below is working, tested code. Reference function followed by the reworked branch-less code. Instead of computing ‘r’ and then shifting the mask by ‘r’ at the end, you could try shifting a mask by the appropriate amount directly, but I doubt you’ll notice much of a difference in performance. I am still thinking whether any of this machinery can be saved to streamline the code now that the task is somewhat different.

unsigned int clear_n_least_significant_1_bits (unsigned int cc, int c)
{
    int idx;
    for (idx = 0; idx < c; idx++) {
        cc = cc & (cc-1);
    }
    return cc;
}

int my_clear_n_one_bits (unsigned int a, int n)
{
    int t, i = n, r = 0;

    unsigned int mask = 0;
    /* parallel population count: cK holds set-bit counts per K-bit field of 'a' */
    unsigned int c1 = a;
    unsigned int c2 = c1 - ((c1 >> 1) & 0x55555555);
    unsigned int c4 = ((c2 >> 2) & 0x33333333) + (c2 & 0x33333333);
    unsigned int c8 = ((c4 >> 4) + c4) & 0x0f0f0f0f;
    unsigned int c16 = ((c8 >> 8) + c8);
    int c32 = ((c16 >> 16) + c16) & 0x3f;
    /* branch-less binary search for the bit position 'r' of the n-th set bit */
    t = (c16    ) & 0x1f; if (i >= t) { r += 16; i -= t; }
    t = (c8 >> r) & 0x0f; if (i >= t) { r +=  8; i -= t; }
    t = (c4 >> r) & 0x07; if (i >= t) { r +=  4; i -= t; }
    t = (c2 >> r) & 0x03; if (i >= t) { r +=  2; i -= t; }
    t = (c1 >> r) & 0x01; if (i >= t) { r +=  1;         }
    /* keep bits from position r upward; if n >= popcount(a), clear everything */
    if (n < c32) mask = 0xffffffff << r;
    r = a & mask;
    return r;
}
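
A minimal host-side driver along the following lines can be used to check the two functions against each other, exhaustively over 16-bit values (a sketch; it assumes both functions above are in the same file):

#include <stdio.h>

// illustrative test driver: compare the reference loop with the branch-less
// version for every 16-bit pattern and every count 0..16
int main (void)
{
    for (unsigned int a = 0; a < 0x10000; a++) {
        for (int n = 0; n <= 16; n++) {
            unsigned int ref = clear_n_least_significant_1_bits (a, n);
            unsigned int res = (unsigned int)my_clear_n_one_bits (a, n);
            if (ref != res) {
                printf ("mismatch: a=%08x n=%d ref=%08x res=%08x\n", a, n, ref, res);
                return 1;
            }
        }
    }
    printf ("all tests passed\n");
    return 0;
}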

Getting back to the comparison between the current loop approach and the function in response #13: the loop approach is modestly faster than calling the function.

My recently mentioned idea of finding an optimal way to clear n bits is still part of the “find the nth unset bit” objective.

I am going to fiddle with this a bit more and post my results.

Thanks for the clear tested example!

Try a lookup table approach; perhaps it can be stored in shared memory.

The lookup table approach would require one operation per byte (the table would have 256 elements and return the number of unset bits per byte). Let me know how fast this approach is ;)

A lookup table using Constant memory will probably perform better than one that uses Shared memory.
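
For what it's worth, the two placements look roughly like this (a sketch with illustrative names; which one wins depends on the access pattern, since constant memory is broadcast-oriented while shared memory tolerates divergent indices better):

// sketch with illustrative names: a 256-entry byte table in constant memory
// (filled once from the host with cudaMemcpyToSymbol) versus a copy staged
// into shared memory by each block
__constant__ unsigned char c_table[256];

__global__ void lookup_kernel (const unsigned char *g_table,
                               unsigned int *data, int n)
{
    __shared__ unsigned char s_table[256];

    if (threadIdx.x < 256) s_table[threadIdx.x] = g_table[threadIdx.x];  // assumes blockDim.x >= 256
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int v = data[i];
        // either table can be indexed here; constant memory is fastest when a
        // warp reads the same entry, shared memory when the indices diverge
        data[i] = s_table[v & 0xff] + c_table[(v >> 8) & 0xff];
    }
}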

In order to have a lookup table for the first unset bit (assuming 14 elements), the number of entries in the table would be 14*(2^14), which would be a bit too large to be practical for either shared or constant memory.

That is, unless I did not correctly understand how that lookup table would work.

Regardless, thanks for the suggestions.

Code (both CUDA and C++) for evaluation of all array permutations for 13 and 14 elements is posted here, though the site is still a work in progress (web development is not my thing):

https://sites.google.com/site/cudapermutations/

The running time for the algorithm is roughly proportional to (N!*N + K), where N is the number of elements and K is a large constant overhead related to the derivation of the permutation indices. There are quite a number of memory operations, which probably make up most of the running time.

Generation and evaluation of 13! permutations: single 1.3 GHz GTX 980, 2.5 seconds; Intel i7 at 4.5 GHz, 128.7 seconds.

Generation and evaluation of 14! permutations: single 1.3 GHz GTX 980, 38.373 seconds; Intel i7 at 4.5 GHz, 1937.32 seconds.

Used a variation of Jimmy Pettersson’s general reduction structure to distill the return values, which is also explained on the site.
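
The reduction itself is the usual shared-memory pattern, carrying both the best value and the 64-bit permutation number that produced it. A generic sketch of the per-block step (illustrative names and types, not Jimmy Pettersson's actual code):

#include <cfloat>

// generic sketch, not Jimmy Pettersson's actual code: each thread contributes a
// candidate value plus the 64-bit permutation number that produced it, and the
// block reduces them to a single winner (value type assumed float here)
__global__ void reduce_best (const float *val, const unsigned long long *perm,
                             float *block_val, unsigned long long *block_perm,
                             int n)
{
    __shared__ float s_val[256];                 // assumes blockDim.x == 256
    __shared__ unsigned long long s_perm[256];   // 64-bit, as noted earlier

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    s_val[tid]  = (i < n) ? val[i]  : -FLT_MAX;
    s_perm[tid] = (i < n) ? perm[i] : 0ULL;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && s_val[tid + s] > s_val[tid]) {
            s_val[tid]  = s_val[tid + s];        // keep the better candidate
            s_perm[tid] = s_perm[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) {
        block_val[blockIdx.x]  = s_val[0];       // one candidate per block;
        block_perm[blockIdx.x] = s_perm[0];      // a second pass picks the final optimum
    }
}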

This is a ‘proof of concept’ project, not a public application.

The answers are verified via the std::next_permutation() function.

Expect a paper on this in a few months, with a multi-GPU version used to tackle 18!.

Also posted laptop times for the mobile GTX 980M vs an overclocked 3.3 GHz Intel notebook CPU.

It’s a multi-byte lookup using a single-byte lookup table.

You write “first unset bit”, so I will assume this means all bits set to one except one, which would be zero.

So work out all possible bit patterns; I’ll give you a 4-bit lookup table example.

Now for this example I will assume the least significant bit is on the right side and consider it to be the first bit; the table returns the position of the first unset bit.
Bit positions range from 0 to 3; a return value of 4 has a special meaning: more bits need to be examined.

LookupValue : ReturnBitPosition

0000 : 0
0001 : 1
0010 : 0
0011 : 2
0100 : 0
0101 : 1
0110 : 0
0111 : 3
1000 : 0
1001 : 1
1010 : 0
1011 : 2
1100 : 0
1101 : 1
1110 : 0
1111 : 4

Using the lookup table should be easy to understand: stuff the first 4 bits into it and examine the return value; if it’s less than 4, it’s done. If it’s 4, proceed with looking up the next 4 bits and add the return positions together to get the final position.

This example can be scaled up to any number of bits; an 8-bit table to process whole bytes seems practical.

The total number of instructions increases, though, and a branch/while loop might also be less than ideal.

Perhaps the concept of the lookup table can be modified/changed to make a branchless version.

An additional lookup table could be used to create a multiplier, to avoid the branch and to also process all the 4-bit chunks at once. It would look something like this:

The multiplier table returns a 1 if the position-lookup result for the next chunk should be added to the final result, and a zero if it should not be added.

Example of such a multiplier lookup table:
0000 : 0
0001 : 0
0010 : 0
etc.
until
1111 : 1

Only if the multiplier from the first 4 bits is 1 should the result of the next 4 bits be added to the return position. (Some kind of explosions outside are distracting me a bit… hope it’s nothing bad… anyway)…

Pseudo code to use lookup table:

UnSetBitPosition = BitPosition[First4Bits] + (BitPosition[Next4Bits] * Multiplier[First4Bits]) + (BitPosition[Third4Bits] * Multiplier[Next4Bits]) + (BitPosition[Fourth4Bits] * Multiplier[Third4Bits]), etc.; you get the idea.

Whether this will be faster or not remains to be seen. It does feel a little bit like a dependency chain… the values have to be retrieved in a certain order…

Your final byte-wise code would look something like:

UnSetBitPosition = BitPosition[Byte0] +
(BitPosition[Byte1] * Multiplier[Byte0]) +
(BitPosition[Byte2] * Multiplier[Byte1]) +
(BitPosition[Byte3] * Multiplier[Byte2]);

This would be a 32 bit example.

If only 14 bits are required, then:

UnSetBitPosition = BitPosition[Byte0] +
(BitPosition[Byte1] * Multiplier[Byte0]);

Byte1 should be masked off… to zero out the last 2 bits, which would be an AND instruction.
So my guess for the total amount of instructions would be:
3 lookups + 1 addition + 1 multiplication (maybe the multiply and add could be fused together).

5 instructions in total… not too shabby! ;) I am also aware that shared memory has drawbacks… it can be slow or perform badly… not sure why… maybe bank conflicts or it’s just slow… different kinds of storage could be tried to see which one works fastest… and finally the “lookup table” approach could be compared with the “compute” approach… to see which is faster ;)

OK, I noticed a little logic mistake in the example code… the multipliers of all previous bytes should be multiplied together to zero things out properly…

UnSetBitPosition = BitPosition[Byte0] +
(BitPosition[Byte1] * Multiplier[Byte0]) +
(BitPosition[Byte2] * Multiplier[Byte0]*Multiplier[Byte1]) +
(BitPosition[Byte3] * Multiplier[Byte0]*Multiplier[Byte1]*Multiplier[Byte2]);

That should do it… it is probably best to store the running multiplier separately to avoid redoing the previous multiplications ;) this optimization is left as an exercise for the reader ;)

I kinda like optimizing my own code to make sure it can compete well with other solutions, so let’s do that and optimize it:

UnSetBitPosition = BitPosition[Byte0];
Multiplier = MultiplierTable[Byte0];

UnSetBitPosition = UnSetBitPosition + (BitPosition[Byte1] * Multiplier);
Multiplier = Multiplier * MultiplierTable[Byte1];

UnSetBitPosition = UnSetBitPosition + (BitPosition[Byte2] * Multiplier);
Multiplier = Multiplier * MultiplierTable[Byte2];

UnSetBitPosition = UnSetBitPosition + (BitPosition[Byte3] * Multiplier);

That should do it.

Predicted number of instructions:

4 lookups for bit position
3 lookups for multiplier
3 additions
5 multiplications

The number of assignments is hard to tell; sometimes the multiply and the add might happen in place, sometimes not.

For now my guess would be that temporaries are needed for the multiplications with the bit position,
and those would need to be added… so some assignments to the temps are needed… at least 3 assignments.
3 assignments/moves.

total number of instructions: 18 or so.

number of bits processed: 32
number of instructions: 18
instructions per bit: 18 / 32 = 0.5625

or vice versa:
bits per instruction: 1.7777

Funny numbers :)
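
Written out as plain C, the whole scheme would look something like this (a sketch; the init routine and exact table layout are illustrative and untested):

// sketch of the byte-wise lookup scheme described above
static unsigned char BitPosition[256];      // position of the lowest zero bit in the byte, 8 if none
static unsigned char MultiplierTable[256];  // 1 if the byte is all ones (search must continue)

void init_tables (void)
{
    for (int b = 0; b < 256; b++) {
        int p = 0;
        while ((p < 8) && (b & (1 << p))) p++;  // find the lowest zero bit of the byte
        BitPosition[b] = (unsigned char)p;
        MultiplierTable[b] = (b == 0xff);
    }
}

int first_unset_bit (unsigned int v)
{
    unsigned int Byte0 =  v        & 0xff;
    unsigned int Byte1 = (v >>  8) & 0xff;
    unsigned int Byte2 = (v >> 16) & 0xff;
    unsigned int Byte3 = (v >> 24) & 0xff;

    int UnSetBitPosition = BitPosition[Byte0];
    int Multiplier = MultiplierTable[Byte0];

    UnSetBitPosition += BitPosition[Byte1] * Multiplier;
    Multiplier *= MultiplierTable[Byte1];

    UnSetBitPosition += BitPosition[Byte2] * Multiplier;
    Multiplier *= MultiplierTable[Byte2];

    UnSetBitPosition += BitPosition[Byte3] * Multiplier;

    return UnSetBitPosition;    // 32 means every bit of v is set
}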

Interesting, I need to read this a few times to understand.

Thanks!

It has nothing to do with elements. I am not sure what you mean by elements? At first I thought you meant bits, but now that I have read your other postings a bit more it seems you want to try all kinds of combinations. The amount or number of elements is irrelevant. It is a general problem and has to do with the binary code, not your elements. This problem and solution can be applied to any binary value.

Thus, as long as you can express your elements as binary values (which is what all values inside a computer eventually are), and as long as your description of “bits” and “first bit unset” etc. refers to that binary code, with the elements following the binary numbering and such, this method should work.

So simply stuff your elements into an 8-bit, 16-bit, 32-bit or 64-bit register or integer… and then start feeding it into the tables or routine…

Value = YourElement

Either mask the value off to extract the first 8 bits, then shift it down and re-apply the mask to extract the next byte, or use a byte pointer to index each byte out of that value variable.
C supports this kind of indexing, so something like:

unsigned char *p;
p = (unsigned char *)&value;
Byte0 = p[0];
Byte1 = p[1];
Byte2 = p[2];
Byte3 = p[3];