Converting bit mask into list

Suppose I have a large array in device memory. Each byte in the array is a bit mask of 1’s and 0’s. Most bits are 0, 1’s are random and fairly rare (on average, say, only 1 out of 1000 bit is set.)

I need to convert this into a sorted list that contains positions of set bit. This conversion is a major bottleneck in my program.

I’ve been able to cook up a function that gets up to about 15 GB/s on my video card (GTX 560), but I’m wondering if there’s a standard solution, or maybe something in thrust, that can do it faster.

Sounds like a case of “use scan”.

Calculating positions for each bit that is set by using the scan.

Then have all threads check their positions and the bit.

All threads which have their bit set, allocate a node and connect themselfes to the node before and after them if any.

(This will create a doubly linked list as you seem to request ?)

(Could also be just singled linked list)

If with “list” you ment an array containing the positions of all bits set, then after scan you already done. See search for “pre fix sum scan”.

Not a single or double linked list. Just an array of type ‘int’.

I don’t have a standard solution nor a library routine, but I wonder if you can take advantage of the fact that combining [font=“Courier New”]__clz()[/font] (or [font=“Courier New”]__ffs()[/font] for the other direction) and [font=“Courier New”]__ballot()[/font] you can find the first bit set in 1024 bits with only a handful of instructions.