Allocating threads using bit mask

I have a mask of 496 bits (for extra programming fun it is a triangular matrix)
Typically 30% of the mask is set.
Only for each set bit do I need to do some work.

The current approach is to use a block size of 512 and take the performance hit
of about 363 threads being idle because, either their thread id is above 495
or their bit in the mask is zero.

The next thought was to reduce the block size to 32
and separate the thread id from the work to be done (indexed by p,q).
So each of the 32 threads will be given a p,q index corresponding to
a bit in the mask being set. Once 32 thread warp finishes, we go back to the
mask and find the next collection of 32 set bits,
and so on until we have processed the whole of the mask.
This would take on average about five passes of one warp
(rather than one pass of 16 warps at present).

Given the work to be done is mostly going to be reading global memory
with at present little integer arithmetic, I wanted to know
if other people thought this was worth trying. Even using njuffa’s code,
there will be a fair amount of bit fiddling to convert the bit mask into
a list of (pq) indexes of work to be done.

I am planning to access the 14 arrays in the same order as now.
(The clear bits in the mask typically correspond to elements in these
arrays which are skipped over.)

(Something similar was posted here on 18 Jul 2011, but my problem is smaller
and denser and njuffa had not posted his find_nth_set_bit 12/29/2014 08:48 PM
by 2011.)
Oh also, the sides of the mask are 31, so every column of the mask will fit
into 64 bits, so would it be worth converting find_nth_clear_bit from 32
unsigned int to uint64_t ? Or should I shift each column of the mask to start
at the least significant bit of an unsigned int?

As always any help or guidance would be most welcome

                        Thank you


    Prof. W. B. Langdon
    Department of Computer Science
    University College London
    Gower Street, London WC1E 6BT, UK

GI-2018 deadline 5th February 2018
2018 Humies
EuroGP 2018
choose your background
A Field Guide to Genetic Programming
GP Bibliography

If I understand the problem correctly, I don’t see how limiting the thread count below the amount of parallelism naturally provided by the task at hand would help. Given that modern GPUs provide thousands of CUDA cores even the original block of 512 threads leads to very poor occupancy. Or are there several of these blocks? It isn’t quite clear to me, but I understood there to be only a single such block.

You might want to think about alternative representations to a spare bit vector, but in any event, the basic limitation here seems to an inherent lack of parallelism, not necessarily the amount of computation to be performed per thread. Are there ways to increase the parallelism of this processing step or are the underlying data structures just small in general? Is batch processing a possibility?

Dear njuffa,
Opps sorry my description was not very good. It describes one block.
Typically there are up to 1000 such blocks.


just to report back, it took about 5 man days to code and debug.
My kernel uses 13 more registers and is not noticeably faster:-(

PS: I estimate typically the kernel (ie all blocks) will read up to 600kbytes
(approx 200kb constant, the rest might be updated between kernel calls).
given this is less than the L2 cache (2MB), would it be reasonable to ignore
memory transfers and concentrate upon the, divergent, code?

analysis driven optimization

Dear txbob,
I am not sure I understand.
Are you recommending I do more “analysis driven optimization” ?
Does nvvp support “analysis driven optimization” ?



when I google analysis driven optimization, this is the very first hit I get:

nvvp certainly supports analysis driven optimization