I am thinking a divide and conquer algorithm where buckets are processed in groups of 32, which matches the warp size.

Then each thread could potentially process a single bucket.

I am thinking maybe use broadcasting (or maybe voting).

Broadcast: broadcast each element from a bucket to the other 31 threads.

That way each thread can check if the broadcast element is inside it’s bucket.

If it’s inside it’s bucket, the bucket gets dismissed.

I am also thinking dynamic parallelism.

So basically each thread block of 32 threads select it’s favor bucket.

So each thread block has 1 remaining bucket.

This could mean the buckets have to be redistributed on next phase.

Another idea could be to use broadcast or voting or something to redistribute bucket id’s to threads/lanes so that dynamic parallism/relaunching is not needed.

But this would probably require a more global view of the situation or some synchronisation point…

A point at which the algorithm knows that part of the buckets are done processing and now it can safely be redistributed.

But the question is basically what is your hardware:

SM 3.0/kepler has a shuffle instruction which might be of some use.

So far I have seen broadcast be used to broadcast a single element to multiple threads.

In this case maybe it’s necessary to broadcast multiple elements to multiple threads.

If the later is not possible:

Perhaps there could be a main thread/master thread… broadcasting elements to other threads… so the threads can check if the element exists or not…

This is basically a naive algorithm ofcourse… but the idea is to try and use parallel features… to try and come up with an “embarrasingly” parallel algorithm… which may or may not be work efficient but at least it happens in parallel.

A single threaded/cpu version could use a bitset ofcourse on the integer set and dismiss buckets who’s integers are conflicting with the bitset.

Such an algorithm is straight forward on the cpu… perhaps this could be a nice “pre-processing” stage to combine cpu and gpu to make a hybrid algorithm:

```
ClearIntegerSet;
SetBucketsToValid;
for B := 0 to Buckets-1 do
begin
for I := 0 to Bucket[B].Integers-1 do
begin
vInteger := Bucket[B].Integer[I];
if IntegerSet[ vInteger ] then
begin
Bucket[B].Valid := False;
end else
begin
IntegerSet[ vInteger ] := True;
end;
end;
end;
```

This uses a hashing like/bucket like technique to quickly determine which integers are unique and which are not and thus which buckets do not contain unique integers.

This quick cpu solution assumes the integer range is limited, (tet id).

This would at least get rid of invalid buckets pretty quickly. Mostly because CPU has larger caches and better random access performance than a GPU… though in this case I am not so sure if the CPU can beat a broadcast like algorithm ;) Also depends on the GPU ofcourse… it’s watt usage etc.

Once the valid buckets remains… all that is left to do is determine the largest bucket size.

Assuming each bucket has a “length” field. This is a linear operation and can also be done very quickly… it can also be parallized… but since it’s speed is probably very fast on a cpu… it’s doubtfull if a gpu can exceed it. The overhead of copieing, launching, copieing results back etc… might not be worth it.

So this focussses the algorithm back on determining valid buckets versus invalid buckets… I think that’s where the main problem lies in your case.

Since the CPU version above is already very simple… it’s worth taking a look at it… to see if it can be made parallel.

First thing I notice is the integer set. I am thinking atomics to prevent race conditions when trying to set an integer/bit to true. That should solve the problem of “wanting to parallelize” the integer setting of the algorithm if that is the part of the algorithm to be parallized.

However it’s also possible to parallize the bucket part of the algorithm. So there are two possibility or even three:

- Process buckets in parallel.
- Process integers in parallel.
- Process both in parallel.

The ammount of resources available on the GPU kind of decides which option to go for.

Keeping scalability in mind it might be interesting to design an algorithm for option 3. But we also want to be realistic and create an algorithm which will run well on today’s hardware/cache designs and so forth.

It therefore feels natural to distribute the buckets among blocks, and the integers among threads.

There should be a shared data structure the “integer set”. This is where the GPU version is going to hurt.

Retrieving the integer set is going to be painfull… the data cache is to small to contain the entire integer set.

One idea which comes to mind is to process the entire bucket structure based on integer range.

This would mean to partially compute the integer set. For example the data cache could be 32 kilobytes. Divide by 4 this would give 8 kilo integers (assuming 32 bit integers, which could actually be 32 bit floating points which offer 24 bits of integer precision).

Thus roughly 8000 should be the integer range.

The entire bucket structure is then processed, all integers falling outside range 0 to 7999 are ignored.

This would imply a branch which would again be painfull and serialize the algorithm which would be bad.

However perhaps there is a trick to “nullify” invalid ranges.

I am thinking:

IntegerValue mod 8000

IntegerValid div 8000

IntegerValid cap 1

if IntegerValid is greater than 1 it would be capped at 1.

This way two integer tables could be created… a valid one and an invalid one.

This would split the cache in 2… but could still be of some use.

So perhaps range 4000 then.

Only the valid range is examined and used… the invalid range is ignored.

This way it can be very quickly determined if buckets contain invalid integers (integers which are not unique across the buckets).

By repeating this process for the remaining ranges… more buckets can be invalidated… until all ranges/the full range is processed.

Needless to say this would multiply the bandwidth by the number of ranges. But it should reduce the latency/memory access costs by quite a lot.

The point being:

If the memory latency is the bottleneck, then such a integer range skipping algorithm could circumvent the latency bottleneck.

However it then depends on other potential bottlenecks: bandwidth and processing.

If the GPU has plenty of bandwidth and plenty of processing/instructions per second… then this would be good.

What actually is the bottleneck is hard to say… it will depend on your problem size, how many buckets, how many integers, it will depend on your graphics card… how much memory, how many cuda cores, clock frequency etc, driver overhead, cuda launch overhead.

Very maybe a spreadsheet could be created to stuff different kinds of algorithm properties/numbers into it… and let the spreadsheet determine which algorithm would be best. The spreadsheet/solver would calculate performance expections/bottlenecks and then determine the best algorithm.

For software purpose: perhaps the final software could have multiple algorithms implemented.

They could either all be tried and the best selected, or the best one could be selected based on data/specifics of the problem size, assuming the calculations are good/precise… otherwise the first approach measuring.

The simplest algorithm could be to process a single bucket per thread, however I would expect latency problems and thus stalling problems… the adventage would be linear reading of each bucket memory/integers… at the expensive of latency problems when accessing the integer set. Even if millions of cuda cores, still bottlenecked by latency in integer set I think… and or hogging of memory system.

Other obsucrities: the hardware might and probably will change in the future.

For example my GT 520 has only a L1 cache as far as I know… (maybe I am wrong)… however apperently GTX 980 now has a secondary cache. L2. Some might not even have a L1 cache or might ignore it.

So some algorithms might have to be revised if more/other data caches available. For now the above algorithm only included considerations of L1 cache and not L2 cache.

Though it could pretend L2 cache to be L1 cache and thus more or less same algorithm.

So bottom line/conclusion could be:

- Simply try out different parallel algorithms and see which works best.

This conclusion kinda sucks… because the same applies to CPUs a little bit…

Apperently it’s hard to predict which algorithm will perform best… plus it may be problem size/data characteristics specific.

It would be funny to know which algorithm, option 1, option 2, option 3, or my whacky "integer range skipping algorithm) would work best ;)

It would also be interesting to compare it to a more high level approach like njuffa’s and common parallel algorithms applied.

I will leave it at that for now.