Is the GPU better than the CPU for what I want?

Okay, I have a problem.

I have n buckets of integers. Each bucket is also named after an integer.

So bucket 0 can contain, for example, 0, 1, 5, 13, 7, 8, 2,…

Basically, any integer in any order can be found in a bucket.

What I want to do is find a combination of buckets such that no two chosen buckets share any contents. I also want the algorithm to favor bucket size (bigger is better). And by "find", I mean something like setting a boolean array to true at each chosen bucket's index. We're "nominating" buckets here: we want the combination that uses the largest buckets possible while ensuring that each number appears in only one nominated bucket.

Can the GPU even do something like this efficiently compared to the CPU? I've been thinking about it, and a CPU version is significantly easier to code; it also seems like it'd be more performant.

I am having a difficult time figuring out the specifications of your algorithm. Is it a knapsack problem? If not, is there a name by which it is commonly known? If there is, you may want to go to Google Scholar (https://scholar.google.com) and search for the name of the algorithm plus terms like “GPU” and “CUDA”.

Okay, I’ll go into more specifics then.

What makes my meshing special is that I'm not scared of points being "on" a tetrahedron. This means a point can be directly on the surface or edge of a tetrahedron and, as fate would have it, tetrahedra can share these edges and faces. Only two can share a face (duh!) but any number can share an edge. It sort of looks like a bouquet of tetrahedra in practice. Kind of pretty.

So in my case, I need to decide which points I should insert into the mesh. The easiest way to manage all of this is to use buckets. Each point is basically an index in a memory buffer and tetrahedra are the same way with the mesh. They’re all just offsets in a malloc’d buffer.

So I’m using the point ids as bucket ids and the bucket contents are the tetrahedron ids. The only problem is, only one point can fracture a tetrahedron. So if a point is nominated for insertion into the mesh, we have to make sure that no two points (or more) are being inserted into the same tetrahedron.

Here’s an example below:

buckets[0] = { 0, 1, 2 };
buckets[1] = { 3, 1, 6, 63 };
buckets[2] = { 17, 19 };
buckets[3] = { 99, 100, 101, 102, 103 };

Okay, let’s drop the “find the biggest bucket” thing. That might be too hard. Let’s just start with conflict resolution.

I want an algorithm that’d output valid bucket nominations.

Buckets 0 and 1 can’t both be nominated because they share some of the same contents. So it’s either bucket 0 is nominated or bucket 1 is nominated.

Let me parrot that back: Given a set of integers A, find all sets of integers B_0, …, B_n whose intersection with A is the empty set. From among B_0, …, B_n, return the set with the greatest cardinality.

How many sets are we talking about? If there are thousands to tens of thousands of sets, this would seem well suited to parallel execution on the GPU. In a first step, in parallel, intersect all possible sets with set A, marking the sets whose intersection with A is empty. We could follow this up by stream compaction (to extract all the sets whose intersection with A is empty) followed by reduction to find the set with maximum cardinality amongst those.

But it is probably easier to simply assign a cardinality of -1 to those sets whose intersection with A is not the empty set, and then do a reduction for finding the set with maximum cardinality.
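In Thrust terms, assuming the per-bucket scores had already been computed into a device vector as described (a sketch, not drop-in code; conflicting buckets scored -1):

#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// scores[i] = cardinality of bucket i, or -1 if it intersects A
thrust::device_vector<int> scores(4);
scores[0] = 3; scores[1] = -1; scores[2] = 2; scores[3] = 5;

// the iterator's offset is the index of the winning bucket
thrust::device_vector<int>::iterator it =
    thrust::max_element(scores.begin(), scores.end());
const int best_bucket = it - scores.begin();   // 3
const int best_size   = *it;                   // 5; -1 would mean no valid bucket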

njuffa, as always, your insights have been awesome!

Though I must admit I had to google “cardinality”. Out of curiosity, are you like a math person? I’ve only seen that term in actual math courses in college.

Anyway, let’s talk about implementation details now.

I’m able to build the buckets relatively quickly.

The buckets are stored in this manner:

const int size_bucket_contents = /* ... */ ;
const int num_buckets = /* ... */ ;

thrust::device_vector<int> bucket_contents(size_bucket_contents),
                           which_bucket(size_bucket_contents), // which bucket bucket_contents[i] is in
                           bucket_starts(num_buckets);

/*
    relevant assignment code
*/

// sort everything by which bucket they're in
thrust::sort_by_key(which_bucket.begin(), which_bucket.end(), bucket_contents.begin());

/*
    bucket_starts works like this:
    for bucket id i, bucket i begins in bucket_contents
    at index bucket_starts[i]; hence bucket_starts has
    size num_buckets
*/

find_bucket_starts<<< blocks_per_grid, threads_per_block >>>(/* ... */);

Well, this is one way to do it. Basically, the bucket contents have an array aligned to them that marks which bucket each element belongs to.

So it sounds like what I should do is:

  1. Sort by bucket size to get the initial largest bucket(s)
  2. Compare it to the remaining sets for intersections
  3. Remove intersecting sets
  4. Of the remaining sets, mark the largest one(s)
  5. Repeat until empty or all marked

Does that sound about right?

I have a university degree in computer science, does that answer your question?

There seems to be no need to sort anything. I was envisioning something like this:

phase I:

Given: A set of integers A, m sets of integers B, plus an array of ‘float’ R of length m in global memory. Start m threads. Each thread i forms the intersection of set B[i] with A, in the process determining the cardinality of B[i]. If the intersection is empty, thread i writes R[i] = cardinality(B[i]) + 1.0f, else it writes R[i] = 0.0f.

phase II:

Reduction of the array R[i], applying the max() operation, while keeping track of the position of the maximum element. This can be done with the CUBLAS operation ISAMAX, which finds the index j of the element with the maximum absolute value in a 1-D array of ‘float’. Since all R[i] are non-negative, this works fine here. The desired set is B[j] and the cardinality of B[j] is (R[j] - 1), in case the latter information is needed.
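For reference, the ISAMAX step with the cuBLAS v2 API could look like the sketch below (d_R is assumed to already hold R in device memory; note that cublasIsamax returns a 1-based index):

#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);

int j1;                               // 1-based index of max |R[i]|
cublasIsamax(handle, m, d_R, 1, &j1);

const int j = j1 - 1;                 // 0-based index of the chosen set B[j]
cublasDestroy(handle);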

Left as an exercise for the reader: Intersecting B[i] with A, while determining cardinality(B[i]) at the same time. The specifics will depend on the range and distribution of the cardinality and how the sets are represented. Some data structures for sets may include cardinality directly so it is a known input and you simply need to focus on finding the intersection of A with B[i].
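For instance, a minimal one-thread-per-set sketch of that step, under the assumption that A and every B[i] are stored as sorted arrays (B[i] occupying b[offsets[i]] onward with known length lens[i], so cardinality is a known input; all names are illustrative):

__global__
void phase1(const int *a, int lenA,
            const int *b, const int *offsets, const int *lens,
            int m, float *R)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;

    const int *bi  = b + offsets[i];
    const int  len = lens[i];

    // merge-style scan: valid because both arrays are sorted ascending
    bool empty = true;
    for (int ja = 0, jb = 0; ja < lenA && jb < len; )
    {
        if      (a[ja] < bi[jb]) ++ja;
        else if (a[ja] > bi[jb]) ++jb;
        else { empty = false; break; }        // common element found
    }

    R[i] = empty ? (float) (len + 1) : 0.0f;  // cardinality + 1, or 0 on conflict
}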

Obvious variations of the above scheme are possible. Maybe each thread handles several sets, or one set is handled by an entire warp, etc.

I am thinking a divide and conquer algorithm where buckets are processed in groups of 32, which matches the warp size.

Then each thread could potentially process a single bucket.

I am thinking maybe use broadcasting (or maybe voting).

Broadcast: broadcast each element from a bucket to the other 31 threads.

That way each thread can check if the broadcast element is inside its bucket.

If it’s inside its bucket, the bucket gets dismissed.

I am also thinking dynamic parallelism.

So basically each thread block of 32 threads selects its favored bucket.

So each thread block has 1 remaining bucket.

This could mean the buckets have to be redistributed on next phase.

Another idea could be to use broadcast or voting or something to redistribute bucket ids to threads/lanes so that dynamic parallelism/relaunching is not needed.

But this would probably require a more global view of the situation or some synchronisation point…

A point at which the algorithm knows that part of the buckets are done processing and now it can safely be redistributed.

But the question is basically: what is your hardware?

SM 3.0/Kepler has a shuffle instruction which might be of some use.
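For example, here is a tiny sketch of a warp-level broadcast-and-vote, assuming each lane holds one integer from its own bucket (my_elem and src are illustrative names):

// lane `src` broadcasts its integer; the warp votes on whether any other
// lane's bucket holds the same value
__device__
bool warp_has_duplicate(int my_elem, int src)
{
    const int  lane  = threadIdx.x & 31;
    const int  probe = __shfl(my_elem, src);               // Kepler broadcast
    const bool hit   = (lane != src) && (probe == my_elem);
    return __any(hit);                                     // warp-wide vote
    // on CUDA 9+ this becomes __shfl_sync / __any_sync with a lane mask
}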

So far I have seen broadcast be used to broadcast a single element to multiple threads.

In this case maybe it’s necessary to broadcast multiple elements to multiple threads.

If the latter is not possible:

Perhaps there could be a main thread/master thread… broadcasting elements to other threads… so the threads can check if the element exists or not…

This is basically a naive algorithm of course… but the idea is to try and use parallel features… to try and come up with an “embarrassingly” parallel algorithm… which may or may not be work-efficient but at least it happens in parallel.

A single-threaded/CPU version could of course use a bitset on the integer set and dismiss buckets whose integers conflict with the bitset.

Such an algorithm is straightforward on the CPU… perhaps this could be a nice “pre-processing” stage to combine CPU and GPU into a hybrid algorithm:

ClearIntegerSet;

SetBucketsToValid;

for B := 0 to Buckets-1 do
begin
  for I := 0 to Bucket[B].Integers-1 do
  begin
    vInteger := Bucket[B].Integer[I];
    if IntegerSet[ vInteger ] then
    begin
      Bucket[B].Valid := False;
    end else
    begin
      IntegerSet[ vInteger ] := True; 
    end;
  end;
end;

This uses a hashing-like/bucketing technique to quickly determine which integers are unique and which are not, and thus which buckets do not contain unique integers.

This quick CPU solution assumes the integer range is limited (tet id).

This would at least get rid of invalid buckets pretty quickly, mostly because the CPU has larger caches and better random-access performance than a GPU… though in this case I am not so sure the CPU can beat a broadcast-like algorithm ;) It also depends on the GPU of course… its watt usage, etc.

Once only the valid buckets remain… all that is left to do is determine the largest bucket size.

Assuming each bucket has a “length” field, this is a linear operation and can be done very quickly… it can also be parallelized… but since it’s probably already very fast on a CPU, it’s doubtful a GPU can exceed it. The overhead of copying, launching, copying results back, etc. might not be worth it.

So this focuses the algorithm back on determining valid buckets versus invalid buckets… I think that’s where the main problem lies in your case.

Since the CPU version above is already very simple… it’s worth taking a look at it to see if it can be made parallel.

First thing I notice is the integer set. I am thinking atomics to prevent race conditions when trying to set an integer/bit to true. That should solve the problem of parallelizing the integer-setting part of the algorithm, if that is the part to be parallelized.

However it’s also possible to parallelize the bucket part of the algorithm. So there are two or even three possibilities:

  1. Process buckets in parallel.
  2. Process integers in parallel.
  3. Process both in parallel.

The amount of resources available on the GPU kind of decides which option to go for.

Keeping scalability in mind it might be interesting to design an algorithm for option 3. But we also want to be realistic and create an algorithm which will run well on today’s hardware/cache designs and so forth.

It therefore feels natural to distribute the buckets among blocks, and the integers among threads.
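A skeleton of that distribution could look like this, under the assumption that a per-integer “occurs more than once” table has already been built (all names illustrative): one block per bucket, threads striding over that bucket’s integers.

__global__
void invalidate_buckets(const int *flat, const int *starts,
                        const int *dup, int *bucket_valid)
{
    const int b     = blockIdx.x;      // one block per bucket
    const int begin = starts[b];
    const int end   = starts[b + 1];

    for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
        if (dup[flat[i]])              // integer occurs in another bucket too?
            bucket_valid[b] = 0;       // benign race: every writer stores 0
}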

There should be a shared data structure, the “integer set”. This is where the GPU version is going to hurt.

Retrieving the integer set is going to be painful… the data cache is too small to contain the entire integer set.

One idea which comes to mind is to process the entire bucket structure based on integer range.

This would mean partially computing the integer set. For example, the data cache could be 32 kilobytes. Divided by 4 bytes, this gives 8K integers (assuming 32-bit integers, which could actually be 32-bit floating points, which offer 24 bits of integer precision).

Thus roughly 8000 should be the integer range.

The entire bucket structure is then processed, and all integers falling outside the range 0 to 7999 are ignored.

This would imply a branch, which would again be painful and serialize the algorithm, which would be bad.

However perhaps there is a trick to “nullify” invalid ranges.

I am thinking:

IntegerIndex := IntegerValue mod 8000;
IntegerValid := IntegerValue div 8000;
if IntegerValid > 1 then IntegerValid := 1;

So IntegerValid is capped at 1, ending up as 0 for the current range and 1 for everything outside it.

This way two integer tables could be created… a valid one and an invalid one.

This would split the cache in 2… but could still be of some use.

So perhaps range 4000 then.

Only the valid range is examined and used… the invalid range is ignored.

This way it can be very quickly determined if buckets contain invalid integers (integers which are not unique across the buckets).

By repeating this process for the remaining ranges… more buckets can be invalidated… until all ranges/the full range is processed.

Needless to say, this would multiply the bandwidth used by the number of ranges. But it should reduce the latency/memory-access costs by quite a lot.
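A minimal single-threaded sketch of one such pass-per-range scheme (names are illustrative; RANGE would be sized to the cache):

const int RANGE = 8000;
int Seen[RANGE];

for (int lo = 0; lo < MaxInteger; lo += RANGE)
{
    for (int j = 0; j < RANGE; j++) Seen[j] = 0;      // reset per pass

    for (int b = 0; b < Buckets; b++)
        for (int i = 0; i < BucketIntegers[b]; i++)
        {
            const int v = Bucket[b][i];
            if (v < lo || v >= lo + RANGE) continue;  // outside this pass
            if (Seen[v - lo]) BucketValid[b] = 0;     // duplicate in range
            else              Seen[v - lo] = 1;
        }
}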

The point being:

If memory latency is the bottleneck, then such an integer-range-skipping algorithm could circumvent the latency bottleneck.

However it then depends on other potential bottlenecks: bandwidth and processing.

If the GPU has plenty of bandwidth and plenty of processing/instructions per second… then this would be good.

What the bottleneck actually is is hard to say… it will depend on your problem size (how many buckets, how many integers), and it will depend on your graphics card… how much memory, how many CUDA cores, clock frequency, etc., plus driver overhead and CUDA launch overhead.

Maybe a spreadsheet could be created to stuff different kinds of algorithm properties/numbers into… and let the spreadsheet determine which algorithm would be best. The spreadsheet/solver would calculate performance expectations/bottlenecks and then determine the best algorithm.

For software purposes: perhaps the final software could have multiple algorithms implemented.

They could either all be tried and the best one selected, or the best one could be selected based on the data/specifics of the problem size, assuming the calculations are good/precise… otherwise fall back to the first approach, measuring.

The simplest algorithm could be to process a single bucket per thread; however, I would expect latency problems and thus stalling problems… the advantage would be linear reading of each bucket’s memory/integers… at the expense of latency problems when accessing the integer set. Even with millions of CUDA cores, it would still be bottlenecked by latency in the integer set, I think… and/or hogging of the memory system.

Other obscurities: the hardware might, and probably will, change in the future.

For example, my GT 520 has only an L1 cache as far as I know… (maybe I am wrong)… however, apparently the GTX 980 now has a secondary cache, L2. Some might not even have an L1 cache or might ignore it.

So some algorithms might have to be revised if more/other data caches are available. For now the above only considered the L1 cache and not the L2 cache.

Though one could treat the L2 cache as if it were the L1 cache and use more or less the same algorithm.

So bottom line/conclusion could be:

  1. Simply try out different parallel algorithms and see which works best.

This conclusion kinda sucks… because the same applies to CPUs a little bit…

Apparently it’s hard to predict which algorithm will perform best… plus it may be specific to the problem size/data characteristics.

It would be funny to know which algorithm, option 1, option 2, option 3, or my wacky “integer range skipping” algorithm, would work best ;)

It would also be interesting to compare it to a more high level approach like njuffa’s and common parallel algorithms applied.

I will leave it at that for now.

Whoa… O_o

Okay, I can try some of that. I’m going to reread the algorithms before I decide which one to implement first.

Oh yeah, I’m not sure if it was emphasized enough in my other posts, but the main set (set A, in njuffa’s example) grows. We choose a non-conflicting set and append it to the set we’re testing against (A).

So technically, the set grows as the algorithm continues and as other sets are marked as “alive” or “dead”.

I’m not sure how much this changes things.

Assuming 32-bit integers, 512 MB is enough to store an entire integer set (bitset) at 1 bit per integer (2^32 bits = 512 MB; a 31-bit range would need 256 MB).

So with today’s hardware that should be good.

The rest of the memory can then be used for the buckets…

Memory constraints are one thing (though it is nice to know that I can fit everything in something manageable [I have a 750 Ti, so 2 GB of VRAM is plenty]) but I was more or less concerned about global synchronization.

The code here seems to suggest choosing buckets one at a time, which means I’d have to call the kernel repeatedly from the host (the easiest way to emulate global synchronization). I’m curious whether repeatedly calling these kernels would degrade performance, or if I’m focusing too much on the small details at this point.
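For what it’s worth, the relaunch loop itself is simple; here is a sketch with a hypothetical kernel nominate_round that sets a device flag whenever it changes a bucket’s status (the device-to-host copy on the default stream doubles as the global synchronization point):

int h_changed = 1;
while (h_changed)
{
    cudaMemset(d_changed, 0, sizeof(int));
    nominate_round<<< bpg, tpb >>>(/* ... */, d_changed);
    cudaMemcpy(&h_changed, d_changed, sizeof(int),
               cudaMemcpyDeviceToHost);   // blocks until the kernel finishes
}

Kernel launch overhead is on the order of microseconds, so unless the number of rounds gets large, the relaunching itself is usually not the main cost.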

I wish I had more time to code lol T_T

The only part which probably needs “race condition protection” is the bitset/integer set.

That’s what atomics are for… however, since these are bits, I’m not sure how that would work out.
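One possibility, sketched below under the assumption that the bitset is packed into unsigned ints: atomicOr returns the old word, so a thread can set a bit and learn whether it was already set in a single operation.

// atomic test-and-set of bit k in a packed bitset
__device__
bool test_and_set_bit(unsigned int *bitset, int k)
{
    const unsigned int mask = 1u << (k & 31);                  // bit in word
    const unsigned int old  = atomicOr(&bitset[k >> 5], mask); // word = k / 32
    return (old & mask) != 0;                                  // seen before?
}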

I’m not entirely sure that’s the only issue. I’m not sure how well I emphasized this point, but the algorithm needs to find a union of sets such that no two sets contain mutual elements.

What njuffa posted is good, but it only finds one set, which means I have to unite it with the original set and then retest and repeat until all sets are either tossed aside or part of the union.

I might have to reread your post though because it was quite lengthy and I’m not sure if you addressed it or not. If you did, my bad. I’ll read more carefully in the future.

The Delphi code/algorithm is so simple that it’s no problem for me to convert it to a C/C++ version.

Now that it ran, I noticed a problem with it, which is kinda funny: it doesn’t mark both conflicting buckets as invalid.

So for now the algorithm is slightly wrong, but perhaps it can be corrected :)

// Test.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"


const int Buckets = 4;
const int Integers = 5;
const int MaxInteger = 104;

int _tmain(int argc, _TCHAR* argv[])
{	
	int Bucket[Buckets][Integers];	
	int BucketValid[Buckets];
	int BucketIntegers[Buckets];

	int IntegerSet[MaxInteger];

	int vIndex;
	int vBucketIndex;
	int vBucketIntegerIndex;
	int vInteger;

	int vBiggestBucketIntegers;
	int vBiggestBucketIndex;

//    buckets[0] = { 0, 1, 2 };
//    buckets[1] = { 3, 1, 6, 63 };
//    buckets[2] = { 17, 19 };
//    buckets[3] = { 99, 100, 101, 102, 103 };

	// bucket 0
	Bucket[0][0] = 0;
	Bucket[0][1] = 1;
	Bucket[0][2] = 2;
	BucketIntegers[0] = 3;

	// bucket 1
	Bucket[1][0] = 3;
	Bucket[1][1] = 1;
	Bucket[1][2] = 6;
	Bucket[1][3] = 63;
	BucketIntegers[1] = 4;

	// bucket 2
	Bucket[2][0] = 17;
	Bucket[2][1] = 19;
	BucketIntegers[2] = 2;

	// bucket 3
	Bucket[3][0] = 99;
	Bucket[3][1] = 100;
	Bucket[3][2] = 101;
	Bucket[3][3] = 102;
	Bucket[3][4] = 103;
	BucketIntegers[3] = 5;

	// set all buckets to true
	for (vIndex = 0; vIndex < Buckets; vIndex++)
	{
		BucketValid[vIndex] = 1;
	}

	// clear integer set
	for (vIndex = 0; vIndex < MaxInteger; vIndex++)
	{
		IntegerSet[vIndex] = 0;
	}

	// problem with algorithm, bucket 0 should have been marked as invalid too ! :)
	for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
	{
		for (vBucketIntegerIndex = 0; vBucketIntegerIndex < BucketIntegers[vBucketIndex]; vBucketIntegerIndex++ )
		{
			vInteger = Bucket[vBucketIndex][vBucketIntegerIndex];

			if (IntegerSet[vInteger] == 1)
			{
				BucketValid[vBucketIndex] = 0;
			} else
			{
				IntegerSet[vInteger] = 1;
			}
		}
	}

	for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
	{
		printf("BucketValid[%d]: %d \n", vBucketIndex, BucketValid[vBucketIndex] );
	}

	// find biggest valid bucket:
	vBiggestBucketIntegers = -1;
	vBiggestBucketIndex = -1;

	for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
	{
		// only consider buckets that are still marked valid
		if (BucketValid[vBucketIndex] == 1 &&
			BucketIntegers[vBucketIndex] > vBiggestBucketIntegers)
		{
			vBiggestBucketIntegers = BucketIntegers[vBucketIndex];
			vBiggestBucketIndex = vBucketIndex;
		}
	}

	if (vBiggestBucketIndex >= 0)
	{
		printf("Biggest Bucket: %d \n", vBiggestBucketIndex );
	} else
	{
		printf("No valid buckets \n");
	}


	return 0;
}

This will print out:

BucketValid[0]: 1
BucketValid[1]: 0
BucketValid[2]: 1
BucketValid[3]: 1
Biggest Bucket: 3

The zero indicates bucket 1 is indeed invalid.

My first idea to correct my algorithm is kinda cheesy… simply walk across the invalid buckets… and check the valid buckets to see if they contain integers from the invalid buckets, marking those as invalid too… perhaps this would be enough to fix it.

(Which is probably no better, or maybe slightly better, than the brute force approach? Hmmm… maybe it’s interesting to try a brute force approach first because it could be parallelized?)

Ok, this fixes the algorithm somewhat lol… it’s kinda funny… at least there now is a brute-force-like algorithm and my honor has somewhat been saved lol:

// Test.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

const int Buckets = 4;
const int Integers = 5;
const int MaxInteger = 104;

int _tmain(int argc, _TCHAR* argv[])
{	
	int Bucket[Buckets][Integers];	
	int BucketValid[Buckets];
	int BucketIntegers[Buckets];

	int IntegerSet[MaxInteger];

	int vIndex;
	int vBucketIndex;
	int vBucketIntegerIndex;
	int vInvalidBucketIndex;
	int vInvalidBucketIntegerIndex;
	int vInteger;

	int vBiggestBucketIntegers;
	int vBiggestBucketIndex;

//    buckets[0] = { 0, 1, 2 };
//    buckets[1] = { 3, 1, 6, 63 };
//    buckets[2] = { 17, 19 };
//    buckets[3] = { 99, 100, 101, 102, 103 };

	// bucket 0
	Bucket[0][0] = 0;
	Bucket[0][1] = 1;
	Bucket[0][2] = 2;
	BucketIntegers[0] = 3;

	// bucket 1
	Bucket[1][0] = 3;
	Bucket[1][1] = 1;
	Bucket[1][2] = 6;
	Bucket[1][3] = 63;
	BucketIntegers[1] = 4;

	// bucket 2
	Bucket[2][0] = 17;
	Bucket[2][1] = 19;
	BucketIntegers[2] = 2;

	// bucket 3
	Bucket[3][0] = 99;
	Bucket[3][1] = 100;
	Bucket[3][2] = 101;
	Bucket[3][3] = 102;
	Bucket[3][4] = 103;
	BucketIntegers[3] = 5;

	// set all buckets to true
	for (vIndex = 0; vIndex < Buckets; vIndex++)
	{
		BucketValid[vIndex] = 1;
	}

	// clear integer set
	for (vIndex = 0; vIndex < MaxInteger; vIndex++)
	{
		IntegerSet[vIndex] = 0;
	}

	// first pass: same as before (on its own, this does not yet mark bucket 0 as invalid)
	for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
	{
		for (vBucketIntegerIndex = 0; vBucketIntegerIndex < BucketIntegers[vBucketIndex]; vBucketIntegerIndex++ )
		{
			vInteger = Bucket[vBucketIndex][vBucketIntegerIndex];

			if (IntegerSet[vInteger] == 1)
			{
				BucketValid[vBucketIndex] = 0;
			} else
			{
				IntegerSet[vInteger] = 1;
			}
		}
	}

	// fix for algorithm, brute force approach-like:
	for (vInvalidBucketIndex = 0; vInvalidBucketIndex < Buckets; vInvalidBucketIndex++)
	{
		if (BucketValid[vInvalidBucketIndex] == 0)
		{
			// check valid buckets, if they contain integers from invalid buckets mark them as invalid too
			for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
			{
				for (vInvalidBucketIntegerIndex = 0; vInvalidBucketIntegerIndex < BucketIntegers[vInvalidBucketIndex]; vInvalidBucketIntegerIndex++ )
				{
					for (vBucketIntegerIndex = 0; vBucketIntegerIndex < BucketIntegers[vBucketIndex]; vBucketIntegerIndex++ )
					{
						if (Bucket[vInvalidBucketIndex][vInvalidBucketIntegerIndex] == Bucket[vBucketIndex][vBucketIntegerIndex])
						{
							BucketValid[vBucketIndex] = 0;
							break;
						}
					}
					if (BucketValid[vBucketIndex] == 0) break;
				}
				// (no break here: the remaining buckets must still be
				// checked against this invalid bucket)
			}
		}
	}
		
	for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
	{
		printf("BucketValid[%d]: %d \n", vBucketIndex, BucketValid[vBucketIndex] );
	}

	// find biggest valid bucket:
	vBiggestBucketIntegers = -1;
	vBiggestBucketIndex = -1;

	for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
	{
		// only consider buckets that are still marked valid
		if (BucketValid[vBucketIndex] == 1 &&
			BucketIntegers[vBucketIndex] > vBiggestBucketIntegers)
		{
			vBiggestBucketIntegers = BucketIntegers[vBucketIndex];
			vBiggestBucketIndex = vBucketIndex;
		}
	}

	if (vBiggestBucketIndex >= 0)
	{
		printf("Biggest Bucket: %d \n", vBiggestBucketIndex );
	} else
	{
		printf("No valid buckets \n");
	}

	return 0;
}

BucketValid[0]: 0
BucketValid[1]: 0
BucketValid[2]: 1
BucketValid[3]: 1
Biggest Bucket: 3

From the looks of it, it’s now a four-nested-for-loop algorithm… so it seems to be roughly O(n^4) compute complexity.

Here is version 2 of the algorithm. Instead of using a bitset, it uses a counter per integer
(reducing the four nested loops back to two):

// Test.cpp : Defines the entry point for the console application.
//

// algorithm version 2

#include "stdafx.h"

const int Buckets = 4;
const int Integers = 5;
const int MaxInteger = 104;

int _tmain(int argc, _TCHAR* argv[])
{	
	int Bucket[Buckets][Integers];	
	int BucketValid[Buckets];
	int BucketIntegers[Buckets];

	int IntegerCount[MaxInteger];

	int vIndex;
	int vBucketIndex;
	int vBucketIntegerIndex;
	int vInvalidBucketIndex;
	int vInvalidBucketIntegerIndex;
	int vInteger;

	int vBiggestBucketIntegers;
	int vBiggestBucketIndex;

//    buckets[0] = { 0, 1, 2 };
//    buckets[1] = { 3, 1, 6, 63 };
//    buckets[2] = { 17, 19 };
//    buckets[3] = { 99, 100, 101, 102, 103 };

	// bucket 0
	Bucket[0][0] = 0;
	Bucket[0][1] = 1;
	Bucket[0][2] = 2;
	BucketIntegers[0] = 3;

	// bucket 1
	Bucket[1][0] = 3;
	Bucket[1][1] = 1;
	Bucket[1][2] = 6;
	Bucket[1][3] = 63;
	BucketIntegers[1] = 4;

	// bucket 2
	Bucket[2][0] = 17;
	Bucket[2][1] = 19;
	BucketIntegers[2] = 2;

	// bucket 3
	Bucket[3][0] = 99;
	Bucket[3][1] = 100;
	Bucket[3][2] = 101;
	Bucket[3][3] = 102;
	Bucket[3][4] = 103;
	BucketIntegers[3] = 5;

	// set all buckets to true
	for (vIndex = 0; vIndex < Buckets; vIndex++)
	{
		BucketValid[vIndex] = 1;
	}

	// clear integer count
	for (vIndex = 0; vIndex < MaxInteger; vIndex++)
	{
		IntegerCount[vIndex] = 0;
	}

	// count integers from all buckets
	for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
	{
		for (vBucketIntegerIndex = 0; vBucketIntegerIndex < BucketIntegers[vBucketIndex]; vBucketIntegerIndex++ )
		{
			vInteger = Bucket[vBucketIndex][vBucketIntegerIndex];

			IntegerCount[vInteger]++;
		}
	}

	// algorithm version 2: a bucket is invalid if any of its integers occurs more than once
	for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
	{
		for (vBucketIntegerIndex = 0; vBucketIntegerIndex < BucketIntegers[vBucketIndex]; vBucketIntegerIndex++ )
		{
			vInteger = Bucket[vBucketIndex][vBucketIntegerIndex];

			if (IntegerCount[vInteger] > 1)
			{
				BucketValid[vBucketIndex] = 0;
			}
		}
	}

	for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
	{
		printf("BucketValid[%d]: %d \n", vBucketIndex, BucketValid[vBucketIndex] );
	}

	// find biggest valid bucket:
	vBiggestBucketIntegers = -1;
	vBiggestBucketIndex = -1;

	for (vBucketIndex = 0; vBucketIndex < Buckets; vBucketIndex++)
	{
		// only consider buckets that are still marked valid
		if (BucketValid[vBucketIndex] == 1 &&
			BucketIntegers[vBucketIndex] > vBiggestBucketIntegers)
		{
			vBiggestBucketIntegers = BucketIntegers[vBucketIndex];
			vBiggestBucketIndex = vBucketIndex;
		}
	}

	if (vBiggestBucketIndex >= 0)
	{
		printf("Biggest Bucket: %d \n", vBiggestBucketIndex );
	} else
	{
		printf("No valid buckets \n");
	}

	return 0;
}

Depending on the maximum integer value, this could still be of some use.
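For what it’s worth, the counting pass of version 2 is essentially a histogram, which maps naturally onto the GPU; a sketch, assuming the bucket contents are flattened into one array (names illustrative):

__global__
void count_integers(const int *flat_contents, int n, int *counts)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&counts[flat_contents[i]], 1);  // counts[] pre-zeroed, size MaxInteger
}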

I will definitely read your posts later, Skybuck. I don’t have much time right now, but this is what I’ve been working on this morning:

#include <iostream>

#include <curand_kernel.h>
#include <thrust/sort.h>

#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>


const int tpb = 256,
		  bpg = 512,
		  grid_size = tpb * bpg;

enum { alive, dead, unsure };

template<typename T>
__forceinline__
T* raw_cast(const thrust::device_ptr<T> x)
{
	return x.get();
}

__forceinline__
__device__
int currThreadID(void)
{
	return blockIdx.x * blockDim.x + threadIdx.x;
}

__global__
void find_nonconflicting(const int num_buckets,
						 int *status,
						 const int *bucket_starts,
						 int *marked_tets,
						 const int *bucket_contents)
{
	const int thread_id = currThreadID();

	for (int tid = thread_id; tid < num_buckets; tid += grid_size)
	{
		// if this bucket has already been dealt with, return
		if (status[tid] == alive || status[tid] == dead)
			return;

		const int begin = (tid > 0 ? bucket_starts[tid - 1] : 0),
		          end   = bucket_starts[tid];

		for (int i = begin; i < end; ++i)
		{

			printf("thread %d is adding 1 to location %d\n", tid, bucket_contents[i]);
			if (marked_tets[bucket_contents[i]] >= 1)
			{
				for ( ; i >= begin; --i)
					atomicSub(marked_tets + bucket_contents[i], 1);

				return;
			}

			atomicAdd(marked_tets + bucket_contents[i], 1);
		}
	}
}

__global__
void find_boundaries(const int  num_keys,
                     const int  num_buckets,
                     const int *which_bucket,
                           int *bucket_starts)
{
    const int thread_id = currThreadID();

    for (int tid = thread_id; tid < num_keys; tid += grid_size)
    {
        // get start and end of each bucket
        const int begin = (tid > 0 ? which_bucket[tid - 1] : 0);
        const int end   = which_bucket[tid];

        // if bucket has length > 0...
        if (begin != end)
        {
            for (int i = begin; i < end; ++i)
            {
                // sets bucket starts value to index of bucket
                bucket_starts[i] = tid;
            }
        }

        // last thread writes number of elements to the rest
        // of the array
        if (tid == num_keys - 1)
        {
            for (int i = end; i < num_buckets; ++i)
            {
                bucket_starts[i] = num_keys;
            }
        }
    }
}

__global__
void init_bucket_contents(int *which_bucket,
		                  int *bucket_contents,
		                  const int contents_size,
		                  const int num_buckets)
{
	const int thread_id = currThreadID();

	for (int tid = thread_id; tid < contents_size; tid += grid_size)
	{
		unsigned int seed = tid;
		curandState s;
		curand_init(seed, 0, 0, &s);

		const float flt_rand_bucket  = curand_uniform(&s),
			        flt_rand_content = curand_uniform(&s);

		//if (tid == 0 || tid == 1 || tid == 2 || tid == 4 || tid == 8)
			//printf("%f\n", flt_rand_bucket);

		const int int_rand_bucket  = (int ) (flt_rand_bucket * num_buckets),
			      int_rand_content = (int ) (flt_rand_content * contents_size);

		which_bucket[tid] = int_rand_bucket;
		bucket_contents[tid] = int_rand_content;
	}
}

void nominate_points(void)
{
	const int contents_size = 10;
	const int num_buckets = 4;

	thrust::device_vector<int> which_bucket(contents_size),
							   bucket_contents(contents_size),
							   bucket_starts(num_buckets),
							   status(num_buckets, unsure),
							   marked_tets(contents_size, 0);

	/*init_bucket_contents<<< bpg, tpb>>>
			            (raw_cast(which_bucket.data()),
						 raw_cast(bucket_contents.data()),
						 contents_size,
						 num_buckets);

	thrust::sort_by_key(which_bucket.begin(), which_bucket.end(), bucket_contents.begin());*/

	which_bucket[0] = 0;
	which_bucket[1] = 0;
	which_bucket[2] = 0;
	which_bucket[3] = 1;
	which_bucket[4] = 1;
	which_bucket[5] = 2;
	which_bucket[6] = 2;
	which_bucket[7] = 2;
	which_bucket[8] = 3;
	which_bucket[9] = 3;

	bucket_contents[0] = 0;
	bucket_contents[1] = 1;
	bucket_contents[2] = 2;
	bucket_contents[3] = 2;
	bucket_contents[4] = 3;
	bucket_contents[5] = 4;
	bucket_contents[6] = 5;
	bucket_contents[7] = 6;
	bucket_contents[8] = 4;
	bucket_contents[9] = 7;

	find_boundaries<<< bpg, tpb >>>
				   (contents_size,
	                num_buckets,
	                raw_cast(which_bucket.data()),
	                raw_cast(bucket_starts.data()));

	find_nonconflicting<<< bpg, tpb >>>
					   (num_buckets,
						raw_cast(status.data()),
						raw_cast(bucket_starts.data()),
						raw_cast(marked_tets.data()),
						raw_cast(bucket_contents.data()));

	cudaDeviceSynchronize();

	for (int i = 0; i < 10; ++i)
		std::cout << which_bucket[i] << " => " << bucket_contents[i] << " nm : " << marked_tets[i] << std::endl;
}

int main(void)
{
	nominate_points();

	return 0;
}

It seems to work somewhat, except one of my values in marked_tets is 2 instead of 1, and I’m not sure why; maybe because the >= 1 check and the atomicAdd aren’t a single atomic operation, so two threads can both pass the check before either increments. I also thought my marked_tets might need to be volatile to work properly, but I’m not sure.

Okay, here’s the kernel I finally came up with. I think it’s O(n) time but the reads are kind of strided in global memory so that’s one immediate bottleneck.

I think I can remedy it, though, which I’ll talk about after I post the kernel:

__forceinline__
__device__
int currThreadID(void)
{
	return blockIdx.x * blockDim.x + threadIdx.x;
}

__global__
void find_nonconflicting(const int num_buckets,
						 int *status,
						 const int *bucket_starts,
						 int *marked_tets,
						 const int *bucket_contents,
						 int *nm) // per-entry nomination flags (assumed initialized to 1)
{
	const int thread_id = currThreadID();

	// one thread per bucket (the early returns below assume grid_size >= num_buckets)
	for (int tid = thread_id; tid < num_buckets; tid += grid_size)
	{
		// I might cut out this part of the code
		// if this bucket has already been dealt with, return
		if (status[tid] == alive || status[tid] == dead)
			return;

		// get start and end indices of the bucket contents
		const int begin = (tid > 0 ? bucket_starts[tid - 1] : 0),
		          end   = bucket_starts[tid];

		// iterate the bucket
		for (int i = begin; i < end; ++i)
		{
			// id of the tetrahedron is the content of the bucket
			const int tetra_id = bucket_contents[i];

			// get the value of the boolean int array
			// if it's 0, awesome, this thread marked the tet first
			// if it's 1, this thread is too late
			const int old = atomicCAS(marked_tets + tetra_id, 0, 1);

			// if this bucket contains a previously nominated tetrahedron,
			// we de-nominate the whole bucket
			if (old == 1)
			{
				for (int j = begin; j < end; ++j)
					--nm[j];

				return;
			}
		}

	}
}

Basically, the bucket contents are a 1D array, and which bucket each element is in is marked by another 1D array aligned to it. We launch a separate kernel to find the bucket boundaries, which creates bucket_starts (of size num_buckets). For more info, check out this paper, because I copied the algorithm from it: http://idav.ucdavis.edu/~dfalcant/downloads/dissertation.pdf

marked_tets is a kind of boolean array. I decided to drop the idea of finding the largest buckets because it’s too much work for probably not that much of a gain.

I use thread races to nominate buckets, basically. If a thread is the first to nominate a tetrahedron to have a point inserted into it, that bucket “wins”. I think this code should work.

I tried it with this input:

which_bucket[0] = 0;    bucket_contents[0] = 0;
which_bucket[1] = 0;    bucket_contents[1] = 1;
which_bucket[2] = 0;    bucket_contents[2] = 2;
which_bucket[3] = 1;    bucket_contents[3] = 2;
which_bucket[4] = 1;    bucket_contents[4] = 3;
which_bucket[5] = 2;    bucket_contents[5] = 4;
which_bucket[6] = 2;    bucket_contents[6] = 5;
which_bucket[7] = 2;    bucket_contents[7] = 6;
which_bucket[8] = 3;    bucket_contents[8] = 4;
which_bucket[9] = 3;    bucket_contents[9] = 7;

And I got this output:

0 => 0 nm : 0
0 => 1 nm : 0
0 => 2 nm : 0
1 => 2 nm : 1
1 => 3 nm : 1
2 => 4 nm : 1
2 => 5 nm : 1
2 => 6 nm : 1
3 => 4 nm : 0
3 => 7 nm : 0

So the algorithm seems to work and I’m happy. I’m not sure if it’s the most efficient but it works.

One optimization I want to make is to read segments of the bucket contents using int4 or other vector types. This way, the striding costs are diminished.
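A hypothetical sketch of that read, assuming bucket_contents is 16-byte aligned and the [begin, end) range is a multiple of 4 (real code would have to peel the unaligned head and tail):

const int4 *contents4 = reinterpret_cast<const int4 *>(bucket_contents);

for (int i = begin / 4; i < end / 4; ++i)
{
    const int4 v = contents4[i];  // one 128-bit load instead of four 32-bit loads
    // ... check v.x, v.y, v.z, v.w as four tetrahedron ids ...
}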