Partitioning an array of n-tuples

I’d like to partition an array of n-tuples based on whether one of the elements of a tuple is a specified number m.

For example, the sequence of 2-tuples:

 1 , 4 , 
 1 , 3 , 
 2 , 3 , 
 1 , 2 , 
 2 , 2 , 
 3 , 2 , 
 1 , 1 , 
 2 , 1 , 
 3 , 1 , 
 4 , 1

will be partitioned as follows if m = 4:

 1 , 4 , 
 4 , 1 ,_________ 
 1 , 3 , 
 2 , 3 , 
 1 , 2 , 
 2 , 2 , 
 3 , 2 , 
 1 , 1 , 
 2 , 1 , 
 3 , 1

I’ve considered using Thrust; unfortunately, the examples given are based on partitioning arrays of scalars. My data is not laid out as a structure of arrays because I think doing so would kill coalescing of read requests, since the elements of a single tuple would then be equally spaced throughout the array. The partition can either be in-place or produce a copy of the input – whichever is easier to implement.

How might I proceed with this problem? I’m open to working with Thrust and/or CUB.

You can use Thrust’s partition algorithm over generic iterators, so as long as you’re able to create a random-access iterator to your data, you’re fine.

You haven’t posted how your code is implemented, but consider looking at something like zip_iterator, which you can indeed successfully partition over.
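For instance, here is a minimal sketch (with made-up vector names, assuming the two components of each 2-tuple are stored in two separate device vectors) that partitions the sequence from your question on whether either element equals m:

#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// true if either element of the 2-tuple equals m
struct has_m
{
    int m;
    has_m( int m ) : m( m ) {}

    __host__ __device__
    bool operator()( const thrust::tuple<int,int>& t ) const
    {
        return thrust::get<0>( t ) == m || thrust::get<1>( t ) == m;
    }
};

int main()
{
    int hx[] = { 1, 1, 2, 1, 2, 3, 1, 2, 3, 4 };   // first elements
    int hy[] = { 4, 3, 3, 2, 2, 2, 1, 1, 1, 1 };   // second elements
    thrust::device_vector<int> x( hx , hx + 10 );  // SoA storage on the device
    thrust::device_vector<int> y( hy , hy + 10 );

    // view the two columns as a sequence of 2-tuples and partition in place;
    // stable_partition preserves the relative order of the remaining tuples
    thrust::stable_partition(
        thrust::make_zip_iterator( thrust::make_tuple( x.begin() , y.begin() ) ),
        thrust::make_zip_iterator( thrust::make_tuple( x.end()   , y.end()   ) ),
        has_m( 4 ) );

    return 0;
}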

I know of the zip_iterator. The problem is that this iterator is meant to be used to create an SoA out of an AoS. I think I’m going to have to implement my own iterator based on thrust::iterator_adaptor.

It’s unclear whether by “this iterator” you meant a zip iterator or some other iterator that you have in mind.

With respect to the zip iterator, I would say a zip iterator creates an AoS out of an underlying SoA.

Each dereference of the zip iterator produces a tuple. The tuple is effectively the desired structure. The zip iterator creates one of these “structures” on the fly, each time it is dereferenced, effectively creating an Array of Structures (AoS).

In normal CUDA programming, an actual underlying AoS storage scheme is considered the “bad” data organization method for efficient access. The SoA is the “good” method.

Therefore, the beauty of a zip iterator is that it allows for underlying SoA storage and its associated “good” access characteristics, while providing the programmer a view of the data as if it were organized as AoS, for convenient programming (e.g. handling the color components of a pixel).

Getting back to your problem, it’s unclear what the data organization actually is and what access patterns you expect.

But you can certainly use thrust to partition the data you have shown if it is organized as shown (which I would call an AoS).
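For example, a minimal sketch (hypothetical struct and function names) of partitioning an AoS of 2-tuples directly with Thrust, moving tuples that contain m to the front:

#include <thrust/device_vector.h>
#include <thrust/partition.h>

struct pair2 { int a, b; };           // hypothetical 2-tuple stored as a structure

struct pair_has_m
{
    int m;
    pair_has_m( int m ) : m( m ) {}

    __host__ __device__
    bool operator()( const pair2& p ) const
    {
        return p.a == m || p.b == m;  // true -> moved to the front partition
    }
};

void partition_pairs( thrust::device_vector<pair2>& d , int m )
{
    // in-place partition; stable_partition keeps the order of the remaining tuples
    thrust::stable_partition( d.begin() , d.end() , pair_has_m( m ) );
}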

Hi txbob,

By “this iterator” I was referring to the zip iterator. And yes, you are right, it ‘creates’ an AoS from an SoA.

I know the common advice is to lay out data as an SoA, but I fail to see how that is beneficial in the present case. I’m of the opinion, correct me if I am wrong, that an SoA is preferred if each element of the original structure is to be accessed independently. However, each one of the n-tuples that I’m working with is always accessed in its entirety, in operations such as the partition predicate. If this is true, what would be the advantage of storing the tuples as an SoA like so:

x1 x2 … y1 y2 … z1 z2 …

It seems to me that accessing all the elements of a single such tuple would involve far-flung, non-coalesced reads. Therefore perhaps the following layout is preferred:

x1 y1 z1 x2 y2 z2 …

Therefore I’m considering implementing a strided iterator for sequentially or randomly traversing entire tuples.

Do please correct me if you think my reasoning is flawed.

It’s difficult to speak in generalities without seeing actual code to determine the precise access patterns.

Thrust is built on top of CUDA C/C++ (at least, if you are using the CUDA backend). So if you are talking about a thrust tuple, I would liken it to a structure – there is no concept of a tuple in CUDA C/C++ (even if there were, we would have to understand how tuple access decomposes into fundamental read/write transactions - for that understanding it is sufficient to liken it to a structure).

We must ask, then, what does that structure look like, and how exactly is it accessed?

Suppose I have a struct like this:

struct my_struct{
  int x;
  int y;
  int z;
};

Let’s say I have an array (AoS) of those:

my_struct A[1024];

Now what does the thread code look like? By your statement (“…each one of the n-tuples that I’m working with is always accessed in entirety…”) I presume you mean some (CUDA thread) code like this:

my_struct my_tuple = A[idx];

What access patterns will be generated as a result of that? First of all, since this particular tuple happens to be 96 bits, there is no way for the CUDA compiler to resolve that into a single access.(*) At best (assuming packed storage) it must do a 64-bit access followed by a 32-bit access. In your packed storage case:

x1y1z1x2y2z2x3y3z3...
1 1 2 1 1 2 1 1 2...

we would have a result pattern where adjacent threads in the warp reading adjacent tuples would access first the elements marked by 1, then, later, in another read transaction, the elements marked by 2. So this is less than ideal efficiency - the profiler would indicate this.

If your tuple happened to be exactly 32, 64, or 128 bits in length, and you’re feeling lucky (betting that the compiler will figure out that this structure access can be combined into an elementary read), then you would get ideal, efficient access across the warp. The way to remove the luck would be to take the extra step of creating a union with int4, or some other construct in the struct that makes it painfully obvious to the compiler that you intend to load it all in a single transaction.
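Something along these lines (a sketch with made-up names, for a tuple that happens to be exactly 128 bits):

// Sketch: a 128-bit tuple aliased with int4 so the load is explicitly one
// 128-bit transaction per thread.
struct my_struct4
{
    int x, y, z, w;           // exactly 128 bits
};

union tuple_alias
{
    my_struct4 s;
    int4       v;             // int4 loads/stores map to single 128-bit accesses
};

__global__ void sum_tuples( const tuple_alias* A , int* out , int n )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if( idx < n )
    {
        tuple_alias t;
        t.v = A[idx].v;                        // one coalesced 128-bit load per thread
        out[idx] = t.s.x + t.s.y + t.s.z + t.s.w;
    }
}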

But barring the above specific cases, every other tuple configuration will result in inefficient access, to some degree. So AoS is risky, at best. In the general case, it simply cannot, and will not guarantee efficient access - pretty much no matter what you do in thread code.

If we convert to SoA, even if you intend to use the entire “structure” together, we can always, trivially, easily enforce perfectly optimal access, for any struct/tuple configuration:

struct my_struct2 {
  int x1[1024];
  float x2[1024];
  double x3[1024];
  bool x4[1024];
  char x5[1024];
};

struct my_tuple {
  int x1;
  float x2;
  double x3;
  bool x4;
  char x5;
};

my_struct2 A;

access code:

my_tuple t;
t.x1 = A.x1[idx];
t.x2 = A.x2[idx];
t.x3 = A.x3[idx];
t.x4 = A.x4[idx];
t.x5 = A.x5[idx];

// then process t...

This will always be perfectly efficient access, for fundamental types up to 128 bits, and assuming SoA storage:

x1x1x1…x2x2x2…x3x3x3…

which of course is implied in the SoA definition:

my_struct2 A;

And, beautifully, magically, the thrust zip_iterator concept takes the above messy thread access code in the SoA case, and turns it into something clean, and easy to understand from the programmers view, almost as if we were back to our clean looking AoS access code:

my_struct my_tuple = A[idx];
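A sketch of that (made-up names, mirroring the my_struct2 columns above): the storage stays SoA, but the functor sees one “tuple” at a time.

#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

struct process_tuple
{
    template <typename Tuple>
    __host__ __device__
    void operator()( Tuple t )              // t behaves like one structure
    {
        int   x1 = thrust::get<0>( t );
        float x2 = thrust::get<1>( t );
        bool  x4 = thrust::get<3>( t );
        if( x4 )
            thrust::get<1>( t ) = x1 * x2;  // write back through the reference
    }
};

void run( thrust::device_vector<int>&    d_x1 ,
          thrust::device_vector<float>&  d_x2 ,
          thrust::device_vector<double>& d_x3 ,
          thrust::device_vector<bool>&   d_x4 ,
          thrust::device_vector<char>&   d_x5 )
{
    // underlying storage is SoA (coalesced); each dereference yields a tuple view
    thrust::for_each(
        thrust::make_zip_iterator( thrust::make_tuple( d_x1.begin() , d_x2.begin() ,
                                                       d_x3.begin() , d_x4.begin() , d_x5.begin() ) ),
        thrust::make_zip_iterator( thrust::make_tuple( d_x1.end() , d_x2.end() ,
                                                       d_x3.end() , d_x4.end() , d_x5.end() ) ),
        process_tuple() );
}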

I think that, in general, generic terms like “tuple”, “iterator”, and “predicate” don’t provide enough specificity to understand this clearly. It’s necessary at least to get to the C code level, and perhaps even to the SASS code level, to understand what is efficient and what isn’t.

(*) http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

Hi txbob.

Let me throw some more light on the problem that I’m trying to solve:

The kernels that I hope to write should be able to handle tuples of arbitrary size. The tuples contain integers only, and the dimension can be as large as 30. It is for these reasons that the input data is presented in array form and not as an explicit AoS.

The operations that I would like to perform on the tuples are as follows:

  1. Intersection, i.e. for every pair of tuples (A and B), create a new tuple from the element-wise minimum of A and B, as follows:
// Sequential code (ushort = unsigned short; std::min from <algorithm>)
void intersect( ushort* src , ushort* dst , ushort dim , unsigned size )
{
	unsigned id = 0;

	for( unsigned i = 0; i < size - 1; ++i )
	{
		ushort* A = src + dim * i;

		for( unsigned j = i + 1; j < size; ++j )
		{
			ushort* B = src + dim * j;

			// element-wise minimum of tuples A and B
			for( ushort k = 0; k < dim; ++k )
			{
				dst[id++] = std::min( A[k] , B[k] );
			}
		}
	}
}

This intersection step is performed during each of the dim - 1 iterations of the algorithm. (Each iteration involves intersection, partition, filtering, etc. – see below.) The memory access pattern performed by this kernel is clearly non-sequential. I have already written this kernel and have already asked a question about how to implement this “all pairs” kernel.

  2. Partition (this is the step that I’m currently working on):
    Given an array of tuples T, partition T based on whether each tuple it contains has m as its maximal value (4 in my original question) – see the sketch after this list.

Another step (TODO) in each iteration of the algorithm is:

  3. Filter: given two arrays of tuples T and U, remove all tuples in U that are present in T.
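For the partition step, since dim can be as large as 30 (more than a zip iterator’s tuple comfortably handles), the rough sketch I have in mind (untested, made-up names, assuming the tuples are packed one after another in a single ushort array) is to partition row indices with a custom predicate and gather afterwards:

#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/sequence.h>

typedef unsigned short ushort;

struct row_max_is_m
{
    const ushort* data;   // size * dim elements, one tuple after another
    ushort dim;
    ushort m;

    row_max_is_m( const ushort* data , ushort dim , ushort m )
        : data( data ) , dim( dim ) , m( m ) {}

    __host__ __device__
    bool operator()( unsigned row ) const
    {
        const ushort* t = data + (size_t)row * dim;
        ushort mx = 0;
        for( ushort k = 0; k < dim; ++k )
            mx = t[k] > mx ? t[k] : mx;
        return mx == m;   // true -> this tuple goes to the front partition
    }
};

void partition_rows( thrust::device_vector<ushort>& tuples ,
                     unsigned size , ushort dim , ushort m ,
                     thrust::device_vector<unsigned>& rows )
{
    rows.resize( size );
    thrust::sequence( rows.begin() , rows.end() );    // 0, 1, 2, ...

    thrust::stable_partition( rows.begin() , rows.end() ,
        row_max_is_m( thrust::raw_pointer_cast( tuples.data() ) , dim , m ) );

    // 'rows' now lists the indices of tuples whose maximum equals m first;
    // a simple gather step can then materialise the reordered copy of the input.
}

The reads inside the predicate are per-row and therefore not coalesced, but this sidesteps the tuple-size limit and produces the reordered copy that I said was acceptable.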

Wow, 30 elements per tuple? Yeesh.

Yeah, you can’t use Thrust for that. Their tuples cap out at like 10 elements :P

txbob, I’m adopting an SoA data layout/access pattern, but I still need to implement a strided iterator. Please take a look at my latest question.
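For reference, the kind of strided range I have in mind is along the lines of the strided_range example that ships with Thrust – a sketch (untested here), where the stride would be dim in my case:

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/iterator_traits.h>

// strided view of a range: first, first + stride, first + 2*stride, ...
// usage: strided_range<Iterator> r( first , last , stride ); then r.begin()/r.end()
template <typename Iterator>
class strided_range
{
public:
    typedef typename thrust::iterator_difference<Iterator>::type difference_type;

    struct stride_functor
    {
        typedef difference_type result_type;   // for older transform_iterator versions
        difference_type stride;

        stride_functor( difference_type stride ) : stride( stride ) {}

        __host__ __device__
        difference_type operator()( const difference_type& i ) const
        {
            return stride * i;
        }
    };

    typedef thrust::counting_iterator<difference_type>                   CountingIterator;
    typedef thrust::transform_iterator<stride_functor, CountingIterator> TransformIterator;
    typedef thrust::permutation_iterator<Iterator, TransformIterator>    PermutationIterator;
    typedef PermutationIterator iterator;

    strided_range( Iterator first , Iterator last , difference_type stride )
        : first( first ) , last( last ) , stride( stride ) {}

    iterator begin() const
    {
        return PermutationIterator( first ,
            TransformIterator( CountingIterator( 0 ) , stride_functor( stride ) ) );
    }

    iterator end() const
    {
        return begin() + ( ( last - first ) + ( stride - 1 ) ) / stride;
    }

private:
    Iterator first;
    Iterator last;
    difference_type stride;
};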