Thrust sort Need some help

Hello,

I would like to sort an a device array of struct (AoS) with thrust…

The key used for sorting is a member of the struct… How can I do this ? I can’t find anything in the doc or tutorials etc External Image

so I have a struct like that

[codebox]struct MyStruct

{

unsigned int myKey;

unsigned int otherStuff;

}[/codebox]

and I’d like to sort an array of these structs using myKey as a key…

Thanks in advance for helping me :)

You need to create a custom compare function that operates on the keys in your structure. Something like this:

template<typename T>

 struct dev_cmp_custom_key : public binary_function<T,T,bool>

{

  /*! Function call operator. The return value is <tt>lhs < rhs</tt>.

   */

  __host__ __device__ bool operator()(const T &lhs, const T &rhs) const

  {

	uint a=lhs.myKey;

	uint b=rhs.myKey;

	return (a < b);

  }

}; // end compare

And then call the sort function like this (where dArray is your array of structs) :

thrust::sort(dArray.begin(), dArray.end(), dev_cmp_custom_key<MyStruct>());

Jeroen already answered your question, but I wanted to point out that your performance will be much better if you separate the array of structures (AoS) into a structure of arrays (SoA) and then use thrust::sort_by_key() to perform a key-value sort.

Thank you everybody…
BTW… I’ve read here and there that SoA is faster than AoS. But does anybody have an explanation for that ? Just curious :P

In most cases, several memory transactions are required to read a struct from global memory into registers (or shared memory). In contrast, when you access an array of primitive types (float, int, etc.) the memory accesses are coalesced and much more efficient. Note that the built-in vector types like float2 and uint4 have some special specifiers (i.e. align) which allows them to be coalesced, but they are somewhat exceptional.

Converting AoS to SoA will often net you a sizable speedup (e.g. 2x or 3x) so I would strongly urge you to do it before considering any other optimizations. I wish I had some benchmarks on hand to illustrate the point, but IMO AoS really is CUDA public enemy #1.

Note that you can still use AoS-style programming with SoA-style data layout with thrust::zip_iterator. The zip_iterator “zips” values from several arrays together into tuples of values. Theres a brief discription of zip_iterator in the Introductory slides and online Tutorial. A complete example is also available.

Okaaay… so I implemented the comparison functor as told:

[codebox]//Custom compare function for sorting

template

struct dev_cmp_custom_key : public thrust::binary_function<T,T,bool>

{

/*! Function call operator. The return value is lhs < rhs.

*/

host device bool operator()(const T &lhs, const T &rhs) const

{

unsigned int a=lhs.mortonCode;

unsigned int b=rhs.mortonCode;

return (a < B);

}

}; // end compare

//[/codebox]

And then:

-First tried using my device_ptr directly for sorting :

[codebox]void LBVH_sort_by_code()

{

bvhnode_t*	rawb	= thrust::raw_pointer_cast(d_BVHNODE);

thrust::sort(d_BVHNODE, d_BVHNODE+(universeElementCount-1), dev_cmp_custom_key<bvhnode_t>());

}[/codebox]

This give me something like 53 errors like this:

[codebox]Error 2 error: no instance of overloaded function “dereference” matches the argument list c:\to52\gpuculler\ext\thrust\thrust\sorting\detail\device\cuda\stable_merge_sort.inl 325

[/codebox]

Then I tried using directly the raw device pointer (memory is okay, since I used the same raw pointer for another kernel, and then copied the data back to host for checking… everything’s ok). it compiles fine. But I get “Unhandled exception at 0x10007467 (gpuCuller-d.dll) in LibTest.exe: 0xC0000005: Access violation reading location 0x01000718.”

Here is the code I’m using:

[codebox]void LBVH_sort_by_code()

{

bvhnode_t*	rawb	= thrust::raw_pointer_cast(d_BVHNODE);

thrust::sort(rawb, rawb+(universeElementCount-1)*sizeof(bvhnode_t), dev_cmp_custom_key<bvhnode_t>());

}[/codebox]

Other infos :

Here is the declaration of bvhnode_t :

[codebox]//BVH Node

typedef struct bvhnode{

unsigned int primIndex;

unsigned int mortonCode;

float centroidX, centroidY, centroidZ;

} bvhnode_t;

//[/codebox]

Well, I have no idea about what I’m doing wrong External Image Any idea ?

Can you show us how d_BVHNODE is defined?

If d_BVHNODE is a device_vector then you’ll want to do

[codebox]

include <thrust/device_vector.h>

include <thrust/sort.h>

include <thrust/gather.h>

include <thrust/sequence.h>

include

include

//BVH Node

typedef struct bvhnode

{

unsigned int primIndex;

unsigned int mortonCode;

float centroidX, centroidY, centroidZ;

} bvhnode_t;

// extract a mortonCode from a bvhnode_t

struct bvhnode_to_mortronCode

{

__host__ __device__

unsigned int operator()(const bvhnode_t& node)

{

    return node.mortonCode;

}

};

void print_nodes(thrust::device_vector<bvhnode_t>& data)

{

for(size_t i = 0; i < data.size(); i++)

{

    bvhnode_t node = data[i];

    std::cout << "primIndex: " << node.primIndex << "  mortonCode: " << node.mortonCode << std::endl;

}

}

int main(void)

{

thrust::device_vector<bvhnode_t> data(10);

// initialize data

for(size_t i = 0; i < data.size(); i++)

{

    bvhnode_t node;

    node.primIndex  = i;

    node.mortonCode = rand() % 100;

    data[i] = node;

}

std::cout << “Before sort” << std::endl;

print_nodes(data);

// strip out the morton codes from each bvhnode

thrust::device_vector<unsigned int> codes(data.size());

thrust::transform(data.begin(), data.end(), codes.begin(), bvhnode_to_mortronCode());

// sort by the mortonCodes

thrust::sort_by_key(codes.begin(), codes.end(), data.begin());

std::cout << “After sort” << std::endl;

print_nodes(data);

return 0;

}

[/codebox]

Here’s some sample output

[codebox]

Before sort

primIndex: 0 mortonCode: 83

primIndex: 1 mortonCode: 86

primIndex: 2 mortonCode: 77

primIndex: 3 mortonCode: 15

primIndex: 4 mortonCode: 93

primIndex: 5 mortonCode: 35

primIndex: 6 mortonCode: 86

primIndex: 7 mortonCode: 92

primIndex: 8 mortonCode: 49

primIndex: 9 mortonCode: 21

After sort

primIndex: 3 mortonCode: 15

primIndex: 9 mortonCode: 21

primIndex: 5 mortonCode: 35

primIndex: 8 mortonCode: 49

primIndex: 2 mortonCode: 77

primIndex: 0 mortonCode: 83

primIndex: 1 mortonCode: 86

primIndex: 6 mortonCode: 86

primIndex: 7 mortonCode: 92

primIndex: 4 mortonCode: 93

[/codebox]

This code won’t be as fast as a pure-SoA approach, but it’s much much better than the AoS method.

Here is the code I used for mem allocation:

d_BVHNODE is a device_ptr

[codebox] //Prepare memory for BVH Nodes

bvhnode_t * bvhnode_raw_ptr; 

cudaMalloc((void **) &bvhnode_raw_ptr, N * sizeof(bvhnode_t));

d_BVHNODE = thrust::device_ptr<bvhnode_t>(bvhnode_raw_ptr);[/codebox]

I kept using AoS scheme for now, but I’ll need to switch to AoS sooner or later. The sooner the better i guess, but i just wanted to try sorting and get some results first… but :P