Thrust sort Need some help

Buanderie · October 20, 2009, 8:25am

Hello,

I would like to sort an a device array of struct (AoS) with thrust…

The key used for sorting is a member of the struct… How can I do this ? I can’t find anything in the doc or tutorials etc External Image

so I have a struct like that

[codebox]struct MyStruct

{

unsigned int myKey;

unsigned int otherStuff;

}[/codebox]

and I’d like to sort an array of these structs using myKey as a key…

Thanks in advance for helping me :)

Jeroen · October 20, 2009, 9:27am

You need to create a custom compare function that operates on the keys in your structure. Something like this:

template<typename T>

 struct dev_cmp_custom_key : public binary_function<T,T,bool>

{

  /*! Function call operator. The return value is <tt>lhs < rhs</tt>.

   */

  __host__ __device__ bool operator()(const T &lhs, const T &rhs) const

  {

	uint a=lhs.myKey;

	uint b=rhs.myKey;

	return (a < b);

  }

}; // end compare

And then call the sort function like this (where dArray is your array of structs) :

thrust::sort(dArray.begin(), dArray.end(), dev_cmp_custom_key<MyStruct>());

nbell · October 20, 2009, 10:34am

Hello,

I would like to sort an a device array of struct (AoS) with thrust…

The key used for sorting is a member of the struct… How can I do this ? I can’t find anything in the doc or tutorials etc External Media

so I have a struct like that

[codebox]struct MyStruct

{
unsigned int myKey;

unsigned int otherStuff;
}[/codebox]

and I’d like to sort an array of these structs using myKey as a key…

Thanks in advance for helping me :)

Jeroen already answered your question, but I wanted to point out that your performance will be much better if you separate the array of structures (AoS) into a structure of arrays (SoA) and then use thrust::sort_by_key() to perform a key-value sort.

Buanderie · October 21, 2009, 7:54am

Thank you everybody…
BTW… I’ve read here and there that SoA is faster than AoS. But does anybody have an explanation for that ? Just curious :P

nbell · October 21, 2009, 12:56pm

In most cases, several memory transactions are required to read a struct from global memory into registers (or shared memory). In contrast, when you access an array of primitive types (float, int, etc.) the memory accesses are coalesced and much more efficient. Note that the built-in vector types like float2 and uint4 have some special specifiers (i.e. align) which allows them to be coalesced, but they are somewhat exceptional.

Converting AoS to SoA will often net you a sizable speedup (e.g. 2x or 3x) so I would strongly urge you to do it before considering any other optimizations. I wish I had some benchmarks on hand to illustrate the point, but IMO AoS really is CUDA public enemy #1.

Note that you can still use AoS-style programming with SoA-style data layout with thrust::zip_iterator. The zip_iterator “zips” values from several arrays together into tuples of values. Theres a brief discription of zip_iterator in the Introductory slides and online Tutorial. A complete example is also available.

Buanderie · October 21, 2009, 8:07pm

Okaaay… so I implemented the comparison functor as told:

[codebox]//Custom compare function for sorting

template

struct dev_cmp_custom_key : public thrust::binary_function<T,T,bool>

{

/*! Function call operator. The return value is lhs < rhs.

*/

host device bool operator()(const T &lhs, const T &rhs) const

{

unsigned int a=lhs.mortonCode;

unsigned int b=rhs.mortonCode;

return (a < B);

}

}; // end compare

//[/codebox]

And then:

-First tried using my device_ptr directly for sorting :

[codebox]void LBVH_sort_by_code()

{

bvhnode_t*	rawb	= thrust::raw_pointer_cast(d_BVHNODE);

thrust::sort(d_BVHNODE, d_BVHNODE+(universeElementCount-1), dev_cmp_custom_key<bvhnode_t>());

}[/codebox]

This give me something like 53 errors like this:

[codebox]Error 2 error: no instance of overloaded function “dereference” matches the argument list c:\to52\gpuculler\ext\thrust\thrust\sorting\detail\device\cuda\stable_merge_sort.inl 325

[/codebox]

Then I tried using directly the raw device pointer (memory is okay, since I used the same raw pointer for another kernel, and then copied the data back to host for checking… everything’s ok). it compiles fine. But I get “Unhandled exception at 0x10007467 (gpuCuller-d.dll) in LibTest.exe: 0xC0000005: Access violation reading location 0x01000718.”

Here is the code I’m using:

[codebox]void LBVH_sort_by_code()

{

bvhnode_t*	rawb	= thrust::raw_pointer_cast(d_BVHNODE);

thrust::sort(rawb, rawb+(universeElementCount-1)*sizeof(bvhnode_t), dev_cmp_custom_key<bvhnode_t>());

}[/codebox]

Other infos :

Here is the declaration of bvhnode_t :

[codebox]//BVH Node

typedef struct bvhnode{

unsigned int primIndex;

unsigned int mortonCode;

float centroidX, centroidY, centroidZ;

} bvhnode_t;

//[/codebox]

Well, I have no idea about what I’m doing wrong External Image Any idea ?

nbell · October 21, 2009, 9:12pm

And then:

-First tried using my device_ptr directly for sorting :

[codebox]void LBVH_sort_by_code()

{
bvhnode_t*	rawb	= thrust::raw_pointer_cast(d_BVHNODE);

thrust::sort(d_BVHNODE, d_BVHNODE+(universeElementCount-1), dev_cmp_custom_key<bvhnode_t>());
}[/codebox]

Can you show us how d_BVHNODE is defined?

If d_BVHNODE is a device_vector then you’ll want to do

[codebox]

include <thrust/device_vector.h>

include <thrust/sort.h>

include <thrust/gather.h>

include <thrust/sequence.h>

include

//BVH Node

typedef struct bvhnode

{

unsigned int primIndex;

unsigned int mortonCode;

float centroidX, centroidY, centroidZ;

} bvhnode_t;

// extract a mortonCode from a bvhnode_t

struct bvhnode_to_mortronCode

{

__host__ __device__

unsigned int operator()(const bvhnode_t& node)

{

    return node.mortonCode;

}

};

void print_nodes(thrust::device_vector<bvhnode_t>& data)

{

for(size_t i = 0; i < data.size(); i++)

{

    bvhnode_t node = data[i];

    std::cout << "primIndex: " << node.primIndex << "  mortonCode: " << node.mortonCode << std::endl;

}

}

int main(void)

{

thrust::device_vector<bvhnode_t> data(10);

// initialize data

for(size_t i = 0; i < data.size(); i++)

{

    bvhnode_t node;

    node.primIndex  = i;

    node.mortonCode = rand() % 100;

    data[i] = node;

}

std::cout << “Before sort” << std::endl;

print_nodes(data);

// strip out the morton codes from each bvhnode

thrust::device_vector<unsigned int> codes(data.size());

thrust::transform(data.begin(), data.end(), codes.begin(), bvhnode_to_mortronCode());

// sort by the mortonCodes

thrust::sort_by_key(codes.begin(), codes.end(), data.begin());

std::cout << “After sort” << std::endl;

print_nodes(data);

return 0;

}

[/codebox]

Here’s some sample output

[codebox]

Before sort

primIndex: 0 mortonCode: 83

primIndex: 1 mortonCode: 86

primIndex: 2 mortonCode: 77

primIndex: 3 mortonCode: 15

primIndex: 4 mortonCode: 93

primIndex: 5 mortonCode: 35

primIndex: 6 mortonCode: 86

primIndex: 7 mortonCode: 92

primIndex: 8 mortonCode: 49

primIndex: 9 mortonCode: 21

After sort

primIndex: 3 mortonCode: 15

primIndex: 9 mortonCode: 21

primIndex: 5 mortonCode: 35

primIndex: 8 mortonCode: 49

primIndex: 2 mortonCode: 77

primIndex: 0 mortonCode: 83

primIndex: 1 mortonCode: 86

primIndex: 6 mortonCode: 86

primIndex: 7 mortonCode: 92

primIndex: 4 mortonCode: 93

[/codebox]

This code won’t be as fast as a pure-SoA approach, but it’s much much better than the AoS method.

Buanderie · October 21, 2009, 9:27pm

Here is the code I used for mem allocation:

d_BVHNODE is a device_ptr

[codebox] //Prepare memory for BVH Nodes

bvhnode_t * bvhnode_raw_ptr; 

cudaMalloc((void **) &bvhnode_raw_ptr, N * sizeof(bvhnode_t));

d_BVHNODE = thrust::device_ptr<bvhnode_t>(bvhnode_raw_ptr);[/codebox]

I kept using AoS scheme for now, but I’ll need to switch to AoS sooner or later. The sooner the better i guess, but i just wanted to try sorting and get some results first… but :P

Topic		Replies	Views
How to efficiently sort 5 arrays of integers? CUDA Programming and Performance	7	1203	June 19, 2015
Partitioning an array of n-tuples CUDA Programming and Performance	8	1929	July 14, 2017
Sorting in CUDA Is sorting in CUDA worth the trouble? CUDA Programming and Performance	15	8129	September 30, 2009
sorting data already in GPU? thrust maybe? CUDA Programming and Performance	5	1054	July 6, 2011
Thrust v1.0 release A high-level C++ template library for CUDA CUDA Programming and Performance	11	16782	May 30, 2009
Need help understanding thrust::scatter! CUDA Programming and Performance	9	3346	May 6, 2015
CUDA 8 - Thrust bug(?) CUDA Programming and Performance	16	4140	October 2, 2016
Is it possible to somehow fuse thrust::partition and thrust::sort? CUDA Programming and Performance	6	1632	September 11, 2016
Performance of memory coalescing in CUDA using Thrust: structure of array vs. array of structure CUDA Programming and Performance	0	775	August 29, 2018
sort a matrix row wise thrust sort is slower on gpu than its cpu counterpart CUDA Programming and Performance	4	1919	February 4, 2012

Thrust sort Need some help

Related topics