sorting 2d array in CUDA


I am a new to CUDA. I’d like to ask about sorting in CUDA.
Is it possible to sort arrays of primitive types in CUDA?

I have an array of an array of floats.


1st array: 1.0 2.0 0.5
2nd array: 1.0 1.1 1.0
3rd array: 3.2 2.0 0.1

I want to sort these arrays increasing in the last floats.


1st array: 3.2 2.0 0.1
2nd array: 1.0 2.0 0.5
3rd array: 1.0 1.1 1.0

Would this be possible in CUDA?
If so which algorithm can I use? Any suggestions?

Hope that makes sense,


Thrust can help you.

please check

Hi LSChien,

Thanks for your reply. It seems that I can make use this function of thrust which uses a comparator to determine the


template<typename RandomAccessKeyIterator,

     typename RandomAccessValueIterator,

     typename StrictWeakOrdering>

void stable_sort_by_key(RandomAccessKeyIterator keys_first,

                      RandomAccessKeyIterator keys_last,

                      RandomAccessValueIterator values_first,

                      StrictWeakOrdering comp);

Although I am using CUDA version 2.3 and it seems the requirement is CUDA 3.0 or higher.

Do you know of other library that perform similar function?

Many Thanks,

You can also create a simple index vector that stores the initial position of these array: Eg. ,

---------- 1st Col----2nd Col-----3rd Col----index_vector
1st array:- 1.0 ------ 2.0 ------- 0.5 ----- 0
2nd array:- 1.0 ------ 1.1 ------- 1.0 ----- 1
3rd array:- 3.2 ------ 2.0 ------- 0.1 ----- 2

Now, your 3rd cols and the index_vector forms a key-value pair that you can sort based on key values using CUDA SDK Radix sort example code. Finally

You will have:

3rd Col index_vector
0.1 -------- 2
0.5 -------- 0
1.0 -------- 1

In this, the index_vector gives you the final positions of the array’s 1, 2 and 3, This index_vector can be used in the subsequent code, inplace of disturbing the inital 2D array ( if used ) for storing 1st, 2nd and 3rd arrays.

Thanks! that worked!

I would also like to ask an unrelated question.

Say I have a stack that is allocated on the device… I want to perform the simple calculation below, sequential or otherwise.

d_ehvStack[iteration] = 1;

for (int i = 0; i < n; i++) {

  d_ehvStack[iteration] *= d_frontsArray[i];


This operation is not allowed in the host because the device memory can’t be accessed.

How do I perform this simple operation without copying back the array to host since this copy back to host is quite expensive.

Would I have to create a kernel function for each such operation that only require one thread? This seems tedious.

Thanks in Advance!

Take a look at the reduction sample in SDK. That does a add , reduce but your case seems to be a multiply reduce but can be done by little plumbing! :)