Sorting a matrix row-wise: thrust sort is slower on the GPU than its CPU counterpart

hajisaib · February 3, 2012, 11:59pm

For an array of 100,000 items, thrust::sort_by_key on the GPU is 4 times faster than on the CPU.
But when I sort a matrix row-wise, the GPU becomes much slower than the CPU.

Maybe it is because there are too many calls to the GPU: one sort call per row of the matrix, i.e. sqrt(num_items) calls in total.
I tried std::sort_by_key, but it doesn't work; the compiler says that namespace "std" has no member "sort_by_key".

the program is here:

#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>

int myRand()
{
    return rand() % 1000;
}

int main(int argc, char** argv)
{
    if (argc != 3) {
        printf("format: ./a.out [gpu = 0, cpu = 1] num_items\n");
        exit(0);
    }

    // Parse the arguments first; comparing argv[1] with 0 would compare a pointer.
    int select = atoi(argv[1]);
    int N = atoi(argv[2]);

    if (select == 0) {
        printf("gpu is selected\n");
    }
    else {
        printf("cpu is selected\n");
    }

    thrust::host_vector<int> keys(N);
    thrust::host_vector<int> values(N);
    thrust::generate(keys.begin(), keys.end(), myRand);

    // The N items are treated as a square matrix: cols rows of cols items each.
    int cols = (int)sqrt((double)N);

    // The values of each row are the column indices 0..cols-1.
    for (int i = 0; i < cols; i++) {
        thrust::sequence(values.begin()+i*cols, values.begin()+(i+1)*cols);
    }
    thrust::device_vector<int> d_keys = keys;
    thrust::device_vector<int> d_values = values;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float elapsedTime;

    if (select == 0) {
        // GPU: one sort_by_key call per row, on device memory
        cudaEventRecord(start, 0);
        for (int i = 0; i < cols; i++) {
            thrust::sort_by_key(d_keys.begin()+i*cols, d_keys.begin()+(i+1)*cols, d_values.begin()+i*cols);
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsedTime, start, stop);
        printf("gpu-time = %f msec\n", elapsedTime);

        // Copy the sorted result back; done only here so the CPU result is not overwritten.
        thrust::copy(d_keys.begin(), d_keys.end(), keys.begin());
        thrust::copy(d_values.begin(), d_values.end(), values.begin());
    }
    else {
        // CPU: the same per-row loop, on host memory
        cudaEventRecord(start, 0);
        for (int i = 0; i < cols; i++) {
            thrust::sort_by_key(keys.begin()+i*cols, keys.begin()+(i+1)*cols, values.begin()+i*cols);
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsedTime, start, stop);
        printf("cpu-time = %f msec\n", elapsedTime);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Can someone suggest how to get thrust::sort_by_key to run faster than its CPU counterpart when sorting row-wise?
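
If the per-row calls are indeed the bottleneck, one commonly used idea is to sort the whole matrix with two stable sorts instead of one sort per row: first by key across all elements, then by row index. The sketch below shows only the logic, in standard C++ on the CPU. The helper name sort_rows_batched is hypothetical, and mapping the two passes onto thrust::stable_sort_by_key on the GPU is an assumption that has not been timed here.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of Thrust): sorts every row of a row-major
// matrix by key, carrying the values along, using two whole-array stable
// sorts instead of one sort call per row.
void sort_rows_batched(std::vector<int>& keys, std::vector<int>& values, std::size_t cols)
{
    const std::size_t n = keys.size();
    std::vector<std::size_t> perm(n);
    std::iota(perm.begin(), perm.end(), 0);  // perm[i] = i

    // Pass 1: order all elements by key, ignoring row boundaries.
    std::stable_sort(perm.begin(), perm.end(),
                     [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

    // Pass 2: stable sort by row index. Rows become contiguous again, and
    // stability keeps the key order from pass 1 inside each row.
    std::stable_sort(perm.begin(), perm.end(),
                     [&](std::size_t a, std::size_t b) { return a / cols < b / cols; });

    // Gather keys and values through the final permutation.
    std::vector<int> sorted_keys(n), sorted_values(n);
    for (std::size_t i = 0; i < n; ++i) {
        sorted_keys[i] = keys[perm[i]];
        sorted_values[i] = values[perm[i]];
    }
    keys.swap(sorted_keys);
    values.swap(sorted_values);
}
```

Calling sort_rows_batched(keys, values, cols) on the host vectors' contents would replace the whole per-row loop; whether the GPU equivalent actually beats the CPU for a given N would still have to be measured.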

short · February 4, 2012, 3:54am

Taking transpose and sorting might give better performance then?

I just wanted to point you to ArrayFire (a free GPU library) as well. These basic operations can be performed in just one line using ArrayFire. Here is the code:

array A = randu(3,3); // Generate random numbers on GPU
array B = sort(A,0);  // Sort along dimension 0 (each column is sorted)
array C = sort(A,1);  // Sort along dimension 1 (each row is sorted)

Output:

A =
        0.7402     0.9690     0.6673
        0.9210     0.9251     0.1099
        0.0390     0.4464     0.4702

B =
        0.0390     0.4464     0.1099
        0.7402     0.9251     0.4702
        0.9210     0.9690     0.6673

C =
        0.6673     0.7402     0.9690
        0.1099     0.9210     0.9251
        0.0390     0.4464     0.4702

hajisaib · February 4, 2012, 6:01pm

Thank you so much. It is helpful, but what about the sorted indices? How can I get them? Any ideas?

short · February 4, 2012, 8:22pm

In ArrayFire, this is really easy.

array data = randu(3,4);
array sorted, idx;
sort(sorted,idx, data);

// Print output
print(sorted);
print(idx);
print(data);

Output:

data =
        0.7402     0.9690     0.6673     0.5132
        0.9210     0.9251     0.1099     0.7762
        0.0390     0.4464     0.4702     0.2948

sorted =
        0.0390     0.4464     0.1099     0.2948
        0.7402     0.9251     0.4702     0.5132
        0.9210     0.9690     0.6673     0.7762

idx =
        2.0000     2.0000     1.0000     2.0000
        0.0000     1.0000     2.0000     0.0000
        1.0000     0.0000     0.0000     1.0000

hajisaib · February 4, 2012, 9:35pm

How can I come up with the corresponding CPU sorting code?

By the way, I know how to time the code using timer::tic() and timer::toc().

Thanks man!!
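
One possible CPU counterpart, sketched in standard C++: for each column, sort an array of row indices with a comparator that looks up the data, then read off both the sorted values and the indices. The helper name sort_columns_with_indices and the row-major storage are assumptions made for this illustration; they are not part of ArrayFire or Thrust.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: for a row-major rows x cols matrix, sorts each column
// in ascending order and records, for every sorted element, the row it came from.
void sort_columns_with_indices(const std::vector<float>& data,
                               std::size_t rows, std::size_t cols,
                               std::vector<float>& sorted,
                               std::vector<std::size_t>& idx)
{
    sorted.assign(rows * cols, 0.0f);
    idx.assign(rows * cols, 0);
    std::vector<std::size_t> order(rows);

    for (std::size_t c = 0; c < cols; ++c) {
        // Start from the identity order 0..rows-1 and sort it by the column's values.
        std::iota(order.begin(), order.end(), 0);
        std::stable_sort(order.begin(), order.end(),
                         [&](std::size_t a, std::size_t b) {
                             return data[a * cols + c] < data[b * cols + c];
                         });
        // Write the sorted column and the original row index of each element.
        for (std::size_t r = 0; r < rows; ++r) {
            sorted[r * cols + c] = data[order[r] * cols + c];
            idx[r * cols + c] = order[r];
        }
    }
}
```

Applied to the 3x4 data matrix shown earlier in the thread, this produces the same sorted and idx matrices as printed there. Sorting row-wise instead would only require swapping the roles of rows and columns in the index arithmetic.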
