Sorting a matrix row-wise: thrust sort is slower on the GPU than its CPU counterpart

For an array of 100,000 items, thrust::sort_by_key is 4 times faster on the GPU than on the CPU.
But when I sort a matrix row-wise, the GPU becomes much slower than the CPU.

Maybe it is because there are too many calls to the GPU: one sort call for each row of the matrix, i.e. sqrt(num_items) calls in total.
I tried std::sort_by_key, but it doesn't compile; the compiler says that namespace "std" has no member "sort_by_key".
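
The C++ standard library indeed has no sort_by_key. A minimal sketch of a CPU stand-in is shown below, assuming int keys and int values as in the question. The helper name sort_by_key_cpu is hypothetical (it is not part of std or Thrust); it zips keys and values into pairs, sorts the pairs by key with std::stable_sort, and writes them back.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of std or Thrust): sorts keys[first, last)
// in ascending order and applies the same permutation to values[first, last).
void sort_by_key_cpu(std::vector<int>& keys, std::vector<int>& values,
                     std::size_t first, std::size_t last)
{
    assert(first <= last && last <= keys.size() && last <= values.size());

    // Zip the selected range into (key, value) pairs.
    std::vector<std::pair<int, int>> zipped;
    zipped.reserve(last - first);
    for (std::size_t i = first; i < last; ++i) {
        zipped.emplace_back(keys[i], values[i]);
    }

    // Compare keys only; stable_sort keeps equal keys in their original order.
    std::stable_sort(zipped.begin(), zipped.end(),
                     [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                         return a.first < b.first;
                     });

    // Unzip the sorted pairs back into the two arrays.
    for (std::size_t i = first; i < last; ++i) {
        keys[i] = zipped[i - first].first;
        values[i] = zipped[i - first].second;
    }
}
```

For a matrix stored row by row with cols items per row, row i would be sorted with sort_by_key_cpu(keys, values, i * cols, (i + 1) * cols).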

My Thrust program is here:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/generate.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>

int myRand()
{
    return rand() % 1000;
}

int main(int argc, char** argv)
{
    if (argc != 3) {
        printf("format: ./a.out [gpu = 0, cpu = 1] num_items\n");
        exit(0);
    }

    // Parse the arguments first: argv[1] is a char*, so comparing it
    // with 0 would only test for a null pointer.
    int select = atoi(argv[1]);
    int N = atoi(argv[2]);

    if (select == 0) {
        printf("gpu is selected\n");
    }
    else {
        printf("cpu is selected\n");
    }

    // Element type int is assumed here (myRand returns int).
    thrust::host_vector<int> keys(N);
    thrust::host_vector<int> values(N);
    thrust::generate(keys.begin(), keys.end(), myRand);

    // The N items are treated as a cols x cols matrix stored row by row.
    int cols = (int)sqrt((double)N);

    // The values of each row are the column indices 0 .. cols-1.
    for (int i = 0; i < cols; i++) {
        thrust::sequence(values.begin() + i * cols, values.begin() + (i + 1) * cols);
    }
    thrust::device_vector<int> d_keys = keys;
    thrust::device_vector<int> d_values = values;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float elapsedTime;

    if (select == 0) {
        cudaEventRecord(start, 0);
        // GPU: one sort_by_key call per row
        for (int i = 0; i < cols; i++) {
            thrust::sort_by_key(d_keys.begin() + i * cols, d_keys.begin() + (i + 1) * cols, d_values.begin() + i * cols);
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsedTime, start, stop);
        printf("gpu-time = %f msec\n", elapsedTime);

        // Copy the sorted rows back to the host (GPU run only, so that a
        // CPU run does not overwrite its own sorted result).
        thrust::copy(d_keys.begin(), d_keys.end(), keys.begin());
        thrust::copy(d_values.begin(), d_values.end(), values.begin());
    }
    else {
        cudaEventRecord(start, 0);
        // CPU: the same loop on the host vectors
        for (int i = 0; i < cols; i++) {
            thrust::sort_by_key(keys.begin() + i * cols, keys.begin() + (i + 1) * cols, values.begin() + i * cols);
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsedTime, start, stop);
        printf("cpu-time = %f msec\n", elapsedTime);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Can someone suggest a remedy, i.e. how to get thrust::sort_by_key working faster than its CPU counterpart?

Would taking the transpose and then sorting give better performance?
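
One idea that is sometimes suggested for this kind of problem (an assumption here, not something confirmed in this thread) is to replace the per-row loop with two stable sorts over the whole array: first by key, then by row index. The sketch below only illustrates the logic on the CPU with std::stable_sort; the helper name sort_rows_in_two_passes is hypothetical. On the GPU, thrust::stable_sort_by_key could play the same role, but whether that actually beats the CPU would have to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sorts every row of a matrix stored row by row
// (rows of length cols) with two whole-array stable sorts instead of
// one sort call per row. values follows the permutation of keys.
void sort_rows_in_two_passes(std::vector<int>& keys, std::vector<int>& values,
                             std::size_t cols)
{
    assert(cols > 0 && keys.size() == values.size() && keys.size() % cols == 0);
    const std::size_t n = keys.size();

    // perm[i] is the original position of the element that ends up at i.
    std::vector<std::size_t> perm(n);
    std::iota(perm.begin(), perm.end(), std::size_t{0});

    // Pass 1: order all elements by key, ignoring which row they belong to.
    std::stable_sort(perm.begin(), perm.end(),
                     [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

    // Pass 2: stable sort by row index (original position / cols). Stability
    // preserves the key order from pass 1 inside each row.
    std::stable_sort(perm.begin(), perm.end(),
                     [cols](std::size_t a, std::size_t b) { return a / cols < b / cols; });

    // Apply the permutation to both arrays.
    std::vector<int> sortedKeys(n), sortedValues(n);
    for (std::size_t i = 0; i < n; ++i) {
        sortedKeys[i] = keys[perm[i]];
        sortedValues[i] = values[perm[i]];
    }
    keys.swap(sortedKeys);
    values.swap(sortedValues);
}
```

Because every row has exactly cols elements, the elements of row r land in positions r * cols to (r + 1) * cols - 1 after the second pass, so the matrix layout is unchanged.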

I just wanted to point you to ArrayFire (a free GPU library) as well. Each of these basic operations can be performed in just one line using ArrayFire. Here is the code:

array A = randu(3,3); // Generate random numbers on the GPU

array B = sort(A,0);  // Sort along dimension 0 (each column is sorted)

array C = sort(A,1);  // Sort along dimension 1 (each row is sorted)

Output:

A =

        0.7402     0.9690     0.6673

        0.9210     0.9251     0.1099

        0.0390     0.4464     0.4702

B =

        0.0390     0.4464     0.1099

        0.7402     0.9251     0.4702

        0.9210     0.9690     0.6673

C =

        0.6673     0.7402     0.9690

        0.1099     0.9210     0.9251

        0.0390     0.4464     0.4702

Thank you so much. It is helpful, but how about the sorted indices? How do I get them? Any idea?

In ArrayFire, this is really easy.

array data = randu(3,4);

array sorted, idx;

sort(sorted,idx, data);

// Print output

print(sorted);

print(idx);

print(data);

Output:

data =

        0.7402     0.9690     0.6673     0.5132

        0.9210     0.9251     0.1099     0.7762

        0.0390     0.4464     0.4702     0.2948

sorted =

        0.0390     0.4464     0.1099     0.2948

        0.7402     0.9251     0.4702     0.5132

        0.9210     0.9690     0.6673     0.7762

idx =

        2.0000     2.0000     1.0000     2.0000

        0.0000     1.0000     2.0000     0.0000

        1.0000     0.0000     0.0000     1.0000

How can I come up with the corresponding CPU sorting code?
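
A minimal CPU sketch of the same operation (sorted values plus the original row index of each value, column by column) could look like the following. It assumes the matrix is stored column by column, as ArrayFire does, and the helper name sort_columns_with_indices is hypothetical. The idea is to sort a list of row indices by the values they point to, then gather.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sorts each column of a rows x cols matrix stored in
// column-major order. sorted receives the sorted values and idx receives,
// for each sorted value, its original row index within the column.
void sort_columns_with_indices(const std::vector<float>& data,
                               std::size_t rows, std::size_t cols,
                               std::vector<float>& sorted,
                               std::vector<unsigned>& idx)
{
    assert(data.size() == rows * cols);
    sorted.resize(data.size());
    idx.resize(data.size());

    std::vector<unsigned> order(rows);
    for (std::size_t c = 0; c < cols; ++c) {
        const float* col = data.data() + c * rows;

        // Start from 0, 1, ..., rows-1 and sort the indices by column value.
        std::iota(order.begin(), order.end(), 0u);
        std::stable_sort(order.begin(), order.end(),
                         [col](unsigned a, unsigned b) { return col[a] < col[b]; });

        // Gather the values in sorted order and record where they came from.
        for (std::size_t r = 0; r < rows; ++r) {
            sorted[c * rows + r] = col[order[r]];
            idx[c * rows + r] = order[r];
        }
    }
}
```

Fed with the 3 x 4 data matrix printed above (in column-major order), this gives the row indices 2, 0, 1 for the first column, which agrees with the first column of idx in the ArrayFire output.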

By the way, I know how to time the code using timer::tic() and timer::toc().

Thanks man!!