[HELP] Why is Thrust so slow?

Hello everyone,
I am new to CUDA and I have a project to work on. The project is to find the k nearest neighbours of every object, but Thrust is too slow at sorting the neighbours and I don’t know why. There might be something wrong with my code. Could you please help me? By “too slow” I mean that sorting on my CPU is faster than sorting with Thrust. I have a GeForce GT 740.

Here’s the launch. I launch the kernel once per object j to find the kNN of object j:

for (int j = 0; j < numberOfObjects; j++){
	kNNQueryInGPU<<<blocksPerGrid, threadsPerBlock>>>(x_d, y_d, list1_d, list2_d, j);
	cudaMemcpy(&list1, list1_d, sizeof(int) * numberOfObjects, cudaMemcpyDeviceToHost);
	cudaMemcpy(&list2, list2_d, sizeof(int) * numberOfObjects, cudaMemcpyDeviceToHost);

	thrust::sort_by_key(list1, list1 + numberOfObjects, list2);
	getKNNeighbour(list1, j);
}

Here’s the kernel:

__global__ void kNNQueryInGPU(int *x, int *y, int *list1, int* list2, int index){
	int tid = threadIdx.x + blockDim.x * blockIdx.x;
	if (tid < numberOfObjects){
		// distance from object `index` to object `tid` (truncated to int on store)
		list1[tid] = sqrt(pow((double)(x[index] - x[tid]), 2) + pow((double)(y[index] - y[tid]), 2));
		// remember which object this distance belongs to
		list2[tid] = tid;
	} // end of bounds check
}

You appear to be sorting the data on the host.

Thrust will probably do the sort quicker if you sort the data on the device:

for (int j = 0; j < numberOfObjects; j++){
	kNNQueryInGPU<<<blocksPerGrid, threadsPerBlock>>>(x_d, y_d, list1_d, list2_d, j);
	thrust::sort_by_key(thrust::device, list1_d, list1_d + numberOfObjects, list2_d);
	cudaMemcpy(&list1, list1_d, sizeof(int) * numberOfObjects, cudaMemcpyDeviceToHost);
	cudaMemcpy(&list2, list2_d, sizeof(int) * numberOfObjects, cudaMemcpyDeviceToHost);

	getKNNeighbour(list1, j);
}

Thanks for your reply. I’m now getting a new error, “no instance of overloaded function matches the argument list”; the first argument is the one that doesn’t match.

You’ll need to add:

#include <thrust/execution_policy.h>

to your code.
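For anyone hitting the same error: thrust::device is declared in that header, and with raw pointers from cudaMalloc the policy argument is what tells Thrust to run the sort on the GPU. Here is a minimal sketch of the pattern. The helper name sortPairsOnDevice and the five sample values are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>  // declares thrust::device

// Hypothetical helper: sorts n keys, and reorders their values to match,
// in memory that already lives on the device.
void sortPairsOnDevice(int *keys_d, int *vals_d, int n) {
    thrust::sort_by_key(thrust::device, keys_d, keys_d + n, vals_d);
}

int main() {
    const int n = 5;
    int keys[n] = {40, 10, 30, 20, 50};  // stand-ins for distances
    int vals[n] = {0, 1, 2, 3, 4};       // stand-ins for object indices

    int *keys_d, *vals_d;
    cudaMalloc(&keys_d, n * sizeof(int));
    cudaMalloc(&vals_d, n * sizeof(int));
    cudaMemcpy(keys_d, keys, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(vals_d, vals, n * sizeof(int), cudaMemcpyHostToDevice);

    sortPairsOnDevice(keys_d, vals_d, n);

    // Copy the sorted pairs back only after the device-side sort.
    cudaMemcpy(keys, keys_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(vals, vals_d, n * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++) printf("%d:%d ", keys[i], vals[i]);
    printf("\n");  // prints: 10:1 20:3 30:2 40:0 50:4

    cudaFree(keys_d);
    cudaFree(vals_d);
    return 0;
}
```

Build it with something like nvcc -O3 -o sortdemo sortdemo.cu (the file name is arbitrary).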

thank you a lot, I’ll try it. :)

[s]It is still too slow… :(

I’m running it with 20,000 objects only…

it’s taking over 5 mins[/s]

I am sorry, there was an exception I didn’t notice xD

With 10,000 objects:
CPU time: 13 sec
GPU time: 40 sec

That’s so slow.
What’s the problem? :(


It’s faster when I solve the problem on the GPU and do the sort on the CPU with merge sort…

Are you building a debug project? Is the -G switch specified on your compile command line? If so, try building a release project without the -G switch.
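If you are not sure which way the project is being built, one quick check is to let the program report it. As far as I know, nvcc defines the macro __CUDACC_DEBUG__ when -G is given, so a rough sketch like the following should tell you (the function name builtWithDeviceDebug is just something made up here):

```cuda
#include <cstdio>

// Hypothetical helper: reports whether this file was compiled with
// device debug enabled (nvcc -G), based on the __CUDACC_DEBUG__ macro.
bool builtWithDeviceDebug() {
#ifdef __CUDACC_DEBUG__
    return true;
#else
    return false;
#endif
}

int main() {
    if (builtWithDeviceDebug())
        printf("device debug build (-G): device code is not optimized\n");
    else
        printf("no -G: device code is optimized\n");
    return 0;
}
```

In Visual Studio, the Debug configuration of a CUDA project normally passes -G, so the Release configuration should print the second message.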

I built a complete program around the code you have shown, and it took about 6 seconds. But if I compile with -G, it takes much longer:

$ cat t751.cu
#include <stdio.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <math.h>

  const int numberOfObjects = 10000;
__global__ void kNNQueryInGPU(int *x, int *y, int *list1, int* list2, int index){
        int tid = threadIdx.x + blockDim.x * blockIdx.x;
        if (tid < numberOfObjects){

                list1[tid] = sqrt(pow((double)(x[index] - x[tid]), 2) + pow((double)(y[index] - y[tid]), 2));
                list2[tid] = threadIdx.x + blockDim.x * blockIdx.x;
        } // end of bounds check
}

int getKNNeighbour(int *data, int index){
  return data[index];
}

int main(){

  const int threadsPerBlock = 256;
  const int blocksPerGrid = (numberOfObjects+threadsPerBlock-1)/threadsPerBlock;
  int *x_d, *y_d, *list1_d, *list2_d, *list1, *list2;
  size_t msize = numberOfObjects*sizeof(int);
  list1 = (int *)malloc(msize);  // msize is already the full buffer size in bytes
  list2 = (int *)malloc(msize);
  cudaMalloc(&x_d, msize);
  cudaMalloc(&y_d, msize);
  cudaMalloc(&list1_d, msize);
  cudaMalloc(&list2_d, msize);
  for (int j = 0; j < numberOfObjects; j++){
    kNNQueryInGPU << <blocksPerGrid, threadsPerBlock >> >(x_d, y_d, list1_d, list2_d, j);
    thrust::sort_by_key(thrust::device, list1_d, list1_d + numberOfObjects, list2_d);
    cudaMemcpy(list1, list1_d, sizeof(int) * numberOfObjects, cudaMemcpyDeviceToHost);
    cudaMemcpy(list2, list2_d, sizeof(int) * numberOfObjects, cudaMemcpyDeviceToHost);
    getKNNeighbour(list1, j);
    }
  return 0;
}
$ nvcc -O3 -o t751 t751.cu
$ time ./t751

real    0m6.205s
user    0m2.845s
sys     0m3.190s
$ nvcc -G -o t751 t751.cu
$ time ./t751

real    2m20.484s
user    1m21.750s
sys     0m58.487s
$

If you want help, it’s going to be more productive if you create a simple, complete test case, along the lines of what I am showing here.

It was in debug mode. I’ve switched it to release mode and it is running fast…

thank you a lot for helping me :)