Question about sorting many vectors with bitonic sort in CUDA

Hello,

I ran into a weird problem while sorting.

I am trying to sort multiple vectors (hundreds or even more), each of length 65536. Each vector is sorted together with the response vector, and after each sort I do some further processing on the result. Here is the code I use:

    // Device copy of the response vector.
    int *re;
    CUDA_SAFE_CALL(cudaMalloc((void**) &re, N*sizeof(int)));
    CUDA_SAFE_CALL(cudaMemcpy(re, response, N*sizeof(int), cudaMemcpyHostToDevice));

    // All var vectors, N ints each.
    int *x_cuda;
    CUDA_SAFE_CALL(cudaMalloc((void**) &x_cuda, N*var*sizeof(int)));
    CUDA_SAFE_CALL(cudaMemcpy(x_cuda, x, N*var*sizeof(int), cudaMemcpyHostToDevice));

    // Work buffer of N ints used inside the loop.
    int *xx_cuda;
    CUDA_SAFE_CALL(cudaMalloc((void**) &xx_cuda, N*sizeof(int)));

    for (int i = 0; i < var; i++) {
        // Set up xx_cuda for vector i.
        initialize_x<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(xx_cuda, x_cuda, N, i);
        // Reload the unsorted response before each sort.
        CUDA_SAFE_CALL(cudaMemcpy(re, response, N*sizeof(int), cudaMemcpyHostToDevice));

        // Bitonic sort over xx_cuda and re: one kernel launch per (k, j) step.
        for (int k = 2; k <= N; k <<= 1) {
            for (int j = k >> 1; j > 0; j = j >> 1) {
                BitonicSort1<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(xx_cuda, re, j, k);
            }
        }

        ...............................

    }
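
The BitonicSort1 kernel itself is not shown above. For readers who want something concrete to compare against, here is a minimal sketch of what a key/value bitonic step kernel driven by the same (k, j) loops could look like. It is an assumption, not the actual kernel from this post: the names bitonic_step and sort_pairs_on_device are hypothetical, BLOCK_SIZE is assumed to be 256, and n is assumed to be a power of two and a multiple of BLOCK_SIZE.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed value; the post does not state it

// Hypothetical stand-in for BitonicSort1: one compare-exchange step of a
// bitonic sorting network. keys are sorted ascending; vals follow the keys.
__global__ void bitonic_step(int *keys, int *vals, int j, int k)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int partner = i ^ (unsigned int)j;

    // Only the lower index of each pair does the work.
    if (partner > i) {
        bool ascending = ((i & (unsigned int)k) == 0);
        int a = keys[i];
        int b = keys[partner];
        bool out_of_order = ascending ? (a > b) : (a < b);
        if (out_of_order) {
            keys[i] = b;
            keys[partner] = a;
            int t = vals[i];
            vals[i] = vals[partner];
            vals[partner] = t;
        }
    }
}

// Hypothetical host helper: sorts n key/value pairs in place on the host
// arrays. Returns the number of kernel launches issued, or -1 on a CUDA error.
int sort_pairs_on_device(int *h_keys, int *h_vals, int n)
{
    int *d_keys = NULL;
    int *d_vals = NULL;
    size_t bytes = (size_t)n * sizeof(int);

    if (cudaMalloc((void**)&d_keys, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d_vals, bytes) != cudaSuccess) {
        cudaFree(d_keys);
        return -1;
    }
    cudaMemcpy(d_keys, h_keys, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, h_vals, bytes, cudaMemcpyHostToDevice);

    int launches = 0;
    for (int k = 2; k <= n; k <<= 1) {
        for (int j = k >> 1; j > 0; j >>= 1) {
            bitonic_step<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_keys, d_vals, j, k);
            ++launches;
        }
    }

    // These copies run after all queued kernels on the default stream.
    cudaError_t err = cudaMemcpy(h_keys, d_keys, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_vals, d_vals, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_keys);
    cudaFree(d_vals);
    return (err == cudaSuccess) ? launches : -1;
}
```

With this loop structure, a vector of length 2^m takes m(m+1)/2 kernel launches, so N = 65536 (m = 16) means 136 sort launches per vector, on top of the initialize_x launch and the cudaMemcpy in each iteration of the outer loop.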

The problem is that when the number of vectors is less than 8, the GPU version is significantly faster than the CPU algorithm, taking about 1/5 of the run time. With more than 8 vectors it becomes really slow, much slower than the CPU algorithm.

Any idea why this is happening? Any help is appreciated.