Implementing a fast sorting algorithm -- how?

Hey,

I am using a Tesla M2075 on gentoo linux with nvidia-cuda-sdk-4.1 and nvidia-drivers-295.49.

I am kind of new to this and am trying to implement a fast sorting algorithm, as outlined on this page:

http://www.bealto.com/gpu-sorting_parallel-bitonic-2.html

I, however, cannot get to the performance, > 100m keys per second. I only get about 15-20M keys p/s.

I used NVVP and tried to do loop unrolling for avoiding replay overhead but it didn’t change the performance.

Here is a screenshot of NVVP with timeline and details graph: http://i.imgur.com/ldijF.png

Here is the cl source code of the kernels I am using: sorting kernels - Pastebin.com

UPDATE: How the kernels are started (code is from bealto implementation): bool sort(Context * c,int targetDevice,int n,cl_mem in,cl_mem out) const - Pastebin.com

UPDATE 2: The view from NVVP displays one sort using multiple specialized forms of a bitonic merge sorts

UPDATE 3: Details for the longest running kernel invocations: http://i.imgur.com/GyYaG.png
Best wishes

boundogre

Bonus question: NVVP says in one of the analysis tabs that my multiprocessor usage/occupancy is low because I am using too many registers. Does anyone have an idea how to reduce the register usage in my algorithms? I read in the nvidia performance guide that it has to do with the amount of variables used but I am not sure how I could reduce this…

I have now tweaked the code for a while. It turns out that when I replace the type of my data typedef struct { uint64 f[FIELDS_PER_RECORD]; } record_t; '' with just typedef ulong record_t;‘’ the performance of my sorter goes up to 70Mkeys/s. That’s quite an improvement.