Implementing a fast sorting algorithm -- how?

Hey,

I am using a Tesla M2075 on gentoo linux with nvidia-cuda-sdk-4.1 and nvidia-drivers-295.49.

I am kind of new to this and am trying to implement a fast sorting algorithm, as outlined on this page:

http://www.bealto.com/gpu-sorting_parallel-bitonic-2.html
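For anyone unfamiliar with the algorithm: bitonic sort is a fixed network of compare-exchange steps, which is why it maps well to the GPU. A minimal CPU reference version (plain C sketch, not the OpenCL kernel from the link; `n` must be a power of two) looks like this:

```c
#include <stddef.h>

/* Compare-exchange: order a[i], a[j] ascending or descending. */
static void cmp_swap(unsigned long *a, size_t i, size_t j, int ascending)
{
    if ((a[i] > a[j]) == (ascending != 0)) {
        unsigned long t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}

/* In-place iterative bitonic sort; n must be a power of two. */
void bitonic_sort(unsigned long *a, size_t n)
{
    for (size_t k = 2; k <= n; k <<= 1) {          /* size of sorted runs */
        for (size_t j = k >> 1; j > 0; j >>= 1) {  /* compare distance */
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner > i) {
                    /* sort direction alternates per k-sized block */
                    int asc = ((i & k) == 0);
                    cmp_swap(a, i, partner, asc);
                }
            }
        }
    }
}
```

On the GPU the innermost loop over `i` becomes the work-items, and each (k, j) pair is one kernel launch (or several steps fused into one kernel, as in the bealto variants).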

However, I cannot reach the performance quoted there (> 100M keys per second); I only get about 15-20M keys/s.

I used NVVP and tried loop unrolling to avoid replay overhead, but it didn't change the performance.

Here is a screenshot of NVVP with timeline and details graph: http://i.imgur.com/ldijF.png

Here is the cl source code of the kernels I am using: http://pastebin.com/uVtyin74

UPDATE: How the kernels are started (code is from bealto implementation): http://pastebin.com/Sx4Stfth

UPDATE 2: The NVVP view shows one sort, which uses multiple specialized forms of a bitonic merge sort.

UPDATE 3: Details for the longest running kernel invocations: http://i.imgur.com/GyYaG.png

Best wishes

boundogre

Bonus question: NVVP says in one of the analysis tabs that my multiprocessor usage/occupancy is low because I am using too many registers. Does anyone have an idea how to reduce the register usage in my kernels? I read in the NVIDIA performance guide that it relates to the number of variables used, but I am not sure how I could reduce that.
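One source-level trick that sometimes helps with register pressure is shrinking the live ranges of temporaries: recompute cheap expressions instead of caching them in named variables, and declare variables in the innermost scope that uses them. A toy C illustration of the transformation (both functions compute the same sum; whether it actually lowers the register count depends on the compiler):

```c
#include <stddef.h>

/* "Cached" version: i0..i3 are all live across the loop body at once. */
unsigned long sum_cached(const unsigned long *a, size_t n, size_t stride)
{
    unsigned long s = 0;
    for (size_t i = 0; i + 3 * stride < n; i += 4 * stride) {
        size_t i0 = i, i1 = i + stride, i2 = i + 2 * stride, i3 = i + 3 * stride;
        s += a[i0] + a[i1] + a[i2] + a[i3];
    }
    return s;
}

/* "Recomputed" version: each index is derived on the fly, so fewer
 * values need to be held simultaneously. Same result either way. */
unsigned long sum_recomputed(const unsigned long *a, size_t n, size_t stride)
{
    unsigned long s = 0;
    for (size_t i = 0; i + 3 * stride < n; i += 4 * stride) {
        s += a[i];
        s += a[i + stride];
        s += a[i + 2 * stride];
        s += a[i + 3 * stride];
    }
    return s;
}
```

In OpenCL you can also give the compiler a `__attribute__((reqd_work_group_size(X, Y, Z)))` hint on the kernel, which can let it allocate registers more aggressively for that exact work-group size.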

I have now tweaked the code for a while. It turns out that when I replace my data type "typedef struct { uint64 f[FIELDS_PER_RECORD]; } record_t;" with just "typedef ulong record_t;", the performance of my sorter goes up to 70 Mkeys/s. That's quite an improvement.
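That makes sense: with the struct, every compare-exchange moves FIELDS_PER_RECORD 64-bit words and keeps them all in registers, while the plain ulong version moves a single word. If the full records are still needed, one option is a key-index sort: pack the key and the record index into one ulong, sort only those, then gather the records once at the end. A CPU sketch of the idea (the 32-bit key in f[0] and the value of FIELDS_PER_RECORD are assumptions for illustration; qsort stands in for the GPU bitonic sort):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FIELDS_PER_RECORD 4  /* assumption: mirrors the constant in the post */

typedef struct { uint64_t f[FIELDS_PER_RECORD]; } record_t;

static int cmp_u64(const void *x, const void *y)
{
    uint64_t a = *(const uint64_t *)x, b = *(const uint64_t *)y;
    return (a > b) - (a < b);
}

/* Sort records by a 32-bit key assumed to live in f[0]. The sort itself
 * only moves packed (key << 32 | index) ulongs; full records are moved
 * exactly once in the final gather pass. */
void sort_records_by_key(record_t *recs, size_t n)
{
    uint64_t *keys = malloc(n * sizeof *keys);
    record_t *tmp  = malloc(n * sizeof *tmp);
    for (size_t i = 0; i < n; i++)
        keys[i] = ((uint64_t)(uint32_t)recs[i].f[0] << 32) | (uint32_t)i;
    qsort(keys, n, sizeof *keys, cmp_u64);  /* stand-in for the GPU sort */
    for (size_t i = 0; i < n; i++)
        tmp[i] = recs[(uint32_t)keys[i]];   /* gather by recovered index */
    memcpy(recs, tmp, n * sizeof *recs);
    free(tmp);
    free(keys);
}
```

Packing the index into the low bits also makes the sort stable with respect to the original order of equal keys.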