Implementing a fast sorting algorithm -- how?


I am using a Tesla M2075 on gentoo linux with nvidia-cuda-sdk-4.1 and nvidia-drivers-295.49.

I am kind of new to this and am trying to implement a fast sorting algorithm, as outlined on this page:

I, however, cannot get to the performance, > 100m keys per second. I only get about 15-20M keys p/s.

I used NVVP and tried to do loop unrolling for avoiding replay overhead but it didn’t change the performance.

Here is a screenshot of NVVP with timeline and details graph:

Here is the cl source code of the kernels I am using:

UPDATE: How the kernels are started (code is from bealto implementation):

UPDATE 2: The view from NVVP displays one sort using multiple specialized forms of a bitonic merge sorts

UPDATE 3: Details for the longest running kernel invocations:
Best wishes


Bonus question: NVVP says in one of the analysis tabs that my multiprocessor usage/occupancy is low because I am using too many registers. Does anyone have an idea how to reduce the register usage in my algorithms? I read in the nvidia performance guide that it has to do with the amount of variables used but I am not sure how I could reduce this…

I have now tweaked the code for a while. It turns out that when I replace the type of my data typedef struct { uint64 f[FIELDS_PER_RECORD]; } record_t; '' with just typedef ulong record_t;’’ the performance of my sorter goes up to 70Mkeys/s. That’s quite an improvement.