I am using a Tesla M2075 on Gentoo Linux with nvidia-cuda-sdk-4.1 and nvidia-drivers-295.49.
I am kind of new to this and am trying to implement a fast sorting algorithm, as outlined on this page:
However, I cannot reach the expected performance of more than 100M keys per second; I only get about 15-20M keys per second.
I profiled with NVVP and tried loop unrolling to avoid replay overhead, but it did not change the performance.
Here is a screenshot of NVVP with timeline and details graph: http://i.imgur.com/ldijF.png
Here is the OpenCL source code of the kernels I am using: http://pastebin.com/uVtyin74
UPDATE: How the kernels are launched (the code is from the bealto implementation): http://pastebin.com/Sx4Stfth
UPDATE 2: The NVVP view shows a single sort, which is carried out by multiple specialized forms of bitonic merge sort kernels.
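For readers unfamiliar with the structure behind those kernel launches, here is a minimal CPU reference of a bitonic sorting network in plain C. It is only a sketch, not the OpenCL code linked above: the function name `bitonic_sort`, the `uint32_t` key type, and the power-of-two length requirement are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sorts keys[0..n-1] in ascending order with a bitonic network.
 * Assumption: n is a power of two. */
void bitonic_sort(uint32_t *keys, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);

    /* k: length of the bitonic sequences being merged in this stage. */
    for (size_t k = 2; k <= n; k <<= 1) {
        /* j: distance between the two elements compared in this pass. */
        for (size_t j = k >> 1; j > 0; j >>= 1) {
            /* On a GPU, each i would be handled by its own work-item. */
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner <= i)
                    continue; /* handle each pair exactly once */

                /* Blocks of length k alternate between ascending and descending. */
                int ascending = ((i & k) == 0);
                int out_of_order = ascending ? (keys[i] > keys[partner])
                                             : (keys[i] < keys[partner]);
                if (out_of_order) {
                    uint32_t tmp = keys[i];
                    keys[i] = keys[partner];
                    keys[partner] = tmp;
                }
            }
        }
    }
}
```

Each `(k, j)` pair is one full pass over the data, so sorting n keys takes log2(n) * (log2(n) + 1) / 2 passes. As far as I understand it, the specialized kernels in the timeline combine several consecutive `j` steps into one launch to reduce the number of passes over global memory.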
UPDATE 3: Details for the longest running kernel invocations: http://i.imgur.com/GyYaG.png