Hey,
I am using a Tesla M2075 on gentoo linux with nvidia-cuda-sdk-4.1 and nvidia-drivers-295.49.
I am kind of new to this and am trying to implement a fast sorting algorithm, as outlined on this page:
http://www.bealto.com/gpu-sorting_parallel-bitonic-2.html
I, however, cannot get to the performance, > 100m keys per second. I only get about 15-20M keys p/s.
I used NVVP and tried to do loop unrolling for avoiding replay overhead but it didn’t change the performance.
Here is a screenshot of NVVP with timeline and details graph: http://i.imgur.com/ldijF.png
Here is the cl source code of the kernels I am using: sorting kernels - Pastebin.com
UPDATE: How the kernels are started (code is from bealto implementation): bool sort(Context * c,int targetDevice,int n,cl_mem in,cl_mem out) const - Pastebin.com
UPDATE 2: The view from NVVP displays one sort using multiple specialized forms of a bitonic merge sorts
UPDATE 3: Details for the longest running kernel invocations: http://i.imgur.com/GyYaG.png
Best wishes
boundogre