Lexicographic Sorting - Poor performance

I’m trying to develop a sorting operation on a large dataset as an intermediate step in a larger application.

Right now, these are the specifics:

  • A 1-D character array, Words, which stores the words serially, i.e., one word after another in contiguous memory locations.
  • Each word in the character array is WIDTH_WORD characters long, where WIDTH_WORD is typically around 400.
  • An integer array, ptr, such that ptr[i] stores the offset of the beginning of the i-th word in the Words array.
  • There are n such words in total, where n is typically on the order of 10^4.
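To make the layout concrete, here is a host-side C sketch of how the buffers fit together (pack_words is purely illustrative, not my actual code; it assumes fixed-width, NUL-padded slots):

```c
#include <string.h>

#define WIDTH_WORD 400  /* fixed slot width, as described above */

/* Hypothetical packing routine: word i occupies Words[ptr[i]] onward.
   With fixed-width slots, ptr[i] is simply i * WIDTH_WORD. */
void pack_words(const char **src, int n, char *Words, int *ptr)
{
    for (int i = 0; i < n; ++i) {
        ptr[i] = i * WIDTH_WORD;
        memset(Words + ptr[i], 0, WIDTH_WORD);        /* pad with NULs */
        strncpy(Words + ptr[i], src[i], WIDTH_WORD - 1);
    }
}
```

The NUL padding matters later: it means a fixed-length memcmp over two slots gives the same order as a string comparison.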

In order to do an efficient lexicographic sort on the Words array, I am currently using a rather poor self-designed version of the bitonic sort.
Right now, the sorting is highly inefficient; these are the key steps I’m following at the moment:

  1. There are two kernels: one doing the even sort and the other the odd sort.
  2. Launch the kernel so that each thread handles two words in the Words array.
  3. Once launched, each thread makes global reads for ptr[2*threadId] and ptr[2*threadId + 1]. There are a lot of scattered reads in doing this.
  4. Once the ptr[] elements have been read, fetch the characters of the words they point to and do a character-by-character comparison.
    This too involves a lot of non-coalesced global reads.
  5. Once the words’ relative order has been decided by the character-wise comparison, each thread writes the changes back into the ptr[] array. Again, lots of scattered writes.

Repeat this procedure until all elements are sorted.
Typically, these two kernels, odd-sort and even-sort, are launched around WIDTH_WORD × 10^4 = 4 × 10^6 times.
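Conceptually, each pair of launches does the equivalent of one pass of this sequential sketch (plain C for illustration only; my kernels do the pair comparisons in parallel, one pair per thread):

```c
#include <string.h>

#define WIDTH_WORD 400

/* One pass of the odd-even scheme described above:
   parity 0 compares pairs (0,1),(2,3),...; parity 1 compares (1,2),(3,4),...
   Only the offsets in ptr[] are swapped, never the words themselves.
   memcmp over the full NUL-padded slot gives lexicographic order. */
void odd_even_pass(const char *Words, int *ptr, int n, int parity)
{
    for (int i = parity; i + 1 < n; i += 2) {
        if (memcmp(Words + ptr[i], Words + ptr[i + 1], WIDTH_WORD) > 0) {
            int tmp = ptr[i];
            ptr[i] = ptr[i + 1];
            ptr[i + 1] = tmp;
        }
    }
}

/* n alternating passes sort the array (odd-even transposition sort). */
void odd_even_sort(const char *Words, int *ptr, int n)
{
    for (int pass = 0; pass < n; ++pass)
        odd_even_pass(Words, ptr, n, pass & 1);
}
```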

And this is taking a LOT of time to compute. The CPU, in fact, is able to do better. It’s taking at least half an hour to churn through the data at the moment.

Any suggestions for a design-improvement?

PS: I’ve read Satish et al.'s paper on the efficient radix sort method. But the thing is, I simply am not able to port my current design to their algorithm, as I foresee similar scattered global-memory reads if I do. Moreover, their algorithm deals with things at the bit level, which, quite frankly, is hard for me to code right now.

Any suggestions?

Wow, this sounds like an application the compute 2.0 caches would love. Can you find a GTX 400-series card to try this on, just to see?

I’m working on a Quadro 3700 at the moment. Is that 2.0 compatible?

It’s equivalent to a GTX 8800, so I think it ought to be 2.0 compatible.

And if so, that’s the performance I’m getting as of now! Half an hour to do the above-mentioned sort!

Unfortunately, that’s the best card I have in my labs for CUDA-intensive work.

In any case, do you see any obvious design changes which could help improve the performance?

No, sadly a Quadro 3700 is a compute 1.1 device, which means the penalty for uncoalesced reads is large. Assuming you could run on a GTX 4xx or Tesla C2050/2070 (which are compute 2.0 devices), you would not need to do anything special to benefit from the cache.

Oh. That’s sad :mellow:

Well, given that I won’t be able to access a Tesla (for sure), is there any other design change I might consider, given my present design?

Quite frankly, I’ve blown out all the little transistors in my head trying to optimize the thing. And all in vain, so far.

It’s quite heart-breaking to see a CPU outdo your GPU code, you see. Any suggestions would brighten my day…

I don’t have any good ideas. Sorting large, variable size elements is a tough problem.

Yes. Apparently, it is hard.

Umm, have you gone through the paper by Satish et al. titled “Designing Efficient Sorting Algorithms for Manycore GPUs”?

If so, do you see how I may incorporate the logic presented there?

Somehow, I feel a total disconnect between the logic presented there, which seems so very implementable, and my present conundrum here.

You might look into the approach this programmer took to sort strings with Thrust:

http://ldn.linuxfoundation.org/article/c-gpu-and-thrust-strings-gpu
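The general idea there — sorting the offsets with a comparator over the packed character buffer, rather than moving the strings themselves — can be sketched sequentially like this (plain C with qsort; I’m not claiming this is exactly what the article does):

```c
#include <stdlib.h>
#include <string.h>

#define WIDTH_WORD 400

static const char *g_words;  /* packed buffer, visible to the comparator */

/* Compare two offsets by the fixed-width, NUL-padded words they point to. */
static int cmp_offsets(const void *a, const void *b)
{
    return memcmp(g_words + *(const int *)a,
                  g_words + *(const int *)b, WIDTH_WORD);
}

/* Sort the offset array; the character data never moves. */
void sort_offsets(const char *Words, int *ptr, int n)
{
    g_words = Words;
    qsort(ptr, n, sizeof ptr[0], cmp_offsets);
}
```

With Thrust, the analogous move is a device-side sort over the index/offset array using a comparison functor that dereferences into the character buffer.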

And yes, would it be possible for you to test my code out on a Tesla, if you have one lying around? :mellow:

Could you store the words in a large 1-D texture (of type uchar), maybe? Then you get the benefits of the texture cache.
You’d have to store offsets into the texture then, instead of pointers to the actual word locations.

Are you currently swapping the words in memory, or just swapping the pointers to them?
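To make the distinction concrete (a hypothetical sketch, not your code): swapping the words themselves moves 2 × WIDTH_WORD bytes per exchange, while swapping the offsets moves just two ints:

```c
#include <string.h>

#define WIDTH_WORD 400

/* Expensive: physically exchange two fixed-width words
   (2 * WIDTH_WORD bytes of traffic per swap). */
void swap_words(char *Words, int a, int b)
{
    char tmp[WIDTH_WORD];
    memcpy(tmp, Words + a, WIDTH_WORD);
    memcpy(Words + a, Words + b, WIDTH_WORD);
    memcpy(Words + b, tmp, WIDTH_WORD);
}

/* Cheap: exchange only the two offsets (two ints of traffic). */
void swap_ptrs(int *ptr, int i, int j)
{
    int tmp = ptr[i];
    ptr[i] = ptr[j];
    ptr[j] = tmp;
}
```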

Yes, I have GTX 470 downstairs. If your code is self-contained and will compile in Linux, I’m more than happy to try it out.