Lexicographic Sorting - Poor performance

I’m trying to develop a sorting operation on a large dataset as an intermediate step in a larger application.

Right now, these are the specifics:

  • A 1-D character array, Words, which stores the words serially, i.e., one word after another in contiguous memory locations.
  • Each word in the character array is WIDTH_WORD characters long, where WIDTH_WORD is typically around 400.
  • An integer array, ptr, such that ptr[i] stores the offset of the beginning of the i-th word in the Words array.
  • There are n such words in total, where n is typically on the order of 10^4.
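To make the layout concrete, here is a host-side C sketch of how the buffers fit together (pack_words is purely illustrative, not my actual code; it assumes fixed-width, NUL-padded slots):

```c
#include <string.h>

#define WIDTH_WORD 400  /* fixed slot width, as described above */

/* Hypothetical packing routine: word i occupies Words[ptr[i]] onward.
   With fixed-width slots, ptr[i] is simply i * WIDTH_WORD. */
void pack_words(const char **src, int n, char *Words, int *ptr)
{
    for (int i = 0; i < n; ++i) {
        ptr[i] = i * WIDTH_WORD;
        memset(Words + ptr[i], 0, WIDTH_WORD);        /* pad with NULs */
        strncpy(Words + ptr[i], src[i], WIDTH_WORD - 1);
    }
}
```

The NUL padding matters later: it means a fixed-length memcmp over two slots gives the same order as a string comparison.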

In order to do an efficient lexicographic sort on the Words array, I am currently using a rather poor self-designed version of the bitonic sort.
Right now, the sorting is highly inefficient; these are the key steps I’m following at the moment:

  1. There are two kernels: one doing the even sort and the other the odd sort.
  2. Launch the kernel so that each thread handles two words in the Words array.
  3. Once launched, each thread makes global reads for ptr[2*threadId] and ptr[2*threadId + 1]. There are a lot of scattered reads in doing this.
  4. Once the ptr[] elements have been read, fetch the characters of the words they point to and do a character-by-character comparison.
    This too involves a lot of non-coalesced global reads.
  5. Once the words’ relative order has been decided by the character-wise comparison, each thread writes the changes back into the ptr[] array. Again, lots of scattered writes.

Repeat this procedure until all elements are sorted.
Typically, these two kernels, odd-sort and even-sort, are launched around WIDTH_WORD × 10^4 = 4 × 10^6 times.
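Conceptually, each pair of launches does the equivalent of one pass of this sequential sketch (plain C for illustration only; my kernels do the pair comparisons in parallel, one pair per thread):

```c
#include <string.h>

#define WIDTH_WORD 400

/* One pass of the odd-even scheme described above:
   parity 0 compares pairs (0,1),(2,3),...; parity 1 compares (1,2),(3,4),...
   Only the offsets in ptr[] are swapped, never the words themselves.
   memcmp over the full NUL-padded slot gives lexicographic order. */
void odd_even_pass(const char *Words, int *ptr, int n, int parity)
{
    for (int i = parity; i + 1 < n; i += 2) {
        if (memcmp(Words + ptr[i], Words + ptr[i + 1], WIDTH_WORD) > 0) {
            int tmp = ptr[i];
            ptr[i] = ptr[i + 1];
            ptr[i + 1] = tmp;
        }
    }
}

/* n alternating passes sort the array (odd-even transposition sort). */
void odd_even_sort(const char *Words, int *ptr, int n)
{
    for (int pass = 0; pass < n; ++pass)
        odd_even_pass(Words, ptr, n, pass & 1);
}
```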

And this is taking a LOT of time to compute. The CPU, in fact, is able to do better. It’s taking at least half an hour to churn through the data at the moment.

Any suggestions for a design-improvement?

PS: I’ve read Satish et al.'s paper on the efficient radix sort method. But the thing is, I simply am not able to port my current design to their algorithm, as I foresee similar scattered global-memory reads if I do. Moreover, their algorithm deals with things at the bit level, which, quite frankly, is hard for me to code right now.

Any suggestions?

Wow, this sounds like an application the compute 2.0 caches would love. Can you find a GTX 400-series card to try this on, just to see?

I’m working on a Quadro 3700 at the moment. Is that 2.0 compatible?

It’s equivalent to a GTX 8800, so I think it ought to be 2.0 compatible.

And if so, that’s the performance I’m getting as of now! Half an hour to do the above-mentioned sort!

Unfortunately, that’s the best card I have in my labs for CUDA-intensive work.

In any case, do you see any obvious design changes which could help improve the performance?

No, sadly a Quadro 3700 is a compute 1.1 device, which means the penalty for uncoalesced reads is large. Assuming you could run on a GTX 4xx or Tesla C2050/2070 (which are compute 2.0 devices), you would not need to do anything special to benefit from the cache.

Oh. That’s sad :mellow:

Well, given that I won’t be able to access a Tesla (for sure), is there any other design change I might consider, given my present design?

Quite frankly, I’ve blown out all the little transistors in my head trying to optimize the thing. And all in vain, so far.

It’s quite heart-breaking to see a CPU outdo your GPU code, you see. Any suggestions would brighten my day…

I don’t have any good ideas. Sorting large, variable size elements is a tough problem.

Yes. Apparently, it is hard.

Umm, have you gone through the paper by Satish et al. titled “Designing Efficient Sorting Algorithms for Manycore GPUs”?

If so, do you see how I may incorporate the logic presented there?

Somehow, I feel a total disconnect between the logic presented there, which seems so very implementable, and my present conundrum here.

You might look into the approach this programmer took to sort strings with Thrust:

http://ldn.linuxfoundation.org/article/c-gpu-and-thrust-strings-gpu
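The general idea there — sorting the offsets with a comparator over the packed character buffer, rather than moving the strings themselves — can be sketched sequentially like this (plain C with qsort; I’m not claiming this is exactly what the article does):

```c
#include <stdlib.h>
#include <string.h>

#define WIDTH_WORD 400

static const char *g_words;  /* packed buffer, visible to the comparator */

/* Compare two offsets by the fixed-width, NUL-padded words they point to. */
static int cmp_offsets(const void *a, const void *b)
{
    return memcmp(g_words + *(const int *)a,
                  g_words + *(const int *)b, WIDTH_WORD);
}

/* Sort the offset array; the character data never moves. */
void sort_offsets(const char *Words, int *ptr, int n)
{
    g_words = Words;
    qsort(ptr, n, sizeof ptr[0], cmp_offsets);
}
```

With Thrust, the analogous move is a device-side sort over the index/offset array using a comparison functor that dereferences into the character buffer.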

And yes, would it be possible for you to test my code out on a Tesla, if you have one lying around? :mellow:

Could you store the words in a large 1-D texture (of type uchar), maybe? Then you get the benefits of the texture cache.
You’d have to store offsets into the texture then, instead of pointers to the actual word locations.

Are you currently swapping the words in memory, or just swapping the pointers to them?
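To make the distinction concrete (a hypothetical sketch, not your code): swapping the words themselves moves 2 × WIDTH_WORD bytes per exchange, while swapping the offsets moves just two ints:

```c
#include <string.h>

#define WIDTH_WORD 400

/* Expensive: physically exchange two fixed-width words
   (2 * WIDTH_WORD bytes of traffic per swap). */
void swap_words(char *Words, int a, int b)
{
    char tmp[WIDTH_WORD];
    memcpy(tmp, Words + a, WIDTH_WORD);
    memcpy(Words + a, Words + b, WIDTH_WORD);
    memcpy(Words + b, tmp, WIDTH_WORD);
}

/* Cheap: exchange only the two offsets (two ints of traffic). */
void swap_ptrs(int *ptr, int i, int j)
{
    int tmp = ptr[i];
    ptr[i] = ptr[j];
    ptr[j] = tmp;
}
```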

Yes, I have GTX 470 downstairs. If your code is self-contained and will compile in Linux, I’m more than happy to try it out.