I’m trying to develop a sorting operation on a large data set as an intermediate step in a larger application.
Here are the specifics of my current setup:
A 1-D character array, Words, which stores the words serially, i.e., one word after another in contiguous memory locations.
Each word in the character array is WIDTH_WORD characters long, where WIDTH_WORD is typically around 400.
An integer array, ptr, such that ptr[i] stores the offset of the beginning of the i-th word in the Words array.
There are n such words in total, where n is typically on the order of 10^4.
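For concreteness, here is a minimal host-side sketch of that layout (Words, ptr, WIDTH_WORD, and n as above; the fill contents are placeholder assumptions, not the real data):

```cuda
#include <cstdio>
#include <cstring>

#define WIDTH_WORD 400    // fixed width of each word slot
#define N_WORDS    10000  // n is on the order of 10^4

// One flat character buffer holding all words back to back;
// ptr[i] is the offset of the start of the i-th word inside words[].
char words[N_WORDS * WIDTH_WORD];
int  ptr[N_WORDS];

int main() {
    for (int i = 0; i < N_WORDS; ++i) {
        ptr[i] = i * WIDTH_WORD;
        // Placeholder contents; the real words would be copied in here.
        snprintf(&words[ptr[i]], WIDTH_WORD, "word%05d", i);
    }

    // A lexicographic comparison of two words touches up to WIDTH_WORD
    // characters of each -- the per-pair cost every compare step pays.
    int cmp = memcmp(&words[ptr[3]], &words[ptr[7]], WIDTH_WORD);
    printf("word 3 %s word 7\n", cmp < 0 ? "<" : ">=");
    return 0;
}
```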
To do an efficient lexicographic sort on the Words array, I am currently using a rather poor self-designed version of Bitonic Sort.
Right now the sort is highly inefficient; these are the key steps I’m following at the moment:
There are two kernels: one performs the even pass and the other the odd pass.
Each kernel is launched so that every thread handles two words in the Words array.
Once launched, each thread makes global reads for ptr[2*threadId] and ptr[2*threadId + 1]. These reads are heavily scattered.
Once the ptr[] elements have been read, fetch the characters of the words they point to and compare them character by character.
This also involves many non-coalesced global reads.
Once the words’ relative order has been decided by the character-wise comparison, each thread writes the changes back into the ptr[] array. Again, lots of scattered writes.
This procedure repeats until all elements are sorted.
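The steps above amount to an odd-even compare-exchange on the ptr[] entries. A minimal sketch of the even pass as described (one thread per adjacent pair, swapping only the offsets, never the words themselves; the kernel and helper names are made up for illustration):

```cuda
#define WIDTH_WORD 400

// Compare two fixed-width words character by character.
__device__ int cmpWords(const char *words, int a, int b) {
    for (int k = 0; k < WIDTH_WORD; ++k) {
        char ca = words[a + k], cb = words[b + k];
        if (ca != cb) return ca < cb ? -1 : 1;
        if (ca == '\0') return 0;          // both words ended early
    }
    return 0;
}

// Even pass: thread t owns the pair (2t, 2t+1).
// (The odd pass is identical with indices (2t+1, 2t+2).)
__global__ void evenPass(const char *words, int *ptr, int n) {
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 1 >= n) return;
    int a = ptr[i], b = ptr[i + 1];        // the scattered reads mentioned above
    if (cmpWords(words, a, b) > 0) {       // plus up to 2*WIDTH_WORD more reads
        ptr[i]     = b;                    // scattered writes back into ptr[]
        ptr[i + 1] = a;
    }
}
```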
Typically, these two kernels, odd-sort and even-sort, are launched around WIDTH_WORD × 10^4 ≈ 4 × 10^6 times in total.
And this is taking a LOT of time to compute. The CPU, in fact, does better: the GPU is taking at least half an hour to churn through the data at the moment.
Any suggestions for a design-improvement?
PS: I’ve read Satish et al.’s paper on the efficient Radix Sort method. The thing is, I simply haven’t been able to port my current design to their algorithm, as I foresee similarly scattered global-memory reads if I do. Moreover, their algorithm works at the bit level, which, quite frankly, is hard for me to code right now.
No, sadly a Quadro 3700 is a compute 1.1 device, which means the penalty for uncoalesced reads is large. Assuming you could run on a GTX 4xx or Tesla C2050/2070 (which are compute 2.0 devices), you would not need to do anything special to benefit from the cache.
Could you store the words in a large 1D texture (of type uchar) maybe? Then you get the benefits of the texture cache.
You’d have to store X offsets in the texture then instead of pointers to the actual word locations.
Are you currently swapping the words in memory, or just swapping the pointers to them?
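A rough sketch of that texture idea, using the legacy texture reference API available on compute 1.1 devices (names are made up; error checking omitted):

```cuda
// Reads through texWords go via the texture cache, which softens the
// penalty for uncoalesced access on pre-Fermi (compute 1.x) hardware.
texture<unsigned char, 1, cudaReadModeElementType> texWords;

__device__ int cmpWordsTex(int a, int b, int width) {
    for (int k = 0; k < width; ++k) {
        unsigned char ca = tex1Dfetch(texWords, a + k);
        unsigned char cb = tex1Dfetch(texWords, b + k);
        if (ca != cb) return ca < cb ? -1 : 1;
        if (ca == 0) return 0;
    }
    return 0;
}

// Host side, before launching the sort kernels:
//   cudaBindTexture(0, texWords, d_words, totalBytes);
```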