# Array of lists

Hi,

I have an array of lists on my device memory, each list has different number of entries.
I want to combine all lists into single list.

It is easy to do so using single thread (go over all lists and copy entries to single list).
Is there efficient way to do it using multiple threads?

One possible approach to consider: assuming you know the length of each list, this would be a variant of a simple copy kernel. Suppose you have one array which is an array of pointers to the starting positions of each list. Suppose you have another array which is the length of each list. And lets assume you have a pointer to allocated space where the output list will be. Each thread will:

1. Get the next list length, and also add previous list length to a running sum of list lengths
2. compare its thread index to the length of the that list
3. If the thread index is less than the length of that list, do the following steps 3-7, else skip to step 8
4. Get the next list pointer
5. Offset that pointer by its thread index
6. Create an output pointer which is the pointer to the output start, and offset it by the sum of all previously processed list lengths (accumulated in step 1), and the thread index
7. Using the pointers created in steps 5, and 6, copy an element from the list to the output list
8. Start over at step 1, but advancing to the next list. Repeat until all lists are processed.

This description assumes that the size of the thread array (i.e. the grid size) is greater than or equal to the longest input list. However a typical modification could be done to instead use an arbitrary size grid along with grid-stride loops. There shouldn’t be any need for any sort of inter-thread synchronization of any kind for this algorithm.

If you have a preponderance of short lists to concatenate, there may be more efficient approaches than this.

Hi,

I have ~100K lists, each has only several entries (and many are empty). I expect a total of only few thousands entries at the combined list.

I think that the solution you suggested will have only minor improvements as all threads will have to scan all lists. I’m thinking of calculating cumulative summation of the lengths, and having multiple threads going through multiple lists in parallel.

Yes, that will be a better approach for very short lists.

Thanks