Using Thrust to operate with vectors

I have three device_vectors - “similarity”, “group_ids”, “object_ids” with a dimension of 5_000_000.
Vectors are already sorted by “group_ids”, and within “group_ids” by “similarity” (double sorting).

“similarity” - assessment of the similarity of the object;
“object_ids” - unique object index;
“group_ids” - index of the group to which the object belongs.

Also, from the input I have the number “limit” and “group_ids_to_search”.

“group_ids_to_search” - a vector with indices “id” of groups for which it is necessary to generate results.
The response must contain no more than “limit” values for each “id” of “group_ids_to_search” and “group_ids”.
If “id” is not present in “group_ids”, fill it with 0 responses.

For example, imagine “similarity”, “group_ids”, “object_ids” vectors of dimension 9:

INPUT:

similarity |0.9, 0.85, 0.8, 0.77,| 0.98, 0.8,|  0.86, 0.7, 0.44|
group_ids  |1,   1,    1,   1,   | 2,    2,  |  3,    3,   3   |
object_ids |1,   4,    5,   2,   | 3,    7,  |  6,    9,   8   |

limit = 2;
group_ids_to_search = {1,4,2,3};

OUTPUT:

similarity_out |0.9, 0.85,| 0,   0, | 0.98, 0.8,|  0.86, 0.7|
group_ids_out  |1,   1,   | 4,   4, | 2,    2,  |  3,    3, |
object_ids_out |1,   4,   | 0,   0, | 3,    7,  |  6,    9, |

Since id=4 from (group_ids_to_search) is not in group_ids, its “similarity_out” and “object_ids_out” are filled with the value 0.

I implement this functionality using the “Thrust” library.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <thrust/copy.h>
#include <iostream>

template <typename T>
void printDeviceVector(const std::string& message, const thrust::device_vector<T>& list_ids) {
    std::cout << message<< ": ";
    for (const T& id : list_ids) {
        std::cout << id << " ";
    }
    std::cout << std::endl;
}

int main()
{
    // Create INPUT 
    thrust::device_vector<float> similarity = {0.9, 0.85, 0.8, 0.77, 0.98, 0.8, 0.86, 0.7, 0.44};
    thrust::device_vector<int> group_ids = {1, 1, 1, 1, 2, 2, 3, 3, 3};
    thrust::device_vector<int> object_ids = {1, 4, 5, 2, 3, 7, 6, 9, 8};
    thrust::device_vector<int> group_ids_to_search = {1,4,2,3};
    int limit = 2;

    size_t size_search = group_ids_to_search.size();
    
    // Create OUTPUT 
    thrust::device_vector<float> similarity_out(size_search*limit);
    thrust::device_vector<int> object_ids_out(size_search*limit);
    thrust::device_vector<int> group_ids_out(size_search*limit);

    // Create thrust::device_vector to store pointers to the lower and upper bounds of the array group_ids
    thrust::device_vector<unsigned int> lower_bounds(size_search);
    thrust::device_vector<unsigned int> upper_bounds(size_search);


    auto lower = thrust::lower_bound(
        thrust::device, 
        group_ids.begin(), 
        group_ids.end(), 
        group_ids_to_search.begin(), 
        group_ids_to_search.end(),
        lower_bounds.begin()
        );

    auto upper = thrust::upper_bound(
        thrust::device, 
        group_ids.begin(), 
        group_ids.end(), 
        group_ids_to_search.begin(), 
        group_ids_to_search.end(),
        upper_bounds.begin());


    for (int i=0, sim_j = 0; i<size_search; i++, sim_j = sim_j + limit){

        // Slice by "similarity" and "object_ids"
        thrust::device_vector<float> similarity_slice = {similarity.begin() + lower_bounds[i], similarity.begin() + upper_bounds[i]};
        thrust::device_vector<int> object_ids_slice = {object_ids.begin() + lower_bounds[i], object_ids.begin() + upper_bounds[i]};

        thrust::device_vector<int> group_ids_slice(limit);
        thrust::fill( group_ids_slice.begin(), group_ids_slice.end(),  group_ids_to_search[i]);

        // Resize. If the size of the vector was 0, then it will fill its values with zeros. (For the case when "id" = 4 from group_ids_to_search)
        similarity_slice.resize(limit);
        object_ids_slice.resize(limit);

        // Fill the output arrays
        thrust::copy(similarity_slice.begin(), similarity_slice.end(), similarity_out.begin()+sim_j);
        thrust::copy(object_ids_slice.begin(), object_ids_slice.end(), object_ids_out.begin()+sim_j);
        thrust::copy(group_ids_slice.begin(), group_ids_slice.end(), group_ids_out.begin()+sim_j);

    }
    printDeviceVector("similarity_out", similarity_out);
    printDeviceVector("group_ids_out", group_ids_out);
    printDeviceVector("object_ids_out", object_ids_out);

}

However, on vectors “similarity”, “group_ids”, “object_ids” of size 5_000_000 and “group_ids_to_search” of size 1000, this gives a slow result of ~80_000 microseconds.
If the last operations in the for loop are transferred to “host_vector”, then the result is ~30_000 microseconds, which is also slow.

I don’t have enough understanding of how to get rid of the slow for loop to generate the final vectors.
I think this can be done using thrust::for_each and thrust::transform and functors, but I don’t understand how exactly. Please help.

What is the limit used in practice? Also limit=2 ?
What are the expected min / avg / max elements per group in the input?

Later:
I would suggest the following approach

similarity |0.9, 0.85, 0.8, 0.77,| 0.98, 0.8,|  0.86, 0.7, 0.44|
group_ids  |1,   1,    1,   1,   | 2,    2,  |  3,    3,   3   |
object_ids |1,   4,    5,   2,   | 3,    7,  |  6,    9,   8   |
limit = 2;
group_ids_to_search = {1,4,2,3};

First of all, determine the number of elements per group using reduce by key / runlength encoding.
Then compute an exclusive prefix sum to get the group offsets. Next, determine the copy parameters, and peform the copy.

group_sizes | 4 | 2 | 3 | 
group_offsets | 0 | 4 | 6 |
num_outputs_per_searchid | 2 | 2 | 2 | 2 |
num_outputs_to_copy_per_searchid | 2 | 0 | 2 | 2 |
copy_src_offsets | 0 | - | 4 | 6|
copy_dst_offsets | 0 | 2 | 4 | 6|

Then copy the respective number of elements from src offsets to dst offsets into a zero-initialized output vector.

In practice the “limit” is dynamic, but is expected to be min/avg/max = 2/5/10.
Unique “ids” in “group_ids” ~ 1000.
Unique “id_searchs” in “group_ids_to_search” ~ 1000.
But unique “ids” is not always equal to unique “id_searchs”.

10 groups of 500_000 objects and 1000 groups of 100 objects. Total 5_100_000 objects.

Thank you for your help. I’m trying to apply your solution. But I don’t understand how to get:
num_outputs_to_copy_per_searchid | 2 | 0 | 2 | 2 |
copy_src_offsets | 4 | - | 2 | 3|
copy_dst_offsets | 0 | 2 | 4 | 6|
And how to implement the copying process. Can you help with this question?

num_outputs_to_copy_per_searchid is computed from num_outputs_per_searchid where the entries are set to 0 for search ids that do not exist in the data, i.e. search id 4 in your example.

copy_src_offsets is where the groups begin from which data should be copied.
copy_src_offsets | 4 | - | 2 | 3| is wrong. My mistake. I corrected it above, it should be copy_src_offsets | 0 | - | 4 | 6|.

copy_dst_offsets is the prefix sum of num_outputs_per_searchid

To perform the copy, you could write a kernel that uses 1 warp per group to copy the data. Or you could use cub::DeviceCopy cub::DeviceCopy — CUB 104.0 documentation