Need help understanding thrust::scatter!

Okay, I just found out a really cool thing to do is sort any number of arrays by key by splitting up the operation like so:

thrust::sequence(indices.begin(), indices.end());
thrust::sort_by_key(keyvals.begin(), keyvals.end(), indices.begin());

foreach (vector_to_be_sorted)
    thrust::scatter(data.begin(), data.end(), indices.begin(), sorted_data.begin());

Reading the documentation it says, “For each iterator i in the range [first, last), the value i is assigned to output[(map + (i - first))].”

Okay, let’s say we have some data.

keys = { 0, 7, 6, 4, 2, 3, 5, 1 };
data = { a, b, c, d, e, f, g, h };

Doing the initial sort, we get

keys    = { 0, 1, 2, 3, 4, 5, 6, 7 };
indices = { 0, 7, 4, 5, 3, 6, 2, 1 };

In theory, data should look like this at the end :

data = { a, h, e, f, d, g, c, b };

But every time I try to do this on a piece of paper, I swear I’m doing it wrong.

From the description, it sounds

data[0] = a is mapped to indices[0] = 0
data[1] = b is mapped to indices[1] = 7
data[2] = c is mapped to indices[2] = 4

The one for data[2] does not look true at all!

Instead, it seems to work if

output[i] = data[indices[i]]

I think I’m just really confusing myself here…

Let’s start with unsorted data, and a linear key sequence. The first trick is to reverse the meaning of keys and values (data):

int keys = { 107, 103, 104, 100 };
int vals = {   0,   1,   2,   3 };

If I then do a thrust sort_by_key, using the above “keys” as my key array and the above “vals” as my values array, I will end up with:

keys = { 100, 103, 104, 107 };
vals = {   3,   1,   2,   0 };

I can now use the “vals” array as the output index into a “results” array to sort any other array according to the same sort order. For example if I had a data array like this:

char data = { 'h', 'd', 'e', 'a' };

and an “empty” results array of char:

char rslt = {  .,  .,   .,  . };

then I can do a thrust::scatter using the sorted vals as my “map” array:

data  = { 'h', 'd', 'e', 'a' };
 vals = {   3,   1,   2,   0 };
 rslt = { 'a', 'd', 'e', 'h' };

‘h’ went to position 3 in the result array because that’s where the index in vals indicated it should go during the scatter.

Indeed,

rslt[vals[i]] = data[i]  for i in 0,1,2...N

Gah, this is all so confusing! When you do it, it works but when I do it with a slightly larger example it seems to fail.

For example, consider this :

pa  : 7, 3, 4, 0, 2, 5, 1, 6
idx : 0, 1, 2, 3, 4, 5, 6, 7
ta  : a, b, c, d, e, f, g, h

sorted by pa as key...
pa  : 0, 1, 2, 3, 4, 5, 6, 7
idx : 3, 6, 4, 1, 2, 5, 7, 0
ta  : d, g, e, b, c, f, h, a

But when I run this code, the output is wrong! I’m getting

h 
d
e
a
c
f
b
g

But with my own custom kernel, I get the output that seems correct :

d
g
e
b
c
f
h
a

Here’s the code :

// nvcc -std=c++11 -O3 -o scatter scatter.cu 

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/sort.h>

const int tpb = 256,
          bpg = 512;

__device__
int currThreadID(void)
{
    return blockDim.x * blockIdx.x + threadIdx.x;
}

__device__
int grid_size(void)
{
    return blockDim.x * gridDim.x;
}

__global__
void scatter
(
const int n,
const int * __restrict__ input,
const int * __restrict__ indices,
int * __restrict__ output
)
{
    for (int tid = currThreadID(); tid < n; tid += grid_size())
    {
        const int idx = indices[tid];
        const int tmp = input[idx];
        output[tid] = tmp;
    }
}

int main(void)
{
    const int  pa[8] = {  7,   3,   4,   0,   2,   5,   1,   6 };
    const int idx[8] = {  0,   1,   2,   3,   4,   5,   6,   7 };
    const char ta[8] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };

    thrust::device_vector<int> d_pa(pa, pa + 8),
                               d_idx(idx, idx + 8),
                               d_ta(ta, ta + 8),
                               output(8, -1);

    thrust::sort_by_key(d_pa.begin(),
                        d_pa.end(),
                        d_idx.begin());

    /*thrust::scatter(d_ta.begin(),
                    d_ta.end(),
                    d_idx.begin(),
                    output.begin());*/

    scatter<<<bpg, tpb>>>(8, thrust::raw_pointer_cast(d_ta.data()), thrust::raw_pointer_cast(d_idx.data()), thrust::raw_pointer_cast(output.data()));

    cudaDeviceSynchronize();

    for (int i = 0; i < 8; ++i)
    {
        std::cout << ((char ) output[i]) << std::endl;
    }

    return 0;
}

Your kernel does not match the scatter operation provided by thrust.

Your kernel is doing this:

output[i] = input[indices[i]];

The thrust scatter operation is:

output[indices[i]] = input[i];

They are not the same. (Take a look at the last line of my previous post.)

If you rewrite your kernel as follows:

__global__
void scatter
(
const int n,
const int * __restrict__ input,
const int * __restrict__ indices,
int * __restrict__ output
)
{
    for (int tid = currThreadID(); tid < n; tid += grid_size())
    {
        const int idx = indices[tid];
        const int tmp = input[tid];
        output[idx] = tmp;
    }
}

I think you’ll get results that match betwen your kernel and thrust.

Note that this realization:

output[i] = input[indices[i]];

is the thrust::gather operation:

https://thrust.github.io/doc/group__gathering.html#ga6fdb1fe3ff0d9ce01f41a72fa94c56df

Interesting. Then why is my thrust::scatter operation giving wrong results? Why does gathering seem to work then?

Either you are not understanding what I said, or I am not understanding what you said. I interpreted this excerpt from your comment:

to mean that those were the results you were getting from thrust::scatter. (they are certainly not the results that your kernel produced.) Those are the correct results. If you modify your kernel to match what I indicated, then your kernel will produce that same output.

Am I doing my sort_by_key correctly? I swear to God I’m doing it right on paper but every time I use thrust::scatter, I get weird results.

Have you been able to successfully make the sorting work using thrust::scatter? I know you posted that small example but why is it not working for my dataset?

The sorted characters should be :

d g e b c f h a

our examples are different. The first line of my first response was “let’s start with unsorted data”.

My data order “matched” my key order:

int keys = { 107, 103, 104, 100 };
char data = { 'h', 'd', 'e', 'a' };

Your data order does not:

pa  : 7, 3, 4, 0, 2, 5, 1, 6
ta  : a, b, c, d, e, f, g, h

In my case, I consider the sort to be possibly “useful”.

In your case it is not useful - you already have the identified order that you want the data to appear in. That does not require a sort, and it can be done with a simple indexed copy operation (thrust::copy with a permutation iterator, if you wish. Or a trivial CUDA kernel.)

Whoa, permutation iterators look awesome! There’s so many amazing things in thrust!

Nevertheless, I don’t feel my explanation is satisfactory or compelling. If time permits, I will revisit and attempt to write up something that is clear and concise. I agree that your case and mine don’t seem to jibe, and I believe the difference is in the formulation of the problem/question. The behavior of the thrust functions is quite clear however, at least in my opinion. If you find that gather does what you want, then use it.