Need help understanding thrust::scatter!

MutantJohn · May 5, 2015, 1:43am

Okay, I just found out a really cool thing to do is sort any number of arrays by key by splitting up the operation like so:

thrust::sequence(indices.begin(), indices.end());
thrust::sort_by_key(keyvals.begin(), keyvals.end(), indices.begin());

foreach (vector_to_be_sorted)
    thrust::scatter(data.begin(), data.end(), indices.begin(), sorted_data.begin());

Reading the documentation it says, “For each iterator i in the range [first, last), the value i is assigned to output[(map + (i - first))].”

Okay, let’s say we have some data.

keys = { 0, 7, 6, 4, 2, 3, 5, 1 };
data = { a, b, c, d, e, f, g, h };

Doing the initial sort, we get

keys    = { 0, 1, 2, 3, 4, 5, 6, 7 };
indices = { 0, 7, 4, 5, 3, 6, 2, 1 };

In theory, data should look like this at the end :

data = { a, h, e, f, d, g, c, b };

But every time I try to do this on a piece of paper, I swear I’m doing it wrong.

From the description, it sounds

data[0] = a is mapped to indices[0] = 0
data[1] = b is mapped to indices[1] = 7
data[2] = c is mapped to indices[2] = 4

The one for data[2] does not look true at all!

Instead, it seems to work if

output[i] = data[indices[i]]

I think I’m just really confusing myself here…

Robert_Crovella · May 5, 2015, 2:36am

Let’s start with unsorted data, and a linear key sequence. The first trick is to reverse the meaning of keys and values (data):

int keys = { 107, 103, 104, 100 };
int vals = {   0,   1,   2,   3 };

If I then do a thrust sort_by_key, using the above “keys” as my key array and the above “vals” as my values array, I will end up with:

keys = { 100, 103, 104, 107 };
vals = {   3,   1,   2,   0 };

I can now use the “vals” array as the output index into a “results” array to sort any other array according to the same sort order. For example if I had a data array like this:

char data = { 'h', 'd', 'e', 'a' };

and an “empty” results array of char:

char rslt = {  .,  .,   .,  . };

then I can do a thrust::scatter using the sorted vals as my “map” array:

data  = { 'h', 'd', 'e', 'a' };
 vals = {   3,   1,   2,   0 };
 rslt = { 'a', 'd', 'e', 'h' };

‘h’ went to position 3 in the result array because that’s where the index in vals indicated it should go during the scatter.

Indeed,

rslt[vals[i]] = data[i]  for i in 0,1,2...N

MutantJohn · May 5, 2015, 4:45pm

Gah, this is all so confusing! When you do it, it works but when I do it with a slightly larger example it seems to fail.

For example, consider this :

pa  : 7, 3, 4, 0, 2, 5, 1, 6
idx : 0, 1, 2, 3, 4, 5, 6, 7
ta  : a, b, c, d, e, f, g, h

sorted by pa as key...
pa  : 0, 1, 2, 3, 4, 5, 6, 7
idx : 3, 6, 4, 1, 2, 5, 7, 0
ta  : d, g, e, b, c, f, h, a

But when I run this code, the output is wrong! I’m getting

h 
d
e
a
c
f
b
g

But with my own custom kernel, I get the output that seems correct :

d
g
e
b
c
f
h
a

Here’s the code :

// nvcc -std=c++11 -O3 -o scatter scatter.cu 

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/sort.h>

const int tpb = 256,
          bpg = 512;

__device__
int currThreadID(void)
{
    return blockDim.x * blockIdx.x + threadIdx.x;
}

__device__
int grid_size(void)
{
    return blockDim.x * gridDim.x;
}

__global__
void scatter
(
const int n,
const int * __restrict__ input,
const int * __restrict__ indices,
int * __restrict__ output
)
{
    for (int tid = currThreadID(); tid < n; tid += grid_size())
    {
        const int idx = indices[tid];
        const int tmp = input[idx];
        output[tid] = tmp;
    }
}

int main(void)
{
    const int  pa[8] = {  7,   3,   4,   0,   2,   5,   1,   6 };
    const int idx[8] = {  0,   1,   2,   3,   4,   5,   6,   7 };
    const char ta[8] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };

    thrust::device_vector<int> d_pa(pa, pa + 8),
                               d_idx(idx, idx + 8),
                               d_ta(ta, ta + 8),
                               output(8, -1);

    thrust::sort_by_key(d_pa.begin(),
                        d_pa.end(),
                        d_idx.begin());

    /*thrust::scatter(d_ta.begin(),
                    d_ta.end(),
                    d_idx.begin(),
                    output.begin());*/

    scatter<<<bpg, tpb>>>(8, thrust::raw_pointer_cast(d_ta.data()), thrust::raw_pointer_cast(d_idx.data()), thrust::raw_pointer_cast(output.data()));

    cudaDeviceSynchronize();

    for (int i = 0; i < 8; ++i)
    {
        std::cout << ((char ) output[i]) << std::endl;
    }

    return 0;
}

Robert_Crovella · May 5, 2015, 4:57pm

Your kernel does not match the scatter operation provided by thrust.

Your kernel is doing this:

output[i] = input[indices[i]];

The thrust scatter operation is:

output[indices[i]] = input[i];

They are not the same. (Take a look at the last line of my previous post.)

If you rewrite your kernel as follows:

__global__
void scatter
(
const int n,
const int * __restrict__ input,
const int * __restrict__ indices,
int * __restrict__ output
)
{
    for (int tid = currThreadID(); tid < n; tid += grid_size())
    {
        const int idx = indices[tid];
        const int tmp = input[tid];
        output[idx] = tmp;
    }
}

I think you’ll get results that match betwen your kernel and thrust.

Note that this realization:

output[i] = input[indices[i]];

is the thrust::gather operation:

https://thrust.github.io/doc/group__gathering.html#ga6fdb1fe3ff0d9ce01f41a72fa94c56df

MutantJohn · May 5, 2015, 5:45pm

Interesting. Then why is my thrust::scatter operation giving wrong results? Why does gathering seem to work then?

Robert_Crovella · May 5, 2015, 7:09pm

Either you are not understanding what I said, or I am not understanding what you said. I interpreted this excerpt from your comment:

to mean that those were the results you were getting from thrust::scatter. (they are certainly not the results that your kernel produced.) Those are the correct results. If you modify your kernel to match what I indicated, then your kernel will produce that same output.

MutantJohn · May 5, 2015, 10:05pm

Am I doing my sort_by_key correctly? I swear to God I’m doing it right on paper but every time I use thrust::scatter, I get weird results.

Have you been able to successfully make the sorting work using thrust::scatter? I know you posted that small example but why is it not working for my dataset?

The sorted characters should be :

d g e b c f h a

Robert_Crovella · May 6, 2015, 2:49am

our examples are different. The first line of my first response was “let’s start with unsorted data”.

My data order “matched” my key order:

int keys = { 107, 103, 104, 100 };
char data = { 'h', 'd', 'e', 'a' };

Your data order does not:

pa  : 7, 3, 4, 0, 2, 5, 1, 6
ta  : a, b, c, d, e, f, g, h

In my case, I consider the sort to be possibly “useful”.

In your case it is not useful - you already have the identified order that you want the data to appear in. That does not require a sort, and it can be done with a simple indexed copy operation (thrust::copy with a permutation iterator, if you wish. Or a trivial CUDA kernel.)

MutantJohn · May 6, 2015, 3:36pm

Whoa, permutation iterators look awesome! There’s so many amazing things in thrust!

Robert_Crovella · May 6, 2015, 4:01pm

Nevertheless, I don’t feel my explanation is satisfactory or compelling. If time permits, I will revisit and attempt to write up something that is clear and concise. I agree that your case and mine don’t seem to jibe, and I believe the difference is in the formulation of the problem/question. The behavior of the thrust functions is quite clear however, at least in my opinion. If you find that gather does what you want, then use it.

Topic		Replies	Views
How to efficiently sort 5 arrays of integers? CUDA Programming and Performance	7	1162	June 19, 2015
Thrust::gather for input and output sequences that coincide CUDA Programming and Performance	6	1029	November 16, 2020
Implementation of Thrust Sort By Key for Multi-values (CUDA Fortran & CUDA C) CUDA Programming and Performance cuda	3	927	February 22, 2023
Sorting in CUDA Is sorting in CUDA worth the trouble? CUDA Programming and Performance	15	8069	September 30, 2009
sort a matrix row wise thrust sort is slower on gpu than its cpu counterpart CUDA Programming and Performance	4	1909	February 4, 2012
Inconsistent results for reduction, except while printf or cudamemcheck CUDA Programming and Performance	29	2394	September 13, 2016
Is it possible to somehow fuse thrust::partition and thrust::sort? CUDA Programming and Performance	6	1615	September 11, 2016
CUDA 8 - Thrust bug(?) CUDA Programming and Performance	16	4126	October 2, 2016
Thrust sort Need some help CUDA Programming and Performance	7	7252	October 21, 2009
Passing thurst vector into kernel and pushing data into vector CUDA Programming and Performance	8	7848	January 2, 2018

Need help understanding thrust::scatter!

Related topics