Thrust::gather for input and output sequences that coincide

tomilovanatoliy · November 16, 2020, 10:11am

How actually thrust::gather is arranged? Why it is so differ from thrust::transform, which allows for input and output sequences to coincide? Currently I have to rewrite the following code:

thrust::gather(seq.cbegin(), seq.cend(), map.cbegin(), seq.begin());

to

auto seqBegin = thrust::make_permutation_iterator(map.cbegin(), seq.cbegin());
thrust::transform(seqBegin, thrust::next(seqBegin, seq.size()), seq.begin(), thrust::identity<typename thrust::iterator_value<decltype(seqBegin)>::type>{});

which looks like overengineering.

The same question is about thrust::copy* (in conjunction with thrust::transform_iterator/thrust::transform_output_iterator).

Can you (Thrust devs) ease the requirements for the algorithms?

Robert_Crovella · November 16, 2020, 2:00pm

How actually thrust::gather is arranged? Why it is so differ from thrust::transform , which allows for input and output sequences to coincide?

All of my comments here are predicated on “when thrust is using the CUDA backend”.

The implication of thrust::transform when input and output sequences coincide is that there is a 1:1 correspondence between input and output. This implication is not the case when using thrust::gather or a permutation iterator in the same situation.

In CUDA, there is no specification of the order of thread execution.

Let’s take a simple example.

Suppose I have an input sequence of 2 elements. Suppose that I simply desire to add 1 to each element. with thrust::transform, it is legal to have the output coincide with the input in this case. Even though there is no order of thread execution, it does not matter, because the read of an element will happen before the write of an element, because they are handled by a single CUDA thread, and the implication of thrust::transform is that no other element’s processing depends on that element. So order of execution of CUDA threads does not matter.

Now suppose instead our desire is to reverse the elements. We might do this with thrust::gather or with a permutation iterator. Neither method is guaranteed to work correctly when the output and input coincide. Suppose thread 0 executes first:

Input: x1 x2
output: x1 x1

Since the output is the same as the input, if thread 1 executes sometime later, the result will still be:

output: x1 x1

whereas we desire:

output: x2 x1

Thrust doesn’t sort this out for you, using any of the methods you suggest.

Currently I have to rewrite the following code:

That won’t fix it.

Can you (Thrust devs) ease the requirements for the algorithms?

The right way to make a request of thrust devs is to file a thrust issue.

tomilovanatoliy · November 16, 2020, 2:14pm

Thank you for such a detailed answer!

So the only reliable way is to use double buffering for all such a cases?

Your example for reverse is quite different from mine. Because each element in either transform or gather version is updated in-place absolutely separately from others. There is no any data dependency between different elements at all (if map is not overlapped with input). Elements can be processed block-wise (e.g. 32 per “chunk”) and still memory read-then-write ordering would be correct.

Another example, that actually works (again on CUDA backend) is “proxy”-sort (get positions of elements from input sequence as if they sorted):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/sort.h>
#include <thrust/advance.h>
#include <thrust/copy.h>

#include <iterator>
#include <iostream>

int main()
{
    int init[] = {2, 0, 1, 3, 4};
    const thrust::device_vector<int> v{std::cbegin(init), std::cend(init)};
    
    thrust::device_vector<std::intptr_t> index{v.size()};
    thrust::sequence(index.begin(), index.end());
    
    auto key = thrust::make_permutation_iterator(thrust::make_transform_iterator(v.cbegin(), thrust::identity<thrust::tuple<int>>{}), index.cbegin());
    thrust::sort_by_key(key, thrust::next(key, index.size()), index.begin());
    
    thrust::copy(index.cbegin(), index.cend(), std::ostream_iterator<std::intptr_t>(std::cout, ", "));
    std::cout << std::endl;
}

transform_iterator trick makes operator = for elements of v a no-op.

According to what you said it is also incorrect, but in practice it works without any problems.

Robert_Crovella · November 16, 2020, 2:34pm

EDIT: I am speaking generally in this comment. I’m not making any pronouncements that your new example is corrrect or incorrect.

If you’d like to do things that don’t adhere to thrust expectations, and are comfortable because " in practice it works without any problems", you’re welcome to do whatever you wish.

The question I was trying to answer (mainly) was “why does thrust specify these things this way”. In the general case, we can find examples of things that don’t work correctly, and we can explain why they are not expected to work.

Good luck!

tomilovanatoliy · November 16, 2020, 2:42pm

Thank you.
But it seems you didn’t understand the example. According with what you said there is not a problem actually. But you insist on UB. Good.

Maybe in the Thrust documentation for gather there is mistake in parameters naming or in Precondition statement: map and input names have to be swapped.

Robert_Crovella · November 16, 2020, 5:08pm

But it seems you didn’t understand the example

I haven’t looked at your examples at all. we started out talking about thrust::gather and using thrust::permutation iterator with thrust::transform in place of thrust gather. That is what I was responding to. You then introduced something that looks completely different to me (a sort), and tried to extend what I said to suggest I said it is incorrect. I never said that. Thrust sort uses O(N) temporary space, so it is effectively double-buffered.

I thought you were asking for insight. With respect to correctness for thrust::gather, never mind what I said. Follow the documentation:

The range [map_first, map_last) shall not overlap the range [result, result + (map_last - map_first)) .

If you’d like to see a change in that requirement for correctness, file a thrust issue.

I don’t have any comments relevant to anything else you have posted here.

tomilovanatoliy · November 16, 2020, 5:11pm

Now I understand. I assumed you read whole the question. I’ll keep in mind this usually not true.

I already made an issue. Thank you for your patience.

Example with sort is irrelevant, now I know.