How is thrust::gather actually arranged? Why is it so different from thrust::transform, which allows the input and output sequences to coincide?
All of my comments here are predicated on “when thrust is using the CUDA backend”.
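The implication of thrust::transform when input and output sequences coincide is that there is a 1:1 correspondence between input and output: each output element depends only on the input element at the same position. That is not the case when using thrust::gather or a permutation iterator in the same situation, because an output element may be read from any input position.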
In CUDA, there is no specification of the order of thread execution.
Let’s take a simple example.
Suppose I have an input sequence of 2 elements, and suppose that I simply want to add 1 to each element. With thrust::transform, it is legal to have the output coincide with the input in this case. Even though the order of thread execution is unspecified, it does not matter: the read of an element happens before the write of that element, because both are handled by a single CUDA thread, and the implication of thrust::transform is that no other element’s processing depends on that element. So the order of execution of CUDA threads does not matter.
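As a minimal sketch of the in-place case described above (the helper name add_one_in_place is made up for illustration, and int elements are an assumption), the output iterator is simply the input's own begin():

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform.h>

// Hypothetical helper: adds 1 to every element, writing over the input.
// Output element i depends only on input element i, so letting the
// output range coincide with the input range is safe here.
void add_one_in_place(thrust::device_vector<int>& v) {
    using namespace thrust::placeholders;
    thrust::transform(v.begin(), v.end(), v.begin(), _1 + 1);
}
```

Calling it on a device_vector holding {5, 7} should leave {6, 8}, regardless of which CUDA thread happens to run first.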
Now suppose instead that we want to reverse the elements. We might do this with thrust::gather or with a permutation iterator. Neither method is guaranteed to work correctly when the output and input coincide. Suppose the thread that copies x1 into the second position executes first:
input: x1 x2
output: x1 x1
Since the output is the same storage as the input, x2 has already been overwritten. When the other thread executes sometime later, it reads x1 from the second position, so the result will still be:
output: x1 x1
whereas we desire:
output: x2 x1
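The hazard can be reproduced on the CPU with a small sequential simulation. This is plain C++, not Thrust, and only a sketch: the function names are made up, and it assumes that simulated thread i performs data[i] = data[map[i]], with `order` standing in for the execution order that CUDA leaves unspecified.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place variant: every simulated thread reads from and writes to the
// same storage, so a later thread may read a value that an earlier
// thread has already overwritten.
std::vector<int> gather_in_place(std::vector<int> data,
                                 const std::vector<std::size_t>& map,
                                 const std::vector<std::size_t>& order) {
    assert(map.size() == data.size());
    for (std::size_t tid : order) {
        data[tid] = data[map[tid]];
    }
    return data;
}

// Out-of-place variant: reads come from the untouched input and writes
// go to a separate buffer, so the execution order cannot change the result.
std::vector<int> gather_to_copy(const std::vector<int>& data,
                                const std::vector<std::size_t>& map,
                                const std::vector<std::size_t>& order) {
    assert(map.size() == data.size());
    std::vector<int> out(data.size());
    for (std::size_t tid : order) {
        out[tid] = data[map[tid]];
    }
    return out;
}
```

With data = {1, 2} standing for x1 x2 and map = {1, 0}, the in-place version yields {1, 1} for order {1, 0} and {2, 2} for order {0, 1}; one value is lost either way. The out-of-place version yields {2, 1} for both orders.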
Thrust doesn’t sort this out for you with any of the methods you suggest.
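One common way to stay within the requirements is to gather into a separate buffer and then swap it in. The following is only a sketch under assumptions: the helper name permute_via_temporary is made up, the elements are ints, and map is assumed to hold valid indices into data.

```cuda
#include <cassert>
#include <thrust/device_vector.h>
#include <thrust/gather.h>

// Hypothetical helper: afterwards data[i] holds the old data[map[i]].
// The gather writes into a temporary, so no thread can overwrite an
// element that another thread still needs to read.
void permute_via_temporary(const thrust::device_vector<int>& map,
                           thrust::device_vector<int>& data) {
    assert(map.size() == data.size());
    thrust::device_vector<int> tmp(data.size());
    thrust::gather(map.begin(), map.end(), data.begin(), tmp.begin());
    data.swap(tmp);  // exchange the vectors' contents; tmp is freed on return
}
```

With map = {1, 0} and data = {10, 20}, data should end up as {20, 10}. The cost is one extra allocation the size of the input.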
Currently I have to rewrite the following code:
That won’t fix it.
Can you (Thrust devs) ease the requirements for the algorithms?
The right way to make a request of thrust devs is to file a thrust issue.