How to prevent excessive memory reads when using a zip iterator (Thrust)?

I have plenty of code like this:

#include <thrust/system/cuda/vector.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>

#include <cstddef>

using T = float;                 // float in the case discussed below
using size_type = std::size_t;

size_type size = 100;
thrust::cuda::vector< T > a(size), b(size), c(size), result(size);
thrust::cuda::vector< int > selector(size);
// a, b, c and selector are filled with something meaningful here
auto it = thrust::make_zip_iterator(thrust::make_tuple(a.cbegin(), b.cbegin(), c.cbegin()));
using arg_type = typename decltype(it)::value_type;
auto modify = [] __host__ __device__ (arg_type arg, int selector) -> T
{
    switch (selector) {
    case 0 : {
        return thrust::get< 0 >(arg);
    }
    case 1 : {
        return thrust::get< 1 >(arg);
    }
    default : {
        return thrust::get< 2 >(arg);
    }
    }
};
thrust::transform(thrust::cuda::par, it, thrust::next(it, size), selector.cbegin(), result.begin(), modify);

There are even much more complex combinations of tuples of tuples of iterators, resulting from applying the zip and permutation iterators' factory functions. This is not a misuse of the zip iterator in my case: the AoS → SoA transition is necessary here, because all the resulting arrays are eventually used with other ones, in diverse combinations, repeatedly.
When I print arg_type I see that it is just a tuple of values (maybe a tuple of tuples of values of tuples of values etc. in the real case), not a tuple of references at the very end. This means to me that, with the above modify functor, I waste triple the memory bandwidth unconditionally, doesn't it? I mean, arg receives all three float values before the functor body executes, for every index, with no dependence on the selector value. Is that true?
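To make this suspicion concrete at the type level, here is a minimal check (a sketch; it assumes Thrust's usual zip_iterator member typedefs, and <type_traits> for std::is_same):

#include <type_traits>

using it_type = decltype(it);
// value_type really is a plain tuple of values...
static_assert(std::is_same< typename it_type::value_type,
                            thrust::tuple< float, float, float > >::value, "");
// ...while dereferencing yields a distinct proxy type that holds references:
static_assert(!std::is_same< typename it_type::value_type,
                             decltype(*it) >::value, "");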
For the above case std::is_same< arg_type, thrust::tuple< float, float, float > > holds. I could stop using the handy decltype operator and specify the exact type needed manually instead: using arg_type = thrust::tuple< const float &, const float &, const float & >;. But for the complex types mentioned above this is a tedious way. I could also invent a metafunction, using templates, that infers the desired type in a recursive manner (see the sketch below), but that seems an overly complex way.
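For illustration, such a metafunction could look like the following sketch. It assumes a variadic thrust::tuple (as in recent Thrust versions); the old fixed-arity tuple padded with thrust::null_type would need extra handling, and the name to_cref is my own:

template< typename T >
struct to_cref { using type = const T &; };          // scalar: add const &

template< typename... Ts >
struct to_cref< thrust::tuple< Ts... > >             // tuple: recurse elementwise
{
    using type = thrust::tuple< typename to_cref< Ts >::type... >;
};

// intended usage:
// using arg_type = typename to_cref< typename decltype(it)::value_type >::type;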
Are my suspicions about the bandwidth waste correct? Or, vice versa, is it possible that kernel fusion (inlining? application of the SSA-form step?) for templated __global__ functions is so good that not a single excessive memory access takes place?