How to prevent excessive memory reads when using a zip iterator (Thrust)?

I have plenty of code like this:

#include <thrust/system/cuda/vector.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>

#include <cstddef>

using T = float;                 // float in the case discussed below
using size_type = std::size_t;

size_type size = 100;
thrust::cuda::vector< T > a(size), b(size), c(size), result(size);
thrust::cuda::vector< int > selector(size);
// a, b, c and selector are filled with something meaningful here
auto it = thrust::make_zip_iterator(thrust::make_tuple(a.cbegin(), b.cbegin(), c.cbegin()));
using arg_type = typename decltype(it)::value_type;
auto modify = [] __host__ __device__ (arg_type arg, int selector) -> T
{
    switch (selector) {
    case 0 : {
        return thrust::get< 0 >(arg);
    }
    case 1 : {
        return thrust::get< 1 >(arg);
    }
    default : {
        return thrust::get< 2 >(arg);
    }
    }
};
thrust::transform(thrust::cuda::par, it, thrust::next(it, size), selector.cbegin(), result.begin(), modify);

There are even much more complex combinations of tuples of tuples of iterators, resulting from applying the zip and permutation iterators' factory functions. This is not a misuse of the zip iterator in my case: the AoS → SoA transition is necessary here, because all the resulting arrays are eventually used with other ones, in diverse combinations, repeatedly.
When I print arg_type I see that it is just a tuple of values (maybe a tuple of tuples of values of tuples of values etc. in the real case), not a tuple of references at the very end. This means to me that, with the above modify functor, I waste triple the memory bandwidth unconditionally, doesn't it? I mean, arg receives all three float values before the functor body executes, for every index, with no dependence on the selector value. Is that true?
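To make this suspicion concrete at the type level, here is a minimal check (a sketch; it assumes Thrust's usual zip_iterator member typedefs, and <type_traits> for std::is_same):

#include <type_traits>

using it_type = decltype(it);
// value_type really is a plain tuple of values...
static_assert(std::is_same< typename it_type::value_type,
                            thrust::tuple< float, float, float > >::value, "");
// ...while dereferencing yields a distinct proxy type that holds references:
static_assert(!std::is_same< typename it_type::value_type,
                             decltype(*it) >::value, "");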
For the above case std::is_same< arg_type, thrust::tuple< float, float, float > > holds. I could stop using the handy decltype operator and specify the exact type needed manually instead: using arg_type = thrust::tuple< const float &, const float &, const float & >;. But for the complex types mentioned above this is a tedious way. I could also invent a metafunction, using templates, that infers the desired type in a recursive manner (see the sketch below), but that seems an overly complex way.
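For illustration, such a metafunction could look like the following sketch. It assumes a variadic thrust::tuple (as in recent Thrust versions); the old fixed-arity tuple padded with thrust::null_type would need extra handling, and the name to_cref is my own:

template< typename T >
struct to_cref { using type = const T &; };          // scalar: add const &

template< typename... Ts >
struct to_cref< thrust::tuple< Ts... > >             // tuple: recurse elementwise
{
    using type = thrust::tuple< typename to_cref< Ts >::type... >;
};

// intended usage:
// using arg_type = typename to_cref< typename decltype(it)::value_type >::type;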
Are my suspicions about the bandwidth waste correct? Or, vice versa, is it possible that kernel fusion (inlining? application of the SSA-form step?) for templated __global__ functions is so good that not a single excessive memory access takes place?