combining thrust::{zip_iterator,transform_iterator,counting_iterator} for modified summed area table

I’m trying to compute a modified summed area table, but I’m struggling to understand how to combine these iterators to do it. The full summed area table example I’m starting from is here: thrust/summed_area_table.cu at master · NVIDIA/thrust · GitHub

I tried my best to include only the relevant details; apologies for the code dump! I’m just not sure how to combine all three of these iterators. To start, here is the original iterator we’re working with:

// Small modification for my code base: the dimensions are known at compile time.
template <class Projection>
struct row_index : public thrust::unary_function<unsigned, unsigned> {
    static constexpr unsigned cols = Projection::Depth::Width;

    // Maps a flat element index to the row it belongs to.
    __host__ __device__
    unsigned operator()(unsigned i) const {
        return i / cols;
    }
};

// ... in some method below ...

static constexpr unsigned cols = Projection::Depth::Width;
static constexpr unsigned rows = Projection::Depth::Height;

thrust::counting_iterator<unsigned> indices(0);

thrust::inclusive_scan_by_key(
    thrust::make_transform_iterator(indices, row_index<Projection>()),
    thrust::make_transform_iterator(indices, row_index<Projection>()) + (rows * cols),
    // ...

But what I actually need to do here is zip in an extra input to compute a modified summed area table (specifically, the “second order derivative”). Sparing the gory details, my input comes in as an array of float4, call it positions, where only XYZ matter in this case; the W coordinate is used elsewhere. The second-order terms are the six unique products of X, Y, and Z, so they are stored in a float6 (with members a through f, as used below).

I wrote a simple extrapolation kernel to “upcast” the input float4 array to a float6 array to actually do the scanning / transposing with, and that works fine. But I still need the original points to compute the sum, so I thought a zip iterator of some kind would be the way to do this. I’m trying to achieve the following binary operator:

using f4x6 = thrust::tuple<float4, float6>;

struct derive : public thrust::binary_function<f4x6, f4x6, float6> {
    __host__ __device__
    float6 operator()(const f4x6 &x, const f4x6 &y) const {
        // Start from the two partial float6 sums.
        float6 ret = thrust::get<1>(x) + thrust::get<1>(y);
        // This may need to be thrust::get<0>(x), but I need to make the
        // iterator first...
        float4 y4  = thrust::get<0>(y);
        ret.a += y4.x * y4.x;
        ret.b += y4.x * y4.y;
        ret.c += y4.x * y4.z;
        ret.d += y4.y * y4.y;
        ret.e += y4.y * y4.z;
        ret.f += y4.z * y4.z;
        return ret;
    }
};
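
A possible caveat with the operator above: Thrust's scans expect an associative operator whose result can be fed back in as an input, while derive takes two tuples and returns a bare float6, and adding the products of y on every combine step does not look associative. A minimal sketch of one alternative follows, assuming the quantity wanted is simply the per-row running sum of the six products (an assumption, not confirmed here). The float6 struct, row_of, second_order, and scan_rows are hypothetical stand-ins, and cols is a runtime value only to keep the sketch short:

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/scan.h>
#include <vector>

// Hypothetical stand-in for the float6 type from the real code base.
struct float6 { float a, b, c, d, e, f; };

__host__ __device__ inline float6 operator+(const float6 &l, const float6 &r) {
    return float6{l.a + r.a, l.b + r.b, l.c + r.c,
                  l.d + r.d, l.e + r.e, l.f + r.f};
}

// Hypothetical runtime version of row_index: flat index -> row.
struct row_of {
    unsigned cols;
    __host__ __device__ unsigned operator()(unsigned i) const { return i / cols; }
};

// Expands one position into its six second-order products (W is ignored).
struct second_order {
    __host__ __device__ float6 operator()(const float4 &p) const {
        return float6{p.x * p.x, p.x * p.y, p.x * p.z,
                      p.y * p.y, p.y * p.z, p.z * p.z};
    }
};

// Row-wise inclusive scan of the products; no zip iterator is needed because
// the products are generated on the fly from the original float4 input.
inline std::vector<float6> scan_rows(const std::vector<float4> &host_positions,
                                     unsigned cols) {
    thrust::device_vector<float4> positions(host_positions.begin(),
                                            host_positions.end());
    thrust::device_vector<float6> sums(positions.size());

    // Keys: the row of each element, so the scan restarts at every new row.
    auto keys = thrust::make_transform_iterator(
        thrust::counting_iterator<unsigned>(0), row_of{cols});
    // Values: the products, computed lazily when the scan reads them.
    auto values = thrust::make_transform_iterator(positions.begin(),
                                                  second_order());

    thrust::inclusive_scan_by_key(keys, keys + positions.size(), values,
                                  sums.begin(),
                                  thrust::equal_to<unsigned>(),
                                  thrust::plus<float6>());

    thrust::host_vector<float6> h = sums;
    return std::vector<float6>(h.begin(), h.end());
}
```

Here the keys and the values are separate arguments to inclusive_scan_by_key, so the counting iterator never has to be merged into the value iterator, and plain addition on float6 keeps the operator associative.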

I’ve been staring at the awesome answers here: thrust::exclusive_scan with thrust::zip_iterator? - CUDA Programming and Performance - NVIDIA Developer Forums

but I can’t work out how to build the iterators, in particular how to create a transform iterator from both a zip iterator and a counting iterator. Something like:

thrust::make_transform_iterator(
    thrust::make_zip_iterator(
        thrust::make_tuple(float4_src, float6_src) // device pointers
    ) [ ? ] "AND indices"

I clearly don’t get how (or if) this can be done; nothing I’ve tried that seems sensible will compile.
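
One way the placeholder above might be filled in is to put the counting iterator inside the zipped tuple as a third element, so the transform functor receives the float4, the float6, and the flat index together. The sketch below is only an illustration under assumptions: float6, combine_at, and combine_with_indices are hypothetical names, the arithmetic in the functor is arbitrary, and device_vector iterators stand in for the raw device pointers:

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <vector>

// Hypothetical stand-in for the float6 type from the real code base.
struct float6 { float a, b, c, d, e, f; };

// Hypothetical functor: sees (position, float6 value, flat index) per element.
struct combine_at {
    unsigned cols;
    __host__ __device__
    float operator()(const thrust::tuple<float4, float6, unsigned> &t) const {
        const float4   p   = thrust::get<0>(t);
        const float6   s   = thrust::get<1>(t);
        const unsigned row = thrust::get<2>(t) / cols;
        // Arbitrary arithmetic, just to show all three inputs are reachable.
        return p.x + s.a + 100.0f * static_cast<float>(row);
    }
};

inline std::vector<float> combine_with_indices(
        const std::vector<float4> &host_pos,
        const std::vector<float6> &host_sums,
        unsigned cols) {
    thrust::device_vector<float4> pos(host_pos.begin(), host_pos.end());
    thrust::device_vector<float6> sums(host_sums.begin(), host_sums.end());
    thrust::device_vector<float>  out(pos.size());

    thrust::counting_iterator<unsigned> indices(0);

    // The index rides along as the third component of the zipped tuple.
    auto zipped = thrust::make_zip_iterator(
        thrust::make_tuple(pos.begin(), sums.begin(), indices));
    auto first = thrust::make_transform_iterator(zipped, combine_at{cols});

    // Any algorithm can consume [first, first + n); copy just materializes it.
    thrust::copy(first, first + pos.size(), out.begin());

    thrust::host_vector<float> h = out;
    return std::vector<float>(h.begin(), h.end());
}
```

For example, with x values 1, 2, 3, 4, float6 a values 10, 20, 30, 40, and cols = 2, this sketch should return 11, 22, 133, 144.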

Any thoughts / suggestions? I just need to access the original input while building out the sum. Thanks for any advice!