Thrust::transform for multiple inputs/outputs

I understood from “CUDA Techniques to Maximize Memory Bandwidth and Hide Latency [S72683]” that thrust::transform() conveniently uses advanced low-level functions such as __pipeline_memcpy_async() to benefit from the performance boost.

But I don’t see the clever way to use it for a multiple-argument kernel, let’s say something like

__device__ void f(const float* src1, const float* src2, const float* src3,
                  float* dst1, float* dst2, float* dst3)
{
  *dst1 = f1(src1, src2, src3);
  *dst2 = f2(src1, src2, src3);
  *dst3 = f3(src1, src2, src3);
}

Is there some concept/convention (based on thrust::tie?) that could be used to call thrust::transform() with f() and no overhead?

One can use thrust::zip_iterator to combine multiple ranges into a single virtual range.
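As a sketch (the functor body is a placeholder, since f1/f2 are not shown), three inputs and two outputs could be zipped like this:

```cuda
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>

// Takes a zipped 3-tuple of inputs, returns a 2-tuple of outputs.
struct f_op
{
  __host__ __device__
  thrust::tuple<float, float> operator()(const thrust::tuple<float, float, float>& t) const
  {
    const float a = thrust::get<0>(t);
    const float b = thrust::get<1>(t);
    const float c = thrust::get<2>(t);
    return thrust::make_tuple(a + b + c,   // stand-in for f1
                              a * b * c);  // stand-in for f2
  }
};

int main()
{
  const int n = 4;
  thrust::device_vector<float> src1(n, 1.f), src2(n, 2.f), src3(n, 3.f);
  thrust::device_vector<float> dst1(n), dst2(n);

  // Zip the three inputs into one virtual range, and the two outputs into another.
  auto in  = thrust::make_zip_iterator(thrust::make_tuple(src1.begin(), src2.begin(), src3.begin()));
  auto out = thrust::make_zip_iterator(thrust::make_tuple(dst1.begin(), dst2.begin()));

  thrust::transform(in, in + n, out, f_op{});
  return 0;
}
```

The transform then sees a single input range of tuples and a single output range of tuples, so no extra kernel launches or temporary buffers are needed.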

But is there any way to know whether thrust::transform() can still use __pipeline_memcpy_async() with a zip iterator (other than inspecting the generated PTX)?

Here is an example with a zip input to thrust::transform: Compiler Explorer
It does not use pipelining.

I think pipelining is only possible with raw pointer inputs.
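For reference, the raw-pointer formulation (contiguous inputs, no fancy iterators) is the case where such pipelining is at least possible, a minimal sketch:

```cuda
#include <thrust/execution_policy.h>
#include <thrust/transform.h>
#include <cuda_runtime.h>

struct square
{
  __host__ __device__ float operator()(float x) const { return x * x; }
};

int main()
{
  const int n = 1 << 20;
  float *in, *out;
  cudaMalloc(&in,  n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));

  // Raw contiguous pointers: the implementation sees plain addresses,
  // which is a prerequisite for the async-copy (pipelined) code path.
  thrust::transform(thrust::device, in, in + n, out, square{});

  cudaFree(in);
  cudaFree(out);
  return 0;
}
```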

Compiler explorer does CUDA… thanks for the tip.

So, thrust::zip_iterator indeed does not seem to allow automatic pipelined prefetching.
I wonder if that is because proclaim_copyable_arguments is only a hint about memory loading, but says nothing about aliasing.

Anyway, automatic prefetching for multiple inputs seems like a difficult task for the compiler to optimize, considering the instruction dependencies and latencies that have to be analyzed before determining the best schedule (it would be great, though).

Thrust / CCCL is developed on github. Feel free to make a feature request.
Transform implementation: cccl/cub/cub/device/dispatch/kernels/transform.cuh at main · NVIDIA/cccl · GitHub

Simply speaking, proclaim_copyable_arguments specifies that the inputs to the transform function do not need to come directly from global memory, but may also come from, e.g., shared memory. This allows the implementation to issue async copies from global memory to shared memory while applying the transformation function to other data (already served from shared memory).
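A minimal sketch of wrapping a functor with cuda::proclaim_copyable_arguments from libcu++ (assuming a recent CCCL version that ships it in <cuda/functional>):

```cuda
#include <cuda/functional>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct square
{
  __host__ __device__ float operator()(float x) const { return x * x; }
};

int main()
{
  const int n = 1 << 20;
  thrust::device_vector<float> in(n, 2.f), out(n);

  // The wrapper declares that the functor's arguments may be copies,
  // e.g. staged through shared memory rather than read directly from
  // global memory, which enables the async-copy code path on
  // architectures that support it.
  auto op = cuda::proclaim_copyable_arguments(square{});

  thrust::transform(in.begin(), in.end(), out.begin(), op);
  return 0;
}
```

Without the wrapper the call is still correct; the wrapper only grants the implementation the freedom to stage the inputs.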

Without proclaim_copyable_arguments, the data could still be prefetched from global memory to L2 cache.

However, the low-level instructions in both cases require pointer addresses, so this does not work for arbitrary iterators.

This. Historical evidence regarding compiler-initiated (or even human-initiated) prefetching of data in software indicates that it is a losing game when attempted across multiple generations of processors, or even just the low end and high end of the same architecture generation (with different memory subsystems). Hardware-initiated prefetching, ideally with support for some simple common access patterns, is the way to go.