I understood from “CUDA Techniques to Maximize Memory Bandwidth and Hide Latency [S72683]” that thrust::transform() internally uses advanced low-level functions like __pipeline_memcpy_async() to get a performance boost.
But I don’t get the clever way to use it if I have a multiple-argument kernel, let’s say some
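For concreteness, the multi-argument case I have in mind looks roughly like this: two input ranges zipped together, with the binary functor wrapped in cuda::proclaim_copyable_arguments. This is only a sketch under the assumption that the wrapper lives in CCCL's <cuda/functional> header; it needs nvcc with --extended-lambda:

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/zip_iterator.h>
#include <cuda/functional>  // cuda::proclaim_copyable_arguments (CCCL, assumed header)

int main() {
    thrust::device_vector<float> a(1 << 20, 1.0f);
    thrust::device_vector<float> b(1 << 20, 2.0f);
    thrust::device_vector<float> out(1 << 20);

    auto first = thrust::make_zip_iterator(a.begin(), b.begin());
    auto last  = thrust::make_zip_iterator(a.end(), b.end());

    // The question: does the prefetching fast path still kick in when the
    // arguments arrive as a zipped tuple rather than a single plain range?
    thrust::transform(first, last, out.begin(),
        cuda::proclaim_copyable_arguments(
            [] __device__ (thrust::tuple<float, float> t) {
                return thrust::get<0>(t) + thrust::get<1>(t);
            }));
}
```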
So, thrust::zip indeed does not seem to allow automatic pipelined prefetching.
I wonder if it is because proclaim_copyable_arguments is only a hint about memory loading, but does not tell anything about aliasing.
Anyway, automatic prefetching for multiple inputs seems like a difficult optimization for the compiler, considering the instruction dependencies and latencies it would have to analyze before determining the best schedule (it would be great, though).
Simply speaking, proclaim_copyable_arguments specifies that the inputs to the transform function do not need to come directly from global memory, but may also come from e.g. shared memory. This allows asynchronous copies from global memory to shared memory to overlap with applying the transformation function to other data (served from shared memory).
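The mechanism this enables can be sketched as a hand-written, double-buffered kernel: stage a tile from global to shared memory with __pipeline_memcpy_async(), and overlap the copy of the next tile with the computation on the current one. This is a simplified illustration, not Thrust's actual implementation; it assumes a launch with TILE threads per block and an input length that is a multiple of TILE:

```cpp
#include <cuda_pipeline.h>  // __pipeline_memcpy_async & friends (CUDA 11+)

constexpr int TILE = 256;  // one element per thread; launch with TILE threads

// While tile k is transformed out of shared memory, tile k+1 is already on
// its way in. Each thread stages and reads only its own element, so no
// __syncthreads() is needed in this simplified version.
__global__ void staged_transform(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int num_tiles) {
    __shared__ float buf[2][TILE];
    const int t = threadIdx.x;

    // Prefetch this block's first tile.
    __pipeline_memcpy_async(&buf[0][t], &in[blockIdx.x * TILE + t],
                            sizeof(float));
    __pipeline_commit();

    for (int k = blockIdx.x, stage = 0; k < num_tiles;
         k += gridDim.x, stage ^= 1) {
        const int next = k + gridDim.x;
        if (next < num_tiles) {
            // Start copying the next tile before touching the current one.
            __pipeline_memcpy_async(&buf[stage ^ 1][t],
                                    &in[next * TILE + t], sizeof(float));
            __pipeline_commit();
        }
        // Wait until the current tile's copy has landed; the next tile's
        // copy, if any, may still be in flight.
        __pipeline_wait_prior(next < num_tiles ? 1 : 0);
        out[k * TILE + t] = 2.0f * buf[stage][t];  // the "transform"
    }
}
```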
Without proclaim_copyable_arguments, the data could still be prefetched from global memory to L2 cache.
However, the low-level instructions in both cases require raw pointer addresses, so this does not work for arbitrary iterators.
This. Historical evidence regarding compiler- (or even human-) initiated software prefetching indicates that it is a losing game when attempted across multiple generations of processors, or even just across the low and high end of the same architecture generation (with different memory subsystems). Hardware-initiated prefetching, ideally with support for some simple common access patterns, is the way to go.