BUG: CUDA Programming Guide memcpy_async pipeline example is incorrect

The example in section “Tracking Asynchronous Memory Operations” (https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/pipelines.html#tracking-asynchronous-memory-operations) shows:

cuda::memcpy_async(buffer, in, sizeof(float), pipeline);

The guide claims that “every thread fetches one element”, but all threads pass the same buffer and in pointers. That would only be true if memcpy_async applied an implicit per-thread offset, which it does not.

Actual behavior: All threads copy the same float to the same shared memory buffer position. The code “works” only because they all write the same value.

Users who follow this pattern to stage per-thread data in shared memory get incorrect results.
Each thread needs an explicit offset on both pointers, e.g. buffer + threadIdx.x and in + threadIdx.x.
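
For reference, here is a minimal sketch of what I would expect a corrected version to look like. It is not taken from the guide: the kernel name copy_per_thread, the out array, the launch configuration, and the host-side check are all my own assumptions, added only so the snippet is self-contained. It assumes a toolkit that ships libcu++ (<cuda/pipeline>) and a GPU of compute capability 7.0 or newer.

```cuda
#include <cuda/pipeline>
#include <cstdio>

// Hypothetical kernel: each thread stages exactly one float in shared memory.
__global__ void copy_per_thread(const float* in, float* out)
{
    extern __shared__ float buffer[];  // one float per thread in the block

    const unsigned t = threadIdx.x;                    // index within the block
    const unsigned g = blockIdx.x * blockDim.x + t;    // index within the grid

    // Thread-scoped pipeline: every thread tracks its own copies.
    cuda::pipeline<cuda::thread_scope_thread> pipeline = cuda::make_pipeline();

    pipeline.producer_acquire();
    // Explicit per-thread offsets on both destination and source.
    cuda::memcpy_async(buffer + t, in + g, sizeof(float), pipeline);
    pipeline.producer_commit();

    // Wait for this thread's own copy. A __syncthreads() would be needed
    // before reading elements staged by other threads; here each thread
    // reads only the element it copied itself.
    pipeline.consumer_wait();
    out[g] = buffer[t];
    pipeline.consumer_release();
}

int main()
{
    const int threads = 128, blocks = 2, n = threads * blocks;  // assumed sizes
    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) return 1;
    for (int i = 0; i < n; ++i) { in[i] = float(i); out[i] = -1.0f; }

    copy_per_thread<<<blocks, threads, threads * sizeof(float)>>>(in, out);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    // Every output element should match the corresponding input element.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (out[i] != in[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(in);
    cudaFree(out);
    return mismatches != 0;
}
```

If the two offsets are removed so that every thread passes the bare buffer and in pointers, as in the guide's snippet, only the first shared memory slot is ever written, which is the problem described above.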

There are several typos and mistakes left in the reworked programming guide. You can report them here: How to report a bug