Loading incorrect data from pinned memory when using PTX `st` instruction

Hi there,
I am building an application that streams data from the host to the GPU through a fixed-size, pinned (page-locked, mapped) host memory buffer.

I partitioned the fixed-size pinned buffer into multiple slots. Once the CUDA kernel finds there is data to consume, it moves that data from the pinned memory to GPU memory. This pipelines the host-to-device transfer with data production on the host (e.g., data arriving from the network).

The problem is that my CUDA kernel fails to load the updated data from the pinned memory: when the host memory is updated, the kernel still loads the old data.
For example, if the pinned memory is partitioned into 4 slots and each slot holds at most 1 float, then loading 8 floats to the GPU takes 8 steps, since in each step the GPU moves 1 float from 1 slot. The host program progressively copies 1 float into a slot whenever one is free; if there are no free slots, the host program waits.

If the host copies 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, the GPU is supposed to read exactly the same values. But it actually reads 1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0 from the pinned buffer.

The full test code is available here.
The final output of the program indicates that the data on the GPU does not match the source data on the host.

Because mapped pinned memory is directly accessible from the device, I store the data to the destination address with the following directStore128 function.

inline __device__ void directStore128(Pack128* p, const Pack128* v) {
  asm volatile("st.volatile.global.v2.u64 [%0], {%1,%2};" :: "l"(p), "l"(v->x), "l"(v->y) : "memory");
}

Other notes: for the kernel to know there is data on the host to move, it checks whether head - tail < N_SLOT, where head and tail are two counters shared between host and GPU. When the host posts data into a buffer slot, it increments tail; when the GPU has moved the data of a buffer slot, it increments head.
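For reference, the load side would need a matching volatile load, so that the compiler cannot reuse a previously loaded value instead of re-reading the host-updated pinned buffer. A sketch of such a counterpart (directLoad128 is a hypothetical name, mirroring directStore128 above):

```cuda
// Hypothetical 128-bit volatile load, the mirror of directStore128.
// The .volatile qualifier forces an actual memory access on every call,
// preventing the compiler from caching a stale value in registers.
inline __device__ void directLoad128(Pack128* v, const Pack128* p) {
  asm volatile("ld.volatile.global.v2.u64 {%0,%1}, [%2];"
               : "=l"(v->x), "=l"(v->y)
               : "l"(p)
               : "memory");
}
```

The same concern applies to the shared head/tail counters: reads of them on the device would also need to be volatile (or atomic) to observe the host's updates.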

The issue is verified on both “T4” and “V100” GPUs.

Any suggestions?