How to do a strided read of data for Tensor Cores?

I have data in global memory and I want to read it with a stride, like:
a[0], a[3], a[6], a[9]…

But the only stride I can find in load_matrix_sync or in the PTX wmma instructions is the stride between rows (the leading dimension), not between individual elements. What should I do? Thanks!!!

You can compact the elements into shared memory first, then call load_matrix_sync on that shared memory.
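
A rough sketch of that idea, in case it helps. This is only an illustration under some assumptions that are not from your post: half-precision inputs, a single 16x16x16 tile, an element stride of 3, one warp per block, and a GPU with compute capability 7.0 or newer. The names strided_tile_matmul, a_tile, TILE and ELEM_STRIDE are made up for the example.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int TILE = 16;        // assumed wmma tile size (16x16x16, half inputs)
constexpr int ELEM_STRIDE = 3;  // assumed stride: read a[0], a[3], a[6], ...

// Hypothetical kernel, meant to be launched with one warp (32 threads).
// It gathers 256 strided elements of `a` into a dense 16x16 tile in shared
// memory, loads that tile as the A fragment, and computes C = A_tile * B.
__global__ void strided_tile_matmul(const half* a, const half* b, float* c) {
    // load_matrix_sync needs a 256-bit (32-byte) aligned pointer.
    __shared__ __align__(32) half a_tile[TILE * TILE];

    // Compaction step: dense element i of the tile comes from a[i * ELEM_STRIDE].
    for (int i = threadIdx.x; i < TILE * TILE; i += blockDim.x) {
        a_tile[i] = a[i * ELEM_STRIDE];
    }
    __syncthreads();  // the whole tile must be written before any fragment load

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    // The tile is dense now, so the leading dimension is simply TILE.
    wmma::load_matrix_sync(a_frag, a_tile, TILE);
    wmma::load_matrix_sync(b_frag, b, TILE);  // b: dense 16x16, row-major
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, TILE, wmma::mem_row_major);
}
```

You would launch it as strided_tile_matmul<<<1, 32>>>(a, b, c), where a holds at least 255 * 3 + 1 elements. Note that a stride of 3 means the global reads are not fully coalesced, so the gather itself costs some bandwidth.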

Well, I think so too. It seems there is no other choice. Thanks!!!