How to do strided reads of data for Tensor Cores?

I have data in global memory and I want to read it with a stride, like:
a[0], a[3], a[6], a[9], …

But the only stride I can find in load_matrix_sync or in the PTX wmma instructions is the stride between rows (the leading dimension), not a stride between consecutive elements. What should I do? Thanks!!!

https://docs.nvidia.com/cuda/archive/10.0/parallel-thread-execution/index.html#warp-level-matrix-storage

You can compact the strided elements into a contiguous tile in shared memory first, then call load_matrix_sync on that shared memory.
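A minimal sketch of that idea, under some assumptions not stated in the thread: a single warp, one 16x16x16 half-precision tile, an element stride of 3, and a GPU of compute capability 7.0 or newer. The names `STRIDE`, `strided_wmma`, and `a_tile` are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int STRIDE = 3;              // assumed: read a[0], a[3], a[6], ...
constexpr int M = 16, N = 16, K = 16;  // assumed tile shape

// Hypothetical kernel; launch with exactly one warp (32 threads).
__global__ void strided_wmma(const half* a, const half* b, float* c)
{
    // Contiguous staging buffer; load_matrix_sync needs 256-bit alignment.
    __shared__ __align__(32) half a_tile[M * K];

    // Gather the strided elements from global memory into the compact tile.
    for (int i = threadIdx.x; i < M * K; i += blockDim.x) {
        a_tile[i] = a[i * STRIDE];
    }
    __syncthreads();  // shared-memory writes must be visible before the load

    wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a_tile, K);  // ldm of the compacted tile
    wmma::load_matrix_sync(b_frag, b, K);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, N, wmma::mem_row_major);
}

int main()
{
    half *a, *b;
    float *c;
    cudaMallocManaged(&a, M * K * STRIDE * sizeof(half));
    cudaMallocManaged(&b, K * N * sizeof(half));
    cudaMallocManaged(&c, M * N * sizeof(float));

    // Only the elements that should be gathered are 1; the rest are 0,
    // so a wrong gather pattern shows up in the result.
    for (int i = 0; i < M * K * STRIDE; ++i)
        a[i] = __float2half(i % STRIDE == 0 ? 1.0f : 0.0f);
    for (int i = 0; i < K * N; ++i)
        b[i] = __float2half(1.0f);

    strided_wmma<<<1, 32>>>(a, b, c);
    cudaDeviceSynchronize();

    // With all-ones inputs, every output should equal K.
    bool ok = true;
    for (int i = 0; i < M * N; ++i)
        ok = ok && (c[i] == static_cast<float>(K));
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(a); cudaFree(b); cudaFree(c);
    return ok ? 0 : 1;
}
```

This should build with something like `nvcc -arch=sm_70`. The gather loop still issues strided global reads, so it does not recover the wasted memory bandwidth; it only puts the data in a layout that load_matrix_sync accepts.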


Well…I think so… No other choice… Thanks!!!