Many algorithms can run faster if the input is transposed ahead of time, thanks to better cache locality. But doing the transpose is very often not time-efficient:
the speedup gained from operating on transposed input is smaller than the time cost of performing the transpose itself.
A concrete example is the batched tridiagonal solvers in the cuSPARSE API: gtsvInterleavedBatch is probably faster than gtsv2StridedBatch in many cases, but it requires you to interleave (i.e. transpose) the inputs. Since tridiagonal solving itself does not take much time, I would assume it is rarely time-efficient to do the interleaving: you end up paying the cost of two memory-bound kernels in sequence.
But what if we need to go over the data regardless? E.g. if we need to calculate the average value of an array, do a vector-scalar multiplication, a type conversion, or some other memory-bound operation.
Then, if the transpose is done at the same time as that operation, in a kernel fusion, I assume this will almost always be the superior solution, right?
I mean, if you have to go over the data anyway, why not do the transposing while you're already there, if possible?
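To make the idea concrete, here is a minimal sketch of such a fusion: an int16-to-float conversion folded into a standard shared-memory tiled transpose. The kernel name, tile size, and dimensions are my own illustrative choices, not from any library; the point is that the conversion rides along for free on the loads and stores the transpose needs anyway.

```cuda
// Sketch: fuse an elementwise conversion (int16 -> float) with a transpose.
// Reads a rows x cols matrix of int16, writes a cols x rows matrix of float.
#define TILE 32

__global__ void convertTranspose(const short *in, float *out,
                                 int rows, int cols)
{
    // +1 padding avoids shared-memory bank conflicts on the transposed reads
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;  // row in the input

    // coalesced load, with the type conversion fused in
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = (float)in[y * cols + x];

    __syncthreads();

    // swap block indices so the global stores stay coalesced
    x = blockIdx.y * TILE + threadIdx.x;      // column in the output
    y = blockIdx.x * TILE + threadIdx.y;      // row in the output

    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}
```

Both the plain transpose and this fused version move the same number of bytes through global memory, so the conversion costs essentially nothing extra.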
I don’t think this is a question easily answered. In general, explicit transposition of matrices should be avoided. Faced with a situation where one “must” transpose, it sounds plausible that minimizing overall data movement by incorporating the transposition into some other processing of the data would be beneficial, but the devil may be in the details.
Running an actual experiment instead of engaging in a gedankenexperiment will provide an answer specific to your use case.
So I need to do: 1. conversion from int16 to half/float, 2. subtraction of the mean value, 3. “same”-size convolution, 4. beamforming.
The beamforming takes data of shape 252000x4, representing 4 batches of data. But if I can transpose the data to 4x252000, all four elements of a batch can be read in a single LDG operation. I was mainly thinking the convolution could be combined with the transposition.
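Assuming the transpose leaves the 4 values belonging to one sample adjacent in memory (my reading of the layout above, which may not match your actual arrangement), the single-LDG access could look like this sketch. The kernel, the weights, and the weighted-sum "beamformer" are placeholders for whatever your real beamforming does:

```cuda
// Sketch: if the 4 batch values for each sample are contiguous,
// one 128-bit vectorized load (a single LDG) fetches all of them.
__global__ void beamformSketch(const float4 *in,  // packed 4-wide samples
                               float *out, int n,
                               float4 weights)    // hypothetical per-batch weights
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = in[i];  // single LDG for all 4 batch elements
        // toy stand-in for beamforming: weighted sum across the 4 batches
        out[i] = v.x * weights.x + v.y * weights.y
               + v.z * weights.z + v.w * weights.w;
    }
}
```

This only works if the packed pointer is 16-byte aligned, which cudaMalloc guarantees for the base allocation.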
I wanted to hear whether you have some intuition about this, since it will take considerable time to implement.
Yesterday’s GTC talk about the CUTLASS library discussed fusing an arithmetic kernel with an output-value permutation to avoid the read/write overhead. They also benchmarked very impressive gains in an example case.