Can you "hide" the cost of kernels through kernel fusion? E.g. the cost of a matrix transpose

A matrix transpose can be done at high throughput: see "An Efficient Matrix Transpose in CUDA C/C++" on the NVIDIA Technical Blog.

Many algorithms run faster if the input is transposed ahead of time, due to better cache locality. But it is very often not time-efficient to do the transpose explicitly:
essentially, the speedup from using transposed inputs is less than the time cost of producing them.
A concrete example is the batched tridiagonal solvers in the cuSPARSE API: gtsvInterleavedBatch is probably faster than gtsv2StridedBatch in many cases, but it requires you to interleave (i.e. transpose) the inputs. Since tridiagonal solving does not take much time, I assume it is rarely time-efficient to do the interleaving, because you then pay the cost of two memory-bound kernels in sequence.

But what if we need to go over the data regardless? E.g. if we need to calculate the average value of an array, do a vector-scalar multiplication, a type conversion, or some other memory-bound operation.
If that operation is done at the same time as the transpose, as a fused kernel, I assume this will almost always be the superior solution, right?

I mean if you have to go over the data, why not do the transposing when you’re already there, if possible, right?
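To make the idea concrete, here is a minimal CPU analogue of such a fused kernel (function name and layout are illustrative; a real CUDA version would use the usual tiled shared-memory transpose): a type conversion fused with the transpose, so each input element is read once and each output element written once, instead of paying a separate read+write pass per operation.

```cpp
#include <cstdint>
#include <vector>

// Illustrative CPU analogue of a fused conversion + transpose kernel.
// Input: rows x cols int16 matrix in row-major order.
// Output: cols x rows float matrix in row-major order.
// One traversal does both ops; running them as separate kernels would
// cost an extra full read and write of the array in between.
std::vector<float> convert_and_transpose(const std::vector<int16_t>& in,
                                         std::size_t rows, std::size_t cols) {
    std::vector<float> out(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            // convert and transpose in a single touch of the element
            out[c * rows + r] = static_cast<float>(in[r * cols + c]);
    return out;
}
```

The fusion saves one full round trip through memory compared with running a conversion kernel followed by a transpose kernel.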

Sure, but you have now burdened your kernel with the cost of writing data, in addition to reading it.

The (time) cost to read an array, transpose it, and write it out should be approximately twice the cost to read an array and transpose it, without writing out the transposed values.

You can calculate the average value of an array without writing any of the array elements.

So it probably depends on what else you are doing. I doubt you could say the transpose is a win, by itself.
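A sketch of the point about reductions (illustrative CPU code): computing a mean touches each element exactly once and writes nothing back, so its memory traffic is N reads; bolting a transpose onto it adds N writes, roughly doubling the traffic.

```cpp
#include <cstddef>
#include <vector>

// Read-only reduction: N element reads, zero array writes.
// A fused transpose would add N writes on top of this, so the
// transpose is not "free" even when piggybacking on another pass.
double mean(const std::vector<float>& a) {
    double sum = 0.0;
    for (float x : a) sum += x;  // reads only
    return a.empty() ? 0.0 : sum / static_cast<double>(a.size());
}
```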

I don’t think this is a question easily answered. In general, explicit transposition of matrices should be avoided. Faced with a situation where one “must” transpose, it sounds plausible that minimizing overall data movement by incorporating the transposition into some other processing of the data would be beneficial, but the devil may be in the details.

Running an actual experiment instead of engaging in a gedankenexperiment will provide an answer specific to your use case.

Haha, yeah that was a bad example.

So I need to do: 1. conversion from int16 to half/float, 2. subtraction of the mean value, 3. a “same”-size convolution, and 4. beamforming.
The beamforming takes data of shape 252000x4, representing 4 batches of data. But if I can transpose the data to 4x252000, all four elements of a batch can be read in a single LDG operation. I was mainly thinking the convolution could be combined with the transposition.
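A rough CPU sketch of what fusing steps 1–2 with the layout change could look like (illustrative names and a globally computed mean are my assumptions; the mean needs a full pass of its own, so this is two passes total instead of the four that fully unfused kernels would cost):

```cpp
#include <cstdint>
#include <vector>

// Illustrative CPU analogue: int16 -> float conversion (step 1) and
// mean subtraction (step 2) fused with the layout change from
// n_samples x n_batches to n_batches x n_samples, both row-major.
// Assumption: one global mean over all elements; a per-batch mean
// would need a per-column accumulator instead.
std::vector<float> preprocess(const std::vector<int16_t>& in,
                              std::size_t n_samples, std::size_t n_batches) {
    double sum = 0.0;
    for (int16_t v : in) sum += v;                  // pass 1: reads only
    const float mean = static_cast<float>(sum / static_cast<double>(in.size()));

    std::vector<float> out(in.size());
    for (std::size_t s = 0; s < n_samples; ++s)
        for (std::size_t b = 0; b < n_batches; ++b) // pass 2: fused convert,
            out[b * n_samples + s] =                // subtract, transpose
                static_cast<float>(in[s * n_batches + b]) - mean;
    return out;
}
```

Whether the convolution can also be folded into the second pass depends on how its stencil interacts with the transposed layout.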

I wanted to hear if maybe you have some intuition, since it will take considerable time to implement.

Yesterday’s GTC talk about the CUTLASS library discussed fusing an arithmetic kernel with an output-value permutation to avoid the read/write overhead. They also benchmarked very impressive gains in an example case.

Here’s the link to the session


I see they talk about it around 9:20. Thanks for the heads-up!
I wasn’t aware that CUTLASS has a convolution implementation - that could be useful here.