Hello!
I am currently analyzing a kernel that performs an attention operation on NVIDIA RTX 3080 and 4080 GPUs. The workflow involves the following sequence:
- A Tensor Core GEMM operation.
- A softmax computation on the GEMM output (the exponential uses hexp2 for FP16 values); a small sketch of this step is shown below the list.
- Another GEMM operation using the softmax output.
The GEMM operations use mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16. Based on profiling with Nsight Systems and Nsight Compute, Tensor Core utilization is constrained by the exp operations inside the softmax.
I am exploring ways to mitigate this bottleneck and increase Tensor Core utilization. I’m wondering if it’s possible to run the GEMM and softmax operations in parallel. For instance, could I split the registers so that half of them perform the softmax for one half of the data while the other half perform the GEMM for the other half (so there is no data dependency between them), and then swap tasks afterward? A rough sketch of what I mean is below.
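Schematically, the per-iteration structure I tried looks like this. It is heavily simplified: the register arrays, tile names, and the helper are placeholders, and the real kernel also handles the fragment layout and the softmax normalization. The idea is that the MMA for tile A and the exponentials for tile B touch disjoint registers, so the warp issues them back-to-back with no dependency between them:

```cuda
#include <cuda_fp16.h>

// One m16n8k16 FP16 MMA, wrapping the PTX instruction named above.
__device__ __forceinline__ void mma_m16n8k16_f16(unsigned d[2],
                                                 const unsigned a[4],
                                                 const unsigned b[2],
                                                 const unsigned c[2]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
        : "=r"(d[0]), "=r"(d[1])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "r"(c[0]), "r"(c[1]));
}

__device__ void interleaved_step(unsigned dA[2], const unsigned aA[4],
                                 const unsigned bA[2], const unsigned cA[2],
                                 __half2 scoresB[2], __half2 row_maxB,
                                 __half2 log2e) {
    // Independent work for two different tiles, issued from the same warp:
    // the MMA for tile A ...
    mma_m16n8k16_f16(dA, aA, bA, cA);

    // ... and the softmax exponentials for tile B, which read and write
    // completely different registers.
    #pragma unroll
    for (int i = 0; i < 2; ++i)
        scoresB[i] = h2exp2(__hmul2(__hsub2(scoresB[i], row_maxB), log2e));

    // In the next iteration the roles of tiles A and B are swapped.
}
```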
I attempted this approach but didn’t observe any improvement in Tensor Core utilization. Even though both the GEMM (mma.sync) and the exp instructions execute synchronously from the warp’s point of view, I think they belong to different execution pipelines, so they might still be able to run concurrently. Is there any way to make them overlap, given that there is no data dependency between them?
I would greatly appreciate any insights or suggestions to help resolve this issue.
Thank you!