SDPA example in cuBLASDx

For the scaled_dot_prod_attn_batched example's performance results compared with PyTorch's SDPA, could you give more details on how to reproduce them? For example, the head-dimension size is not given.

I tried seqlen=64 on an H800 (with head-dim size = 64): the cycle count for the flash-attention kernel used by Transformer Engine was 10828, but I got 13106 with cuBLASDx, which is far from what the figure below shows:

[figure: cuBLASDx vs. PyTorch SDPA performance comparison from the documentation]

https://docs.nvidia.com/cuda/cublasdx/examples.html#advanced-examples
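For reference, here is a rough sketch of how I would time a PyTorch SDPA baseline for this problem size. The batch size, number of heads, and dtype below are assumptions (they are exactly the setup details I am asking about), and since the docs report cycles rather than milliseconds, converting the numbers would also require knowing the SM clock rate used:

```python
# Sketch of a PyTorch SDPA baseline for seqlen=64, head_dim=64.
# Batch size, number of heads, and dtype are assumptions -- the docs
# do not state them, which is part of this question.
import torch
import torch.nn.functional as F

batch, heads, seqlen, head_dim = 8, 16, 64, 64  # batch/heads assumed
q = torch.randn(batch, heads, seqlen, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Warm up, then time with CUDA events.
for _ in range(10):
    F.scaled_dot_product_attention(q, k, v)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    F.scaled_dot_product_attention(q, k, v)
end.record()
torch.cuda.synchronize()
print(f"avg time per call: {start.elapsed_time(end) / 100:.3f} ms")
```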

Also, the flash-attention implementation in the 23.04 container does not support H100, so please check the setup details.