CUTLASS usage for RTX 4070 (Ada)

I keep getting this compile error: "/nvidia/cutlass/include\cutlass/gemm/threadblock/mma_base.h(128): error : static assertion failed with "The pipelined structure requires at least two warp-level GEMM operations."" Can anyone help me out?

I am using CUDA 12.6 in Visual Studio (MSVS). The code is the original ada_f8_gemm example, with type float_e5m2_t for both ElementA and ElementB and tfloat32_t for the output, and in that form it compiles fine. However, I want to use tfloat32_t for A or B (mixed precision). I tried a few combinations of GemmShape for the threadblock, warp, and instruction shapes, and they all fail with the same error. How can I resolve it?
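To make the question concrete, here is a minimal standalone sketch of what I understand the assertion to check. It assumes the check is that the warp tile's K extent divided by the instruction's K extent is greater than one. The `Shape` and `PipelineCheck` types and the example shapes are hypothetical stand-ins for illustration, not CUTLASS APIs and not my actual configuration, and the sketch does not explain why the mixed-precision case fails.

```cpp
// Standalone sketch: no CUTLASS headers are needed.
// All names below are hypothetical stand-ins, not CUTLASS APIs.

// Mimics a GemmShape<M, N, K>-style tile just enough for the arithmetic.
template <int M, int N, int K>
struct Shape {
  static constexpr int kM = M;
  static constexpr int kN = N;
  static constexpr int kK = K;
};

// Number of warp-level MMA steps along K:
// the warp tile's K extent divided by the instruction's K extent.
constexpr int warp_gemm_iterations(int warp_k, int instruction_k) {
  return warp_k / instruction_k;
}

// Assumed condition behind the error message: at least two iterations.
template <typename WarpShape, typename InstructionShape>
struct PipelineCheck {
  static constexpr int kIterations =
      warp_gemm_iterations(WarpShape::kK, InstructionShape::kK);
  static constexpr bool kOk = kIterations > 1;
};

// Illustrative shapes only (assumptions, not taken from the example).
using InstructionK32 = Shape<16, 8, 32>;
using WarpK64 = Shape<64, 32, 64>;  // 64 / 32 = 2 iterations
using WarpK32 = Shape<64, 32, 32>;  // 32 / 32 = 1 iteration

// Two iterations satisfy the assumed condition.
static_assert(PipelineCheck<WarpK64, InstructionK32>::kOk,
              "warp K of 64 with instruction K of 32 gives two iterations");

// One iteration would violate it, which matches the kind of failure I see.
static_assert(!PipelineCheck<WarpK32, InstructionK32>::kOk,
              "warp K of 32 with instruction K of 32 gives one iteration");
```

If that reading is right, the K extent that the warp-level operator actually ends up using must be at least twice the K extent of the MMA instruction it selects. What I cannot tell is which instruction shape gets selected once A or B becomes tfloat32_t.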