CUTLASS 1-bit Tensorcore GEMM result error on SM86

Hi all,

I compiled the following 1-bit Tensor Core GEMM for SM86 with CUDA 11.1 and ran it on an RTX 3090:

using ElementOutput = int32_t;
using ElementAccumulator = int32_t;
using ElementCompute = int32_t;

using Gemm = cutlass::gemm::device::Gemm<
      cutlass::uint1b_t, cutlass::layout::RowMajor,     // A: 1-bit, row-major
      cutlass::uint1b_t, cutlass::layout::ColumnMajor,  // B: 1-bit, column-major
      ElementOutput, cutlass::layout::RowMajor,         // C/D: int32, row-major
      ElementAccumulator, cutlass::arch::OpClassTensorOp, cutlass::arch::Sm75,
      cutlass::gemm::GemmShape<128, 256, 512>,  // threadblock tile (M, N, K)
      cutlass::gemm::GemmShape<64, 64, 512>,    // warp tile (M, N, K)
      cutlass::gemm::GemmShape<8, 8, 128>,      // instruction shape (M, N, K)
      cutlass::epilogue::thread::LinearCombination<
          ElementOutput, 128 / cutlass::sizeof_bits<ElementOutput>::value,
          ElementAccumulator, ElementCompute>,
      cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
      2,         // pipeline stages
      128, 128,  // alignment of A and B, in elements (bits)
      false,     // SplitKSerial
      cutlass::arch::OpXorPopc>;  // XOR followed by popcount

The output is always wrong (it fails verification) for square problems with M = N = K < 8192.
For M = N = K >= 8192, e.g. 8192 or 16384, it passes verification.
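
For context, a host-side reference for this kind of XOR-popcount GEMM could look like the sketch below. It is an assumption about what "verification" means here, not the actual test code from this report: the function names `popcount32` and `reference_xor_popc_gemm`, the 32-bit packing along K, and the row-major output are all hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable population count for one 32-bit word.
static int popcount32(uint32_t x) {
    int count = 0;
    while (x) {
        x &= x - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Hypothetical host reference: C[m][n] = sum over k of popcount(A[m][k] XOR B[k][n]).
// A is M x K row-major and B is K x N column-major, so both are contiguous along K.
// Bits are packed 32 per word (assumed packing); C is written row-major.
std::vector<int32_t> reference_xor_popc_gemm(const std::vector<uint32_t>& A,
                                             const std::vector<uint32_t>& B,
                                             int M, int N, int K) {
    assert(K % 32 == 0);  // this sketch only handles whole words along K
    const std::size_t words = static_cast<std::size_t>(K) / 32;
    std::vector<int32_t> C(static_cast<std::size_t>(M) * N, 0);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            int32_t acc = 0;
            for (std::size_t w = 0; w < words; ++w) {
                acc += popcount32(A[m * words + w] ^ B[n * words + w]);
            }
            C[static_cast<std::size_t>(m) * N + n] = acc;
        }
    }
    return C;
}
```

Comparing the device output element by element against such a reference, for sizes on both sides of 8192, would show whether the mismatches are scattered or confined to particular tiles.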

One more thing to note: if I instead use a GEMM configuration based on SM80, such as the one shown below,
it always triggers a CUTLASS internal runtime error during GEMM initialization. The SM75 configuration does not trigger this error; its only problem is the correctness of the results.

using ElementOutput = int32_t;
using ElementAccumulator = int32_t;
using ElementCompute = int32_t;

using Gemm = cutlass::gemm::device::Gemm<
    cutlass::uint1b_t, cutlass::layout::RowMajor,     // A: 1-bit, row-major
    cutlass::uint1b_t, cutlass::layout::ColumnMajor,  // B: 1-bit, column-major
    ElementOutput, cutlass::layout::ColumnMajor,      // C/D: int32, column-major
    ElementAccumulator, cutlass::arch::OpClassTensorOp, cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 256, 1024>,  // threadblock tile (M, N, K)
    cutlass::gemm::GemmShape<64, 64, 1024>,    // warp tile (M, N, K)
    cutlass::gemm::GemmShape<16, 8, 256>,      // instruction shape (M, N, K)
    cutlass::epilogue::thread::LinearCombination<
        ElementOutput, 128 / cutlass::sizeof_bits<ElementOutput>::value,
        ElementAccumulator, ElementCompute>,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    3,         // pipeline stages
    128, 128,  // alignment of A and B, in elements (bits)
    false,     // SplitKSerial
    cutlass::arch::OpXorPopc>;  // XOR followed by popcount
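
One thing that may be worth ruling out for the SM80 configuration is shared memory. The sketch below is only a rough estimate, assuming the mainloop keeps one A tile and one B tile per stage in shared memory with no padding and ignoring the epilogue; `mainloop_smem_bytes` is a hypothetical helper, not a CUTLASS function. The per-block limits (99 KB for compute capability 8.6, 163 KB for 8.0) are quoted from the CUDA programming guide as I understand it and should be confirmed against `sharedMemPerBlockOptin` in `cudaDeviceProp` on the actual device.

```cpp
#include <cstddef>

// Rough mainloop shared-memory footprint in bytes (assumption: no padding,
// one A tile of tile_m x tile_k and one B tile of tile_n x tile_k per stage).
constexpr std::size_t mainloop_smem_bytes(int tile_m, int tile_n, int tile_k,
                                          int stages, int bits_per_element) {
    return static_cast<std::size_t>(stages) *
           (static_cast<std::size_t>(tile_m) + tile_n) * tile_k *
           bits_per_element / 8;
}

// Threadblock shapes and stage counts taken from the two configurations above.
constexpr std::size_t kSm75ConfigBytes = mainloop_smem_bytes(128, 256, 512, 2, 1);
constexpr std::size_t kSm80ConfigBytes = mainloop_smem_bytes(128, 256, 1024, 3, 1);

// Assumed per-block opt-in limits; verify on the target GPU.
constexpr std::size_t kSm86LimitBytes = 99 * 1024;
constexpr std::size_t kSm80LimitBytes = 163 * 1024;

static_assert(kSm75ConfigBytes == 49152, "SM75 config: about 48 KB");
static_assert(kSm80ConfigBytes == 147456, "SM80 config: about 144 KB");
static_assert(kSm80ConfigBytes > kSm86LimitBytes, "exceeds the assumed SM86 limit");
static_assert(kSm80ConfigBytes <= kSm80LimitBytes, "fits the assumed SM80 limit");
```

Under these assumptions the SM80 configuration needs about 144 KB per threadblock, which would fit on an SM80 device but not within the limit assumed for SM86, while the SM75 configuration needs about 48 KB. If that is the cause of the initialization error, reducing the stage count or the threadblock K extent should make it go away; it would not by itself explain the wrong results of the SM75 configuration.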