Verifying claimed TOPS performance on Jetson Thor – CUTLASS kernel for SM110 does not run, SM80 gives very low performance (~6.9 TFLOP/s)

Thanks for your reply.
I still have two questions.

  1. Do you mean that Thor doesn’t support MXFP4?
  2. I am reading the PTX docs. The entry for the instruction tcgen05.mma.cta_group.kind.block_scale{.scale-vectorsize} says that .scale-vectorsize can only be used with sm_100a, sm_100f, and sm_110f, but Thor is sm_110a. At the same time, when the data type is .kind::mxf4nvf4, K is at least 64. Can you confirm whether the .scale-vectorsize qualifier is available on Thor? Thank you again.
    Here is the relevant section of the docs:
    https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tcgen05-mma-instructions-mma