FP8 WMMA kernel compilation error

I am trying to compile a simple wmma kernel with the FP8 data type on an H100. Here is my example code.

GPU: H100
CUDA version 12.0
gcc version 11.3

The header file (tu_test_e4m3.cuh):

#include <cuda.h>
#include <mma.h>
#include <cuda_fp8.h>
#include <cuda_fp8.hpp>
using namespace nvcuda;

/* Performs a 16x16x16 warp matmul with input types InT
 * and accumulator type AccT. This uses tensor cores. */

typedef int8_t INT8;
typedef __nv_fp8_e4m3 D8; // 8-bit float on device
//typedef half D8;

__global__ void wmma_ker(D8 *a, D8 *b, float *c) {
   // Declare the fragments
   wmma::fragment<wmma::matrix_a, 16, 16, 16, D8, wmma::row_major> a_frag;
   wmma::fragment<wmma::matrix_b, 16, 16, 16, D8, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);

   // Load the inputs
   wmma::load_matrix_sync(a_frag, a, 16);
   wmma::load_matrix_sync(b_frag, b, 16);

   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store the output
   wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

The .cu file looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <cmath>
#include <assert.h>
#include <curand.h>
#include <stdint.h>
#include "tu_test_e4m3.cuh"

typedef union {
    INT8 _i;
    D8 _f;
} H8;  // 8-bit float on host

const int DIM = 16;

#define cudaErrCheck(stat) { cudaErrCheck_((stat), __FILE__, __LINE__); }
void cudaErrCheck_(cudaError_t stat, const char *file, int line) {
    if (stat != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s %s %d\n", cudaGetErrorString(stat), file, line);
    }
}

int main(int argc, char** argv) {
    // We need a 1-byte int type to treat as FP8
    assert(sizeof(INT8) == 1);
    assert(sizeof(D8) == 1);
    assert(sizeof(H8) == 1);

    cudaEvent_t start_wmma;
    cudaEvent_t stop_wmma;

    // Allocate A, B, C on host
    H8 *A, *B;
    float *C;
    A = (H8*)malloc(DIM*DIM*sizeof(H8));
    B = (H8*)malloc(DIM*DIM*sizeof(H8));
    C = (float*)malloc(DIM*DIM*sizeof(float));
    memset(A, 0, DIM*DIM*sizeof(H8)); // sizeof(A) would only zero pointer-size bytes
    memset(B, 0, DIM*DIM*sizeof(H8));

    // Allocate A,B,C on device
    D8 *d_A, *d_B;
    float *d_C;
    cudaMalloc((void**)&d_A, DIM*DIM*sizeof(D8));
    cudaMalloc((void**)&d_B, DIM*DIM*sizeof(D8));
    cudaMalloc((void**)&d_C, DIM*DIM*sizeof(float));

    for(int i = 0; i < DIM; i++) {
        A[i]._i = 1;
    }
    // Copy to device
    cudaMemcpy(d_A, A, DIM*DIM*sizeof(D8), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, DIM*DIM*sizeof(D8), cudaMemcpyHostToDevice);

    // Do the matmul
    dim3 threads = dim3(DIM,DIM);
    dim3 blocks = dim3(1,1);
    wmma_ker<<<blocks,threads>>>(d_A, d_B, d_C);

    // Copy back
    cudaMemcpy(C, d_C, DIM*DIM*sizeof(float), cudaMemcpyDeviceToHost);

    return 0;
}
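As an aside, note that `A[i]._i = 1` sets the raw byte to 1, which in e4m3 is a tiny denormal, not the value 1.0; with an exponent bias of 7, fp8 1.0 is the byte 0x38. A minimal host-side sketch of the encoding for positive normal values (the function name is mine, and rounding is simply truncated):

```c
#include <stdint.h>
#include <math.h>

/* Encode a positive, normal float into e4m3 (1 sign bit, 4 exponent
 * bits with bias 7, 3 mantissa bits). A sketch only: no rounding,
 * no denormals, no saturation or NaN handling. */
static uint8_t float_to_e4m3(float f) {
    int e;
    float m = frexpf(f, &e);       /* f = m * 2^e, with m in [0.5, 1) */
    int exp_bits = (e - 1) + 7;    /* renormalize mantissa to [1, 2), add bias */
    int man_bits = (int)((2.0f * m - 1.0f) * 8.0f); /* top 3 fraction bits */
    return (uint8_t)((exp_bits << 3) | man_bits);
}
```

So to fill A with fp8 ones, the loop should store `float_to_e4m3(1.0f)` (i.e. 0x38) into `A[i]._i`, or use the `__nv_fp8_e4m3(float)` constructor on the host.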

At compilation I get the following error:

nvcc -arch=sm_90 tu_test_e4m3.s.cu -o tu_test_e4m3
tu_test_e4m3.cuh(17): error: incomplete type is not allowed

tu_test_e4m3.cuh(18): error: incomplete type is not allowed

2 errors detected in the compilation of "tu_test_e4m3.s.cu".
make: *** [Makefile:2: tu_test_e4m3] Error 2

Any help will be appreciated.

  1. Please include a complete code.
  2. Please don’t post pictures of text on this forum.
  3. Please properly format code on this forum. A simple method: edit your post by clicking on the pencil icon under it. Select the code in the edit window. Press the </> button at the top of the edit window. Save your changes.

I updated the post. Would appreciate the help. If you copy this header and source code into files, you should be able to reproduce the error.

mma.h (i.e. /usr/local/cuda/include/crt/mma.h) in CUDA 12.0 doesn’t seem to have any mma definition overloads for fp8 yet. May have to wait for a future release. It’s possible that CUDA 12.1 has some, I haven’t checked. I don’t see them in CUDA 12.1 either. I don’t have information about when things may appear in the future.

There are fp8 ptx instructions exposed.
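For reference, a sketch of what using one of those PTX instructions might look like from CUDA C++ via inline assembly. This assumes sm_89 or later and PTX ISA 8.0+, and illustrates only the m16n8k32 e4m3 mma shape; the per-thread fragment-to-register mapping and the ldmatrix loads needed to populate the fragments are omitted, so this is not a complete kernel:

```cuda
#include <cstdint>

// One m16n8k32 tensor-core multiply-accumulate on e4m3 inputs with f32
// accumulation. Per the PTX fragment layout, each of the 32 threads in
// the warp supplies 4 x 32-bit registers of A (16 packed fp8 values),
// 2 registers of B (8 values), and 4 f32 accumulators.
__device__ void mma_m16n8k32_e4m3(float d[4], const uint32_t a[4],
                                  const uint32_t b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```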

It is interesting that you say it’s not supported. I have an fp8 matmul implementation which leverages cublasLt and it seems to work. I wonder how the cublas fp8 kernel is written?

I don’t think I said “it’s not supported”. I pointed out that a particular header file didn’t seem to have the necessary overloads (yet), and also that the operation seemed to be available via PTX.

I guess cublasLt is using PTX directly then? The cublasLt included as part of CUDA 12.0 works for fp8 matmul.

I think that is one possibility, anyway.

Is there a way to raise issues with the cuBLAS team? I am seeing other issues with certain matrix sizes in cublasLt.

You can file a bug. If you don’t provide a complete repro case, you will probably be asked for one.