Sample code for integer arithmetic on RTX tensor cores?


I was wondering if there is any self-contained, compilable sample code out there that shows how to multiply signed/unsigned 8-, 4-, or 1-bit integers on the new Turing tensor cores with 32-bit accumulation.

All I’ve seen so far are short snippets on a few AI conference slides, such as:

__device__ void tensor_op_16_16_16(char *a, char *b, int *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, char, ...> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, char, ...> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int, ...> c_frag;
    wmma::load_matrix_sync(a_frag, a, ...);
    wmma::load_matrix_sync(b_frag, b, ...);
    wmma::fill_fragment(c_frag, 0);  // integer accumulator, so an int zero rather than 0.0f
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, ...);
}


Not exactly what you are asking for, but there is additional detail in the programming guide, in the section on warp matrix functions.
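
For what it's worth, here is a minimal sketch of how the slide snippet might be fleshed out into a self-contained 8-bit test. It is not taken from the slides or the guide. It assumes CUDA 10 or later, a device of compute capability 7.2 or higher (Turing is sm_75), a single 16x16x16 tile handled by one warp, row-major A and column-major B. The names wmma_int8_tile and int8_wmma_matches_reference, the fill pattern, and the build line are my own choices for illustration.

```cuda
// Build (assumed): nvcc -arch=sm_75 -o wmma_int8 wmma_int8.cu
#include <cstdio>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

constexpr int N = 16;  // single 16x16x16 tile (M = N = K = 16)

// One warp computes C = A * B with signed 8-bit inputs and 32-bit integer accumulation.
__global__ void wmma_int8_tile(const signed char *a, const signed char *b, int *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c_frag;

    wmma::fill_fragment(c_frag, 0);            // zero the int32 accumulator
    wmma::load_matrix_sync(a_frag, a, N);      // leading dimension = 16 elements
    wmma::load_matrix_sync(b_frag, b, N);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, N, wmma::mem_row_major);
}

// Runs the kernel on one tile and compares against a plain CPU loop.
bool int8_wmma_matches_reference()
{
    signed char *a = nullptr, *b = nullptr;
    int *c = nullptr;
    bool ok = cudaMallocManaged(&a, N * N) == cudaSuccess &&
              cudaMallocManaged(&b, N * N) == cudaSuccess &&
              cudaMallocManaged(&c, N * N * sizeof(int)) == cudaSuccess;

    if (ok) {
        // A is row-major: element (i, k) at a[i*N + k].
        // B is column-major: element (k, j) at b[j*N + k].
        for (int i = 0; i < N; ++i)
            for (int k = 0; k < N; ++k) {
                a[i * N + k] = (signed char)((i + k) % 7 - 3);
                b[i * N + k] = (signed char)((3 * i + k) % 5 - 2);
            }

        wmma_int8_tile<<<1, 32>>>(a, b, c);    // exactly one warp
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }

    // CPU reference: C(i, j) = sum over k of A(i, k) * B(k, j).
    for (int i = 0; ok && i < N; ++i)
        for (int j = 0; ok && j < N; ++j) {
            int ref = 0;
            for (int k = 0; k < N; ++k)
                ref += (int)a[i * N + k] * (int)b[j * N + k];
            if (c[i * N + j] != ref) ok = false;
        }

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return ok;
}

int main()
{
    bool ok = int8_wmma_matches_reference();
    printf("%s\n", ok ? "PASS" : "FAIL");
    return ok ? 0 : 1;
}
```

Swapping signed char for unsigned char should give the unsigned 8-bit variant. As far as I can tell, the 4-bit and 1-bit types sit in the separate wmma::experimental namespace with different tile shapes, so they would need a different kernel rather than a type swap.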

I guess I will have a closer look at the internals of the CUTLASS 1.2 library.