Hi,
I was wondering if there is any self-contained, compilable sample code out there that shows how to multiply signed/unsigned 8-, 4- or 1-bit integers on the new Turing tensor cores with 32-bit accumulation.
All I’ve seen so far are short snippets on a few AI conference slides, such as:
__device__ void tensor_op_16_16_16(char *a, char *b, int *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, char, ...> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, char, ...> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int, ...> c_frag;

    wmma::load_matrix_sync(a_frag, a, ...);
    wmma::load_matrix_sync(b_frag, b, ...);
    wmma::fill_fragment(c_frag, 0);  // integer accumulator, so 0 rather than 0.0f
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, ...);
}
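To make clear what result I expect from such a tile operation, here is a plain CPU sketch of the arithmetic for the signed 8-bit case. This is my own reference, not something from the slides: the function name `reference_tile_mma` and the layouts (A row-major, B column-major, C row-major) are assumptions, since the slides elide the layout arguments.

```cpp
#include <cstdint>

// Hypothetical CPU reference, not from the slides: computes C = A * B for a
// single 16x16x16 tile with signed 8-bit inputs and 32-bit accumulation.
// Assumed layouts: A is row-major (M x K), B is column-major (K x N),
// C is row-major (M x N).
constexpr int M = 16, N = 16, K = 16;

void reference_tile_mma(const int8_t *a, const int8_t *b, int32_t *c)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            int32_t acc = 0;  // accumulator starts at zero, like fill_fragment(c_frag, 0)
            for (int k = 0; k < K; ++k) {
                // Widen each operand to 32 bits before multiplying so the
                // product and the running sum cannot wrap at 8 or 16 bits.
                acc += int32_t(a[i * K + k]) * int32_t(b[j * K + k]);
            }
            c[i * N + j] = acc;
        }
    }
}
```

With every input set to -128, each output element is 16 * 16384 = 262144, which does not fit in 16 bits. That is why I am specifically after the 32-bit accumulation path.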
Christian