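I want to use int8 with Tensor Cores, but that isn't supported yet, so I converted my data from int8 to half and stored it in mat16[16].
When mat16 lives in local memory or registers, wmma::load_matrix_sync does not work. When the second parameter (the source pointer) points to global memory, it works.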
half mat16[16];                              // per-thread array: registers/local memory
wmma::load_matrix_sync(a[i], &mat16[0], 0);  // does not work when mat16 is local
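A possible workaround is sketched below, under some stated assumptions. The PTX ISA description of wmma.load appears to allow only global or shared memory, or a generic address that resolves to one of them, so a pointer into a per-thread array is not valid. The CUDA Programming Guide also requires the pointer to be 256-bit (32-byte) aligned, and `ldm` (the row stride in elements) must be a multiple of 8 for half, so 0 is not a valid stride. A 16x16 tile also holds 256 elements, not 16. The sketch has each lane convert its own values in a small local array, copy them into an aligned `__shared__` tile, and load the fragment from there. The kernel name `tile_mma`, the helper `run_demo`, the lane-to-element layout (lane L writes row L/2, columns (L%2)*8 to +7) and the all-1/all-2 test data are hypothetical. The sketch assumes an sm_70+ GPU, e.g. compiled with `nvcc -arch=sm_70`.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
#include <cassert>
#include <cstdio>
#include <cstdlib>
using namespace nvcuda;

// Hypothetical kernel: one warp computes C = A * B for a single 16x16x16 tile.
// The int8 inputs are converted to half, because this sketch uses the half WMMA path.
__global__ void tile_mma(const signed char* a_int8, const signed char* b_int8, float* c) {
    assert(blockDim.x == 32);  // the layout below assumes exactly one warp

    // Staging tiles in shared memory, 32-byte aligned as load_matrix_sync requires.
    __shared__ __align__(32) half a_tile[16 * 16];
    __shared__ __align__(32) half b_tile[16 * 16];

    // Hypothetical layout: lane L owns row L/2, columns (L%2)*8 .. (L%2)*8+7.
    const int lane = threadIdx.x;
    const int base = (lane / 2) * 16 + (lane % 2) * 8;

    half a_vals[8], b_vals[8];  // per-lane values in registers/local memory, like mat16
    for (int j = 0; j < 8; ++j) {
        a_vals[j] = __int2half_rn(a_int8[base + j]);
        b_vals[j] = __int2half_rn(b_int8[base + j]);
    }
    // Copy the local values into shared memory; WMMA cannot read local memory.
    for (int j = 0; j < 8; ++j) {
        a_tile[base + j] = a_vals[j];
        b_tile[base + j] = b_vals[j];
    }
    __syncwarp();  // make every lane's shared-memory writes visible to the whole warp

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // ldm = 16: row stride of a packed 16x16 row-major tile (a multiple of 8 for half).
    wmma::load_matrix_sync(a_frag, a_tile, 16);
    wmma::load_matrix_sync(b_frag, b_tile, 16);
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

static void check(cudaError_t e) {
    if (e != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));
        std::exit(1);
    }
}

// Hypothetical host helper: A is all 1s and B is all 2s, so each C element is 16 * 1 * 2 = 32.
void run_demo(float* out /* 256 floats */) {
    signed char ha[256], hb[256];
    for (int i = 0; i < 256; ++i) { ha[i] = 1; hb[i] = 2; }

    signed char *da = nullptr, *db = nullptr;
    float* dc = nullptr;
    check(cudaMalloc(&da, 256));
    check(cudaMalloc(&db, 256));
    check(cudaMalloc(&dc, 256 * sizeof(float)));
    check(cudaMemcpy(da, ha, 256, cudaMemcpyHostToDevice));
    check(cudaMemcpy(db, hb, 256, cudaMemcpyHostToDevice));

    tile_mma<<<1, 32>>>(da, db, dc);
    check(cudaGetLastError());
    check(cudaMemcpy(out, dc, 256 * sizeof(float), cudaMemcpyDeviceToHost));

    check(cudaFree(da));
    check(cudaFree(db));
    check(cudaFree(dc));
}
```

The extra copy through shared memory costs a little bandwidth, but it is cheap for one tile per warp and keeps the element layout under your control. Filling `a_frag.x[]` directly is also possible, but the element-to-thread mapping is unspecified, so that only works safely for uniform values.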