Hi,
I have a large matrix of int8 data in shared memory that I would like to multiply by an fp16 matrix (also in shared memory) using Tensor Cores.
Is there any way to load the int8 data directly into an fp16 wmma::fragment, without first converting the int8 data to fp16 in shared memory and then loading the fragment from there?
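For reference, here is a minimal sketch of the workaround I'm describing (the conversion through a shared-memory fp16 staging buffer). The 16x16x16 tile size, pointer names, layouts, and the one-warp launch are illustrative assumptions, not my actual kernel:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
#include <cstdint>
using namespace nvcuda;

// Launch with a single warp (32 threads) for this illustrative tile.
__global__ void mixed_tile_mma(const int8_t* a_i8_gmem,   // hypothetical int8 source
                               const half*   b_f16_gmem,  // hypothetical fp16 source
                               float*        c_gmem)
{
    constexpr int M = 16, N = 16, K = 16;

    __shared__ int8_t a_i8 [M * K];   // int8 tile resident in shared memory
    __shared__ half   b_f16[K * N];   // fp16 tile resident in shared memory
    __shared__ half   a_f16[M * K];   // extra staging buffer I'd like to avoid

    // Stage the tiles in shared memory (details of the real copy omitted).
    for (int i = threadIdx.x; i < M * K; i += blockDim.x) a_i8[i]  = a_i8_gmem[i];
    for (int i = threadIdx.x; i < K * N; i += blockDim.x) b_f16[i] = b_f16_gmem[i];
    __syncthreads();

    // The conversion step in question: widen int8 -> fp16 element by element.
    for (int i = threadIdx.x; i < M * K; i += blockDim.x)
        a_f16[i] = __int2half_rn(static_cast<int>(a_i8[i]));
    __syncthreads();

    // Only then can the half fragment be loaded from the staging buffer.
    wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a_f16, K);
    wmma::load_matrix_sync(b_frag, b_f16, N);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c_gmem, c_frag, N, wmma::mem_row_major);
}
```

Ideally I could skip the a_f16 buffer and the extra __syncthreads() and load the int8 tile straight into the fp16 fragment.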
Many thanks,