Hi experts,
I want to implement a quantize function that performs the same quantization TensorRT uses.
My code is below:
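__device__ __forceinline__ int8_t quantize(float val, float quantScale) {
    // Scale the input into the quantized domain.
    float s = val / quantScale;
    int32_t res;
    // Round to nearest even and saturate to the s8 range [-128, 127].
    asm volatile("cvt.rni.sat.s8.f32 %0, %1;" : "=r"(res) : "f"(s));
    return static_cast<int8_t>(res);
}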
However, after I call this function, the outputs show some accuracy errors.
Is my implementation correct?
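To narrow this down, a host-side reference that I could compare the device function against might look like the sketch below. It assumes the intended behavior is round-to-nearest-even of val / quantScale followed by saturation to [-128, 127], which is what cvt.rni.sat.s8.f32 does. Whether TensorRT uses exactly this range and a true division is an assumption here, not something I have confirmed. The name quantize_ref is hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical host-side reference for the device quantize() above.
// Assumption: quantization is round-to-nearest-even of val / quantScale,
// saturated to the int8 range [-128, 127].
inline int8_t quantize_ref(float val, float quantScale) {
    float s = val / quantScale;
    // PTX float-to-integer cvt converts NaN inputs to 0; mirror that here.
    if (std::isnan(s)) return 0;
    // std::nearbyint uses the current rounding mode, which defaults to
    // round-to-nearest with ties to even (same as the .rni modifier).
    float r = std::nearbyint(s);
    // Saturate to the signed 8-bit range, like the .sat modifier.
    if (r > 127.0f) r = 127.0f;
    if (r < -128.0f) r = -128.0f;
    return static_cast<int8_t>(r);
}
```

Running both versions over the same inputs and printing the ones that differ should show whether the mismatch comes from the rounding and saturation step or from somewhere else, such as the scale value or the precision of the division on the device.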
Thank you.