I currently have two very large matrices, each stored as a float* array.
The data in each matrix, however, does not require 32-bit precision.
For matrix “A”, the data is of integer type and ranges from -1 to 80.
For matrix “B”, the data is of real (non-integer) type and ranges from 0 to 99, with a required precision of about two decimal places.
To save memory and bandwidth, I wish to pack the two matrices into a single float* matrix (which should be doable, as each value needs barely as many bits as a single fp16). The packed matrix will then be copied to global memory on the device, and accessed (and unpacked) from within the kernel.
Note that packing must take place on the host, whereas unpacking must take place inside the kernel (on the device).
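To make the idea concrete, here is a minimal sketch of one possible encoding. It is not an fp16 scheme: fp16 has a 10-bit stored mantissa, so its spacing between 64 and 128 is 0.0625, which is too coarse for two decimal places near 99. The sketch instead stores both values as one whole number inside a float. The names `pack_ab`, `unpack_ab` and the `HD` macro are hypothetical, and it assumes A is in [-1, 80] and B is in [0, 99.99]. Under those assumptions the largest packed value is 81 * 10000 + 9999 = 819999, which is below 2^24, so a 32-bit float represents it exactly.

```cpp
#include <cassert>
#include <cmath>

// Lets the same source build as plain C++ or, under nvcc, as host+device code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper (host side). Assumes a in [-1, 80] and b in [0, 99.99].
// (a + 1) occupies the ten-thousands; round(b * 100) occupies the low four digits.
inline float pack_ab(int a, float b) {
    int hundredths = static_cast<int>(std::lround(b * 100.0f));
    return static_cast<float>((a + 1) * 10000 + hundredths);
}

// Hypothetical helper (host or device). Inverse of pack_ab.
// Uses only integer division and one float division, so it can run in a kernel.
HD inline void unpack_ab(float packed, int* a, float* b) {
    int v = static_cast<int>(packed);   // exact: packed is a whole number < 2^24
    *a = v / 10000 - 1;
    *b = static_cast<float>(v % 10000) / 100.0f;
}
```

On the host, each element would be written as `packed[i] = pack_ab(A[i], B[i])` before the copy to the device; inside the kernel, `unpack_ab(packed[i], &a, &b)` recovers both values. B comes back rounded to the nearest hundredth, which matches the stated precision requirement.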
I know that this is possible using fp16 arithmetic, but I am not really sure how to accomplish it.
Any help would be greatly appreciated!