Hi, I’ve encountered a problem while working on GPU processing with reduced precision (i.e., 4-bit and 8-bit integers).
I have eight 4-bit integer elements packed into each 32-bit integer element of a shared memory array.
However, they are not arranged the way I want them to be.
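To make the packing concrete, here is a minimal sketch of how one 4-bit element could be read from or written to a 32-bit word. It assumes that element `k` occupies bits `4k` to `4k+3` (element 0 in the lowest nibble), which the post does not state. The helper names `get_nibble`, `set_nibble`, and `nibble_to_int` are hypothetical, and the `HD` macro only exists so the same code compiles both as plain C++ and under `nvcc`.

```cpp
#include <cstdint>

// Lets the helpers compile as host-only C++ or as CUDA host/device code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// ASSUMPTION: element k lives in bits [4k, 4k+3] of the word.

// Extract the k-th 4-bit element (0 <= k < 8) as an unsigned value 0..15.
HD inline uint32_t get_nibble(uint32_t word, int k) {
    return (word >> (4 * k)) & 0xFu;
}

// Return a copy of `word` with the k-th 4-bit element replaced by `value`.
HD inline uint32_t set_nibble(uint32_t word, int k, uint32_t value) {
    const uint32_t shift = 4u * static_cast<uint32_t>(k);
    return (word & ~(0xFu << shift)) | ((value & 0xFu) << shift);
}

// Interpret a 4-bit pattern as a signed two's-complement int4 (-8..7).
HD inline int nibble_to_int(uint32_t nib) {
    return static_cast<int>((nib & 0xFu) ^ 0x8u) - 8;
}
```

If the real layout puts element 0 in the highest nibble instead, only the shift amount changes (`4 * (7 - k)`).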
[Figure 1] shows the current data arrangement of the shared memory array. (Let’s say the shared memory array holds 64 elements of a 32-bit datatype.) Each 32-bit element of the shared memory contains eight 4-bit integers. I want the 4-bit values with the same color to be stored consecutively in the shared memory array, as shown in [Figure 2].
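One possible reading of the figures can be sketched as follows. This is only an assumption, since the figures are not reproduced here: "same color" is taken to mean "same nibble position `k` inside every word", and the target order puts all position-0 values of words 0..63 first, then all position-1 values, and so on. All names (`gather_output_word`, `regroup_reference`, `regroup_in_place`, `regroup_kernel`) are hypothetical. Each thread gathers whole output words into registers, and the block synchronizes before anything is written back, so no thread overwrites a word that another thread still needs to read.

```cpp
#include <cstdint>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

constexpr int kWords = 64;    // 32-bit words in the array (from the post)
constexpr int kPerWord = 8;   // 4-bit values per 32-bit word

// ASSUMPTION: value k of word i moves to output nibble index k * kWords + i,
// with nibble 0 stored in the lowest 4 bits of a word.
// Builds output word w from the 8 input words that feed it.
HD inline uint32_t gather_output_word(const uint32_t* in, int w) {
    const int k = w / kPerWord;                   // source nibble position
    const int first = (w % kPerWord) * kPerWord;  // first source word
    uint32_t result = 0;
    for (int p = 0; p < kPerWord; ++p) {
        const uint32_t nib = (in[first + p] >> (4 * k)) & 0xFu;
        result |= nib << (4 * p);
    }
    return result;
}

// Host-side reference (out-of-place), handy for checking a GPU result.
inline void regroup_reference(const uint32_t* in, uint32_t* out) {
    for (int w = 0; w < kWords; ++w) out[w] = gather_output_word(in, w);
}

#ifdef __CUDACC__
// In-place regroup of a kWords-element shared array.
// ASSUMPTION: 1-D block with at least 32 threads, called by every thread.
__device__ void regroup_in_place(uint32_t* smem) {
    constexpr int kMaxPerThread = 2;  // 64 words / 32 threads
    uint32_t result[kMaxPerThread] = {0u, 0u};

    __syncthreads();  // all earlier writes to smem are visible
    for (int r = 0; r < kMaxPerThread; ++r) {
        const int w = threadIdx.x + r * blockDim.x;
        if (w < kWords) result[r] = gather_output_word(smem, w);
    }
    __syncthreads();  // every read is finished before any write
    for (int r = 0; r < kMaxPerThread; ++r) {
        const int w = threadIdx.x + r * blockDim.x;
        if (w < kWords) smem[w] = result[r];
    }
    __syncthreads();  // regrouped data is visible to the whole block
}

// Example driver: load 64 words, regroup them, store them back.
__global__ void regroup_kernel(uint32_t* data) {
    __shared__ uint32_t smem[kWords];
    for (int w = threadIdx.x; w < kWords; w += blockDim.x) smem[w] = data[w];
    regroup_in_place(smem);
    for (int w = threadIdx.x; w < kWords; w += blockDim.x) data[w] = smem[w];
}
#endif
```

A launch such as `regroup_kernel<<<1, 32>>>(d_data)` would have each thread produce two output words. The sketch favors safety over speed: the gather reads are not tuned for shared memory bank conflicts, and a different target layout in [Figure 2] would only change the index arithmetic in `gather_output_word`.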
As I’m not very familiar with implementing inter-thread communication in CUDA, I’m finding it difficult to work out how to gather and redistribute the data safely among threads. I also wonder whether this problem is even solvable in CUDA, given that 32 threads work together as a single warp.
I want the CUDA implementation to be 1) data-safe, 2) fairly efficient, and 3) well-parallelized.