I have a question: is there a simple way to pack four unsigned chars into one register?
If I were coding for the CPU, I would do it with bit shifts and masks, for example:
var1 = reg1 & 0xff
var2 = (reg1 >> 8) & 0xff
var3 = (reg1 >> 16) & 0xff
var4 = (reg1 >> 24) & 0xff
reg1 is declared as unsigned int.
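
To make the shift-and-mask scheme above concrete, here is a minimal sketch of it as small helpers. The names pack4, get_byte, set_byte and the HD macro are hypothetical and chosen only for illustration. The macro is an assumption that lets the same code build with a plain host compiler or with nvcc, where it expands to `__host__ __device__`. This is not the code from the crashing kernel, only the packing idea in isolation.

```cpp
#include <cstdint>

// Assumption: under nvcc the helpers should be callable from kernels,
// on a plain host compiler the qualifiers are simply dropped.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Pack four bytes into one 32-bit word; b0 lands in the lowest 8 bits.
HD inline uint32_t pack4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
    return static_cast<uint32_t>(b0)
         | (static_cast<uint32_t>(b1) << 8)
         | (static_cast<uint32_t>(b2) << 16)
         | (static_cast<uint32_t>(b3) << 24);
}

// Read byte i (0..3) back out of the packed word.
HD inline uint8_t get_byte(uint32_t reg, unsigned i) {
    return static_cast<uint8_t>((reg >> (8u * i)) & 0xffu);
}

// Overwrite byte i (0..3) and leave the other three bytes untouched.
HD inline uint32_t set_byte(uint32_t reg, unsigned i, uint8_t v) {
    const unsigned shift = 8u * i;
    return (reg & ~(0xffu << shift)) | (static_cast<uint32_t>(v) << shift);
}
```

For example, pack4(0x11, 0x22, 0x33, 0x44) yields 0x44332211, and get_byte(0x44332211, 2) yields 0x33. Updating one of the packed values needs the clear-then-or step in set_byte; a plain or would leave stale bits behind.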
However, when I did that in my otherwise working code, it crashes!
Do you have any idea what is going wrong?
Or is there another way to pack four such variables into one register?
Register count is currently the only limiting factor for one of my slow, memory-bound kernels, and I want to increase its occupancy to (hopefully) hide latencies better.