I’m porting a program to CUDA that uses long arrays of 40-bit (5-byte) words.
x86 handles 64-bit reads/writes at a 5-byte stride without difficulty, and with
no measurable performance penalty. To perform these accesses in CUDA, I have for now
resorted to the following pair of routines:
__device__ void write40(u8 *p64, const u64 x) {
  u32 off = (u64)p64 & 3;                 // byte offset from 4-byte alignment
  u32 *a = (u32 *)(p64 - off);            // aligned 8-byte window covering the 5-byte word
  const u64 y = (u64)a[1] << 32 | a[0];   // current contents of the window
  int s = 8 * off;
  u64 mask = 0xffffffffffULL << s;        // 40-bit field at bit offset s
  const u64 z = (y & ~mask) | ((x << s) & mask); // splice x in; masking guards stray high bits
  a[0] = (u32)z;
  a[1] = (u32)(z >> 32);
}
__device__ u64 read40(const u8 *p64) {
  u32 off = (u64)p64 & 3;                 // byte offset from 4-byte alignment
  const u32 *a = (const u32 *)(p64 - off);
  const u32 lo = a[0];
  const u32 hi = a[1];
  return ((((u64)hi << 32) | lo) >> (8 * off)) & 0xffffffffffULL; // keep only the 40-bit word
}
Not being a CUDA expert, I wonder to what extent these routines can be optimized.
I also wonder how much slower they are than aligned 64-bit reads/writes.
Please let me know if you can shed light on these questions.
regards,
-John