I’m doing various image processing tasks in CUDA, but my capture device outputs 24-bit RGB. Because of memory alignment, it’s much faster for CUDA kernels to work with 32-bit RGBA.
The naive way to convert from 24 to 32 bits is to copy and pad each texel. The problem is that this causes badly uncoalesced reads from global memory. No power of two is a multiple of 3, so a 3-byte texel stride never lines up with the aligned 4-, 8-, or 16-byte accesses the hardware wants, and I can’t think of a way around this. Can anyone think of a clever trick?
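For reference, here is a minimal sketch of the naive per-texel kernel I mean. It assumes tightly packed 8-bit RGB with no row padding and fills alpha with 255; both are my assumptions, and the function names are made up for illustration. Error checking is left out to keep it short.

```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// One thread per texel: three single-byte loads at a 3-byte stride,
// followed by one aligned 4-byte store.
__global__ void rgb_to_rgba_naive(const uint8_t* src, uchar4* dst, int num_texels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_texels) return;
    const uint8_t* p = src + 3 * i;               // start of texel i in the packed input
    dst[i] = make_uchar4(p[0], p[1], p[2], 255);  // alpha = 255 is an assumption
}

// Hypothetical host-side helper: copies the input to the GPU, runs the
// kernel, and copies the RGBA result back.
std::vector<uint8_t> rgb_to_rgba_naive_host(const std::vector<uint8_t>& rgb)
{
    assert(rgb.size() % 3 == 0);                  // whole texels only
    int n = static_cast<int>(rgb.size() / 3);
    std::vector<uint8_t> rgba(4 * static_cast<size_t>(n));
    if (n == 0) return rgba;

    uint8_t* d_src = nullptr;
    uchar4* d_dst = nullptr;
    cudaMalloc(&d_src, rgb.size());
    cudaMalloc(&d_dst, rgba.size());
    cudaMemcpy(d_src, rgb.data(), rgb.size(), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    rgb_to_rgba_naive<<<blocks, threads>>>(d_src, d_dst, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(rgba.data(), d_dst, rgba.size(), cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
    return rgba;
}
```

The writes are fine here (one aligned `uchar4` per thread); it is the byte-wise reads at a 3-byte stride that I’m worried about.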
Maybe I could have each thread read 4 texels’ worth of data (12 bytes) and write out 16 bytes, but skip reading the 5th texel and leave a hole for it in the destination data.
So during the first pass, you always read bytes 0-11, bytes 16-27, and so on. This would guarantee coalesced reads from global memory (I think; I haven’t checked my math yet).
Then I would do another pass to deal with the remaining texels. The second pass would be uncoalesced, but since it would only process a quarter of the image, the whole thing should still be considerably faster overall.
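Here is a rough sketch of just the “four texels per thread” part of this idea, without the skipped texel or the second pass: each thread loads three aligned 32-bit words (12 bytes) and stores one 16-byte `uint4`. It assumes little-endian byte order, tightly packed rows, a texel count divisible by 4, and an alpha of 255; the names are hypothetical. I don’t know how well the reads coalesce on any given hardware, so this is something to benchmark, not a known fix.

```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Each thread converts one group of 4 packed RGB texels (three 32-bit words)
// into 4 RGBA texels (one 16-byte store). Byte comments assume little-endian.
__global__ void rgb_to_rgba_x4(const uint32_t* src, uint4* dst, int num_groups)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= num_groups) return;

    uint32_t w0 = src[3 * g + 0];   // bytes: R0 G0 B0 R1
    uint32_t w1 = src[3 * g + 1];   // bytes: G1 B1 R2 G2
    uint32_t w2 = src[3 * g + 2];   // bytes: B2 R3 G3 B3

    const uint32_t A = 0xFF000000u; // alpha = 255 in the top byte (assumption)
    uint4 out;
    out.x = (w0 & 0x00FFFFFFu) | A;                       // R0 G0 B0 A
    out.y = (w0 >> 24) | ((w1 & 0x0000FFFFu) << 8) | A;   // R1 G1 B1 A
    out.z = (w1 >> 16) | ((w2 & 0x000000FFu) << 16) | A;  // R2 G2 B2 A
    out.w = (w2 >> 8) | A;                                // R3 G3 B3 A
    dst[g] = out;
}

// Hypothetical host-side helper; error checking omitted for brevity.
std::vector<uint8_t> rgb_to_rgba_x4_host(const std::vector<uint8_t>& rgb)
{
    assert(rgb.size() % 12 == 0);   // whole groups of 4 texels only
    int groups = static_cast<int>(rgb.size() / 12);
    std::vector<uint8_t> rgba(16 * static_cast<size_t>(groups));
    if (groups == 0) return rgba;

    uint32_t* d_src = nullptr;
    uint4* d_dst = nullptr;
    cudaMalloc(&d_src, rgb.size());
    cudaMalloc(&d_dst, rgba.size());
    cudaMemcpy(d_src, rgb.data(), rgb.size(), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (groups + threads - 1) / threads;
    rgb_to_rgba_x4<<<blocks, threads>>>(d_src, d_dst, groups);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(rgba.data(), d_dst, rgba.size(), cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
    return rgba;
}
```

Because 12 bytes is a multiple of 4, every thread’s loads are word-aligned, and since 12 bytes is exactly 4 packed texels, no texel straddles two threads. That is why this variant needs no hole and no second pass.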