Unaligned memory load

I’m working on 8-bit pixel processing and I want to copy some global memory into shared memory for faster access later. I would also like to operate on 4 pixels at a time to get the benefits of coalescing the transfer from global to shared memory. Since I may need to start loading at any byte offset in global memory, I would like to perform an unaligned load, as I can with the intrinsics available on DSPs. Is there such an operation in CUDA, or do I need to resort to shifts and ORs?

Thanks,
Peter

I don’t think I understand the question. By unaligned, do you mean a group of bits that is not byte-aligned? Extracting an odd range such as bits 3:10 from an int requires shifting and masking.
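
For reference, the shift-and-mask extraction mentioned above could be sketched roughly as follows. The helper name `extract_bits` and its inclusive `lo`/`hi` convention (bit 0 is the least significant) are assumptions made for this illustration, not something from the thread or from the CUDA API. It is written as plain C++; inside a kernel the same body would presumably be marked `__device__`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns bits lo..hi (inclusive) of `word`,
// shifted down so that bit `lo` ends up in bit 0 of the result.
inline std::uint32_t extract_bits(std::uint32_t word, unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi < 32);
    const unsigned width = hi - lo + 1;
    // Build the mask in 64 bits so that width == 32 never shifts
    // a 32-bit value by its full size.
    const std::uint32_t mask =
        static_cast<std::uint32_t>((std::uint64_t{1} << width) - 1);
    return (word >> lo) & mask;
}
```

With this convention, bits 3:10 of a word `w` would be read as `extract_bits(w, 3, 10)`.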

Or, if you mean loading larger types, loading a 32-bit integer requires the pointer to be 4-byte aligned, and loading a float4 requires the pointer to be 16-byte aligned. Misalignment can cause garbled data or crashes. I don’t think there is a way to get around this other than manually shuffling the bytes around yourself.
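
To make the alignment rule above concrete, here is a small sketch of how a pointer could be checked before it is cast to a wider type. The names `is_aligned` and `misalignment` are hypothetical helpers invented for this example; nothing here is a CUDA library call, and the code only tests the address, it does not perform the load.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many bytes `ptr` sits past the previous
// `alignment`-byte boundary (0 means it is aligned).
inline std::size_t misalignment(const void* ptr, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment;
}

// Hypothetical helper: true when `ptr` is safe to use for a load that
// needs `alignment`-byte alignment (4 for a 32-bit int, 16 for a float4).
inline bool is_aligned(const void* ptr, std::size_t alignment)
{
    return misalignment(ptr, alignment) == 0;
}
```

A non-zero `misalignment(ptr, 4)` is also the byte offset that a manual byte-shuffling routine would have to compensate for.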

Thanks for the reply. Yes, I’m trying to do the larger loads. I have an image of 8-bit pixels in global memory where I might want to start processing at any offset (let’s say address 0x1 for simplicity). From what I’ve read in this forum and from my own experiments, it is much more efficient to group accesses into blocks of 4 bytes, transfer them to shared memory, and then process. My shared memory is aligned on 32-bit boundaries, so I can work with 4 pixels in a dword, but first I need to transfer the 4 pixels starting at address 0x1 from global memory to shared memory. To do this I’m currently shifting and ORing the 4 bytes into the aligned 32-bit dword in shared memory, but I was hoping for an unaligned access instruction like the ones I’ve seen on DSPs. If there is no such beast, then I’ll just stick with the shift-and-OR technique.

Thanks,

Peter
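
The shift-and-OR approach described in the last post could look roughly like the sketch below. Both function names are hypothetical, and the code is an illustration rather than the poster's actual implementation. The first variant gathers four single bytes; the second assumes the image can also be viewed as an array of aligned 32-bit words with a little-endian byte layout, and builds the result from two aligned word loads. Note that the second variant reads the word after the one containing the first pixel whenever the offset is not a multiple of 4, so that word must exist.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: packs four consecutive pixels starting at `src`
// (any byte address) into one 32-bit word, first pixel in the low byte.
inline std::uint32_t load_u32_bytewise(const std::uint8_t* src)
{
    return  static_cast<std::uint32_t>(src[0])
         | (static_cast<std::uint32_t>(src[1]) << 8)
         | (static_cast<std::uint32_t>(src[2]) << 16)
         | (static_cast<std::uint32_t>(src[3]) << 24);
}

// Hypothetical helper: same packing, but built from aligned 32-bit loads.
// `words` is the pixel buffer viewed as aligned 32-bit words (little-endian
// layout assumed) and `byte_offset` is the index of the first wanted pixel.
inline std::uint32_t load_u32_from_aligned(const std::uint32_t* words,
                                           std::size_t byte_offset)
{
    const std::size_t idx = byte_offset / 4;
    const unsigned shift = static_cast<unsigned>(byte_offset % 4) * 8;
    const std::uint32_t lo = words[idx];
    if (shift == 0) {
        return lo;  // already aligned; also avoids a shift by 32 below
    }
    const std::uint32_t hi = words[idx + 1];
    // Low bytes come from the top of `lo`, high bytes from the bottom of `hi`.
    return (lo >> shift) | (hi << (32 - shift));
}
```

In a kernel these would presumably be `__device__` functions, with the result stored into the aligned 32-bit slot in shared memory. CUDA also documents a `__byte_perm` intrinsic that selects four bytes out of two 32-bit words, which may be able to replace the two shifts and the OR in the second variant; that is not shown here.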