Question concerning data alignment

Hello, I am new to CUDA programming and have a hopefully not too stupid question.
I am working with RGB images which I store in a texture. The pixel data are not yet 32-bit aligned.
I do not want to do the alignment conversion manually. Is there a way to have the data aligned directly during the copy to device memory?


Through CUDA-GL interoperability you could do it, but I think doing the conversion manually is more efficient if the results aren’t in a GL texture or framebuffer to begin with.

What about storing your data as RGBX throughout your entire program? It’s more efficient, even on the CPU.

Hello and thanks. It seems I have to do the conversion by hand then.

I’m also working on RGB images a lot.
What worked best for me is to convert the image to RGBA on the GPU.
The value for A is always 0 since it’s just a pseudo channel.

Doing the conversion on the GPU also saves a lot of upload and download bandwidth, since only three bytes per pixel have to be transferred, and you get 32-bit memory alignment.
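For anyone reading along, here is a minimal sketch of what such a GPU-side conversion could look like. It is not the poster's actual code: the names rgbToRgba and uploadAsRgba are made up for illustration, error checking is left out, and the byte-wise reads of the packed input are probably not ideally coalesced. Only the uchar4 writes are guaranteed to be 32-bit aligned.

```cuda
#include <cuda_runtime.h>

// One thread per pixel: read three packed bytes, write one aligned uchar4.
// A is set to 0 because it is only a padding channel.
__global__ void rgbToRgba(const unsigned char* rgb, uchar4* rgba, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {
        rgba[i] = make_uchar4(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2], 0);
    }
}

// Hypothetical host helper: uploads the packed RGB image (3 bytes per pixel),
// expands it on the device and returns a device pointer to the RGBA image.
// The caller is responsible for calling cudaFree on the result.
uchar4* uploadAsRgba(const unsigned char* hostRgb, int numPixels)
{
    size_t rgbBytes = 3 * static_cast<size_t>(numPixels);
    unsigned char* dRgb = nullptr;
    uchar4* dRgba = nullptr;
    cudaMalloc(&dRgb, rgbBytes);
    cudaMalloc(&dRgba, numPixels * sizeof(uchar4));
    cudaMemcpy(dRgb, hostRgb, rgbBytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (numPixels + threads - 1) / threads;
    rgbToRgba<<<blocks, threads>>>(dRgb, dRgba, numPixels);
    cudaDeviceSynchronize();

    cudaFree(dRgb);  // the packed copy is no longer needed
    return dRgba;
}
```

The returned buffer can then be read as uchar4 by later kernels or bound to a texture.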

If you really need to work with RGB, for example float3, you can get coalesced reads/writes if you use shared memory. Just read the data as three separate floats per thread into smem, sync, process, then write back as floats again. The idea is that as long as the starting address is aligned, each thread can do three reads, properly offset, and all of them coalesce. The throughput you get is the same as for any coalesced access; there’s no perf degradation from reading/writing via smem.
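A rough sketch of that float3 pattern, under a few assumptions: the image is a flat array of floats (R, G, B per pixel), the kernel is launched with exactly THREADS threads per block, and the per-pixel work is just swapping R and B as a placeholder. The names swapRBStaged and swapRBOnDevice are hypothetical and error checking is omitted.

```cuda
#include <cuda_runtime.h>

#define THREADS 128  // threads per block; the kernel assumes blockDim.x == THREADS

// Each block handles THREADS pixels, i.e. 3 * THREADS consecutive floats.
__global__ void swapRBStaged(float* data, int numPixels)
{
    __shared__ float s[3 * THREADS];
    int totalFloats = 3 * numPixels;
    int base = blockIdx.x * 3 * THREADS;
    int t = threadIdx.x;

    // Three loads per thread, offset by THREADS each time, so that
    // neighbouring threads always touch neighbouring floats.
    for (int k = 0; k < 3; ++k) {
        int g = base + k * THREADS + t;
        if (g < totalFloats) s[k * THREADS + t] = data[g];
    }
    __syncthreads();

    // Now every thread owns one whole pixel in shared memory.
    int pixel = blockIdx.x * THREADS + t;
    if (pixel < numPixels) {
        float r = s[3 * t];
        s[3 * t] = s[3 * t + 2];
        s[3 * t + 2] = r;
    }
    __syncthreads();

    // Write back with the same access pattern as the loads.
    for (int k = 0; k < 3; ++k) {
        int g = base + k * THREADS + t;
        if (g < totalFloats) data[g] = s[k * THREADS + t];
    }
}

// Hypothetical host helper: processes hostRgb (3 * numPixels floats) in place.
void swapRBOnDevice(float* hostRgb, int numPixels)
{
    size_t bytes = 3 * static_cast<size_t>(numPixels) * sizeof(float);
    float* d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, hostRgb, bytes, cudaMemcpyHostToDevice);
    int blocks = (numPixels + THREADS - 1) / THREADS;
    swapRBStaged<<<blocks, THREADS>>>(d, numPixels);
    cudaMemcpy(hostRgb, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```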


Hello, I am not sure whether I understood that correctly or whether we are talking about the same thing. My RGB values are of type unsigned char.

Therefore I thought padding to RGBA and reading each pixel as uchar4 was the only way to get coalesced reads/writes.


Read 4 bytes per thread into shared memory. So the first thread will read RGBR, the next thread will read GBRG, and the third thread will read BRGB, after which the pattern repeats. These are coalesced reads. Now sync.

Once the bytes are in shared memory, you can access the RGB data much more quickly than you can through global memory. Do your computation and then reverse the process when writing back to global memory.
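Roughly, that could look like the sketch below. The assumptions here are mine: the block size BLOCK is a multiple of 4, the kernel is launched with exactly BLOCK threads per block, the pixel count is a multiple of 4 so the image is a whole number of 32-bit words, and the per-pixel work is a simple channel inversion as a placeholder. The names invertRgbStaged and invertRgbOnDevice are hypothetical and error checking is omitted.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; must be a multiple of 4

// Each block handles BLOCK pixels = 3 * BLOCK bytes = 3 * BLOCK / 4 words.
__global__ void invertRgbStaged(unsigned int* data, int numPixels)
{
    __shared__ unsigned int words[3 * BLOCK / 4];
    unsigned char* bytes = reinterpret_cast<unsigned char*>(words);

    int totalWords = 3 * numPixels / 4;  // assumes numPixels % 4 == 0
    int wordBase = blockIdx.x * (3 * BLOCK / 4);
    int t = threadIdx.x;
    bool hasWord = (t < 3 * BLOCK / 4) && (wordBase + t < totalWords);

    // The first three quarters of the threads each load one 32-bit word
    // (RGBR, GBRG, BRGB, ...), so neighbouring threads read neighbouring words.
    if (hasWord) words[t] = data[wordBase + t];
    __syncthreads();

    // Every thread now works on its own pixel, byte by byte, in shared memory.
    int pixel = blockIdx.x * BLOCK + t;
    if (pixel < numPixels) {
        for (int c = 0; c < 3; ++c) {
            bytes[3 * t + c] = static_cast<unsigned char>(255 - bytes[3 * t + c]);
        }
    }
    __syncthreads();

    // Reverse the process: write the words back with the same pattern.
    if (hasWord) data[wordBase + t] = words[t];
}

// Hypothetical host helper: processes hostRgb (3 * numPixels bytes) in place.
void invertRgbOnDevice(unsigned char* hostRgb, int numPixels)
{
    size_t bytes = 3 * static_cast<size_t>(numPixels);
    unsigned int* d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, hostRgb, bytes, cudaMemcpyHostToDevice);
    int blocks = (numPixels + BLOCK - 1) / BLOCK;
    invertRgbStaged<<<blocks, BLOCK>>>(d, numPixels);
    cudaMemcpy(hostRgb, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

Because 3 * BLOCK is a multiple of 4, every block starts on a word boundary of the cudaMalloc'ed buffer.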

Ah, I did miss that. In that case you can take the approach sphyraena mentions above. It’ll be straightforward as long as the number of threads per block is a multiple of 4.


Thanks for the reply, this is how I finally did it.