question concerning data alignment

Christoph_John · January 3, 2008, 12:50pm

Hello, I am a newbie in CUDA programming and have a hopefully not too stupid question.
I am working with RGB images which I store in a texture. The pixel data are not yet 32-bit aligned.
I do not want to do the alignment conversion manually. Is it possible to align the data directly when copying to device memory?

Thanks
Christoph

wumpus · January 3, 2008, 1:05pm

Through CUDA-GL interoperability you could do it, but I think doing the conversion manually is more efficient if the results aren’t in a GL texture or framebuffer to begin with.

What about storing your data as RGBX throughout your entire program? It’s more efficient, even for CPU.
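
For illustration, the manual conversion to RGBX could look roughly like the host-side C++ sketch below. The function name `pad_rgb_to_rgbx` and the use of `std::vector` are assumptions made for this example, not something from the posts above; it only assumes tightly packed 8-bit RGB input and writes 0 into the padding byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (name and signature are assumptions for this sketch):
// expands tightly packed 8-bit RGB into 4-byte RGBX pixels so that every
// pixel starts on a 32-bit boundary. The padding byte X is set to 0.
std::vector<std::uint8_t> pad_rgb_to_rgbx(const std::vector<std::uint8_t>& rgb) {
    assert(rgb.size() % 3 == 0);  // input must hold whole RGB pixels
    const std::size_t pixels = rgb.size() / 3;
    std::vector<std::uint8_t> rgbx(pixels * 4);
    for (std::size_t i = 0; i < pixels; ++i) {
        rgbx[4 * i + 0] = rgb[3 * i + 0];  // R
        rgbx[4 * i + 1] = rgb[3 * i + 1];  // G
        rgbx[4 * i + 2] = rgb[3 * i + 2];  // B
        rgbx[4 * i + 3] = 0;               // X (unused padding)
    }
    return rgbx;
}
```

The resulting buffer is 4/3 the size of the input, and each pixel can then be treated as one 4-byte element (for example `uchar4` on the device).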

Christoph_John · January 4, 2008, 10:19pm

Hello and thanks. It seems I have to do the conversion by hand then.

VanDammage · January 5, 2008, 12:31pm

I’m also working on RGB images a lot.
What worked best for me is to convert the image to RGBA on the GPU.
The value for A is always 0 since it’s just a pseudo channel.

Doing the conversion on the GPU also saves a lot of upload and download bandwidth, plus you get 32-bit memory alignment.

paulius · January 6, 2008, 12:48am

If you really need to work with RGB, for example float3, you can get coalesced reads/writes if you use shared memory. Just read the data as three floats per thread into shared memory (smem), sync, process, then write back as floats again. The idea is that as long as the starting address is aligned, each thread can do three reads, properly offset, and get coalescing on all of them. The throughput you get is the same as for any coalesced access; there’s no perf degradation due to reading/writing via smem.

Paulius
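
The access pattern described above can be sketched on the CPU to make the index arithmetic visible. This is only a simulation under stated assumptions: the `stage` vector stands in for shared memory, the loop over `t` stands in for the threads of one block, and the names `Float3` and `load_block_via_staging` are made up for this example. In pass `k`, thread `t` touches float `k * threads + t`, so neighbouring threads always touch neighbouring floats.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's float3 in this host-only sketch.
struct Float3 { float x, y, z; };

// Simulates one block of `threads` threads loading `threads` packed float3
// values (3 * threads floats) through a staging buffer.
std::vector<Float3> load_block_via_staging(const float* packed, std::size_t threads) {
    assert(packed != nullptr && threads > 0);
    std::vector<float> stage(3 * threads);  // stand-in for shared memory

    // Load phase: three passes. In pass k, "thread" t copies one float at
    // index k * threads + t, so consecutive threads read consecutive floats.
    for (std::size_t k = 0; k < 3; ++k) {
        for (std::size_t t = 0; t < threads; ++t) {
            stage[k * threads + t] = packed[k * threads + t];
        }
    }

    // (On the GPU a barrier would go here before any thread reads the stage.)

    // Use phase: "thread" t picks up its own float3 from the staging buffer.
    std::vector<Float3> out(threads);
    for (std::size_t t = 0; t < threads; ++t) {
        out[t] = Float3{stage[3 * t], stage[3 * t + 1], stage[3 * t + 2]};
    }
    return out;
}
```

Writing back would reverse the two phases: each thread stores its float3 into the staging buffer, and three offset passes then copy the floats out.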

Christoph_John · January 6, 2008, 1:33pm

Hello, I am not sure if I got that right or whether we are talking about the same thing. My RGB values are of type unsigned char.

Therefore I thought padding to RGBA and reading each pixel as uchar4 was the only way to get coalesced reads/writes.

Christoph

sphyraena · January 6, 2008, 1:46pm

Read 4 bytes per thread into shared memory. So first thread will read RGBR. Next thread will read GBRG. And the last thread will read BRGB. These are coalesced reads. Now sync.

Once they are in shared memory, you can access the RGB data relatively more quickly than you can through global memory. Do your computation and then reverse the process when writing back to global memory.
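
A host-side C++ sketch of that byte layout may help. It is a simulation, not device code: the `stage` vector plays the role of shared memory, the loop over `t` plays the role of the loading threads, and the names `stage_block` and `pixel_from_stage` are assumptions for this example. For a block of 4 pixels (12 bytes), three 4-byte loads fetch RGBR, GBRG and BRGB, after which pixel `p` sits at bytes `3p` to `3p + 2`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Load phase: "thread" t copies the t-th 4-byte word of the block into the
// staging buffer. pixels_per_block must be a multiple of 4 so that the block
// covers a whole number of 4-byte words (3 * pixels / 4 of them).
std::vector<std::uint8_t> stage_block(const std::uint8_t* packed_rgb,
                                      std::size_t pixels_per_block) {
    assert(pixels_per_block % 4 == 0);
    const std::size_t words = pixels_per_block * 3 / 4;
    std::vector<std::uint8_t> stage(words * 4);  // stand-in for shared memory
    for (std::size_t t = 0; t < words; ++t) {
        std::memcpy(&stage[4 * t], packed_rgb + 4 * t, 4);  // one word per thread
    }
    return stage;
}

// Use phase (after the sync): pixel p occupies bytes 3p, 3p+1, 3p+2.
std::array<std::uint8_t, 3> pixel_from_stage(const std::vector<std::uint8_t>& stage,
                                             std::size_t p) {
    assert(3 * p + 2 < stage.size());
    std::array<std::uint8_t, 3> px = {{stage[3 * p], stage[3 * p + 1], stage[3 * p + 2]}};
    return px;
}
```

Note that only three quarters of the pixel count is needed as word loads, which is why the block size has to divide evenly by 4 in this scheme.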

paulius · January 7, 2008, 2:11am

Ah, I did miss that. In that case you can take the approach sphyraena mentions above. It’ll be straightforward as long as the number of threads per block is a multiple of 4.

Paulius

Christoph_John · January 7, 2008, 3:37pm

Thanks for the reply, this is how I finally did it.

Christoph
