coalesced access of a struct of double's is this rite?

nitin.life · June 26, 2009, 6:25pm

I have struct like this

typedef struct __align__(16){		

		   double x;

		   double y;

		   double z;

		   double u;

		   double v;

		   double w;

		   double h;

		   }vec_space;

each threads access the whole struct like this…

unsigned int indx = blockIdx.x*blockDim.x+threadIdx.x; // 1 register

	 Y[indx].x = ((vec_space*)gpu_space)[indx].x;

	 Y[indx].y = ((vec_space*)gpu_space)[indx].y;

	 Y[indx].z = ((vec_space*)gpu_space)[indx].z;

	 Y[indx].u = ((vec_space*)gpu_space)[indx].u;

	 Y[indx].v = ((vec_space*)gpu_space)[indx].v;

	 Y[indx].w = ((vec_space*)gpu_space)[indx].w;

	 double h  = ((vec_space*)gpu_space)[indx].h;

where Y is a struct in shared memory… like this

typedef struct __align__(16){		

		 double x;

		 double y;

		 double z;

		 double u;

		 double v;

		 double w;

		 }shspace;

__shared__ shspace Y[8*6*Thread_Block_size];

no two threads access the same elements…

are these reads coalesced :unsure: ? ( they are double reads to shared memory I am aware about 2 wa abnk conflicts but I have 100 + flops per thread so they wont matter much…)… or can anyone help me in coming witha better way to do this ?

thansk for all the help :)

nitin.life · June 26, 2009, 9:45pm

also the variable " gpu_space " is in global memory and is passed as a void pointer to the global kernel.

nbdy who can help me ? thanks in advance

avidday · June 26, 2009, 10:10pm

That sort of array of structures cannot coalesce. It will be laid out in linear memory something like

{{x0,y0,z0,u0,v0,w0,pad,pad},

 {x1,y1,z1,u1,v1,w1,pad,pad},...

 ....}

To coalesce, each thread in a half-warp need to read a different element from a contiguous segment of suitably aligned linear memory (should be 128 bytes on a 128 byte boundary in this case). For that to happen, the data would have to be something like

{{x0,x1,x2,x3,x4,x4,x6,x7,.......,xn,pad,..},

{y0,y1,y2,y3,y4,y5,y6,y7,...,yn,pad,..},

 ...}

So that each read in the half warp (ie first x0-x7, then y0-y7,z0-z7,u0-z7,v0-v7,w0-w7, etc) comes from a contiguous block and will coalesce. Which probably means you should use a structure of arrays, rather than an array of structures.

nitin.life · June 26, 2009, 10:59pm

I SEE… thanks… very much.

I am embarrassed at my lack of applicability of C . Actually am basically a Fortran person, hence I don’t know C that much ( array of structure’s? :"> ) . If you don’t mind can you give me an example of the above stated, for a small case ? It would be very helpful , also I would reading on structure of arrays rite now.

Thanks again… :)

nitin.life · June 26, 2009, 11:25pm

Hmm I just read somthn on AOS…

okay so you mean something like this ?

typedef struct __align__(16){		

		double x[N];

		double y[N];

		double z[N];

		double u[N];

		double v[N];

		double w[N];

		double h[N];

		}vec_space;

And then read them as this

Y[indx].x = ((vec_space*)gpu_space).x[indx];

	 Y[indx].y = ((vec_space*)gpu_space).y[indx];

	 Y[indx].z = ((vec_space*)gpu_space).z[indx];

	 Y[indx].u = ((vec_space*)gpu_space).u[indx];

	 Y[indx].v = ((vec_space*)gpu_space).v[indx];

	 Y[indx].w = ((vec_space*)gpu_space).w[indx];

	 double h  = ((vec_space*)gpu_space).h[indx];

but if I am not wrong this again cross the 128 bit boundary easily as each double is a 64 bit word… So … now I am confused again… on this… :(

The solution I think , which maybe wrong (please do tell me ? )… is to use array of these “structures of arrays” but then I would have further 6 different – arrays of structs of arrays External Image . But am not sure if will solve the problem ? And how will you allocate that complicated piece of data type on gpu from host ?

Thanks for your input…

avidday · June 27, 2009, 6:11am

Hmm I just read somthn on AOS…

okay so you mean something like this ?
typedef struct __align__(16){		

		double x[N];

		double y[N];

		double z[N];

		double u[N];

		double v[N];

		double w[N];

		double h[N];

		}vec_space;
And then read them as this
Y[indx].x = ((vec_space*)gpu_space).x[indx];

	 Y[indx].y = ((vec_space*)gpu_space).y[indx];

	 Y[indx].z = ((vec_space*)gpu_space).z[indx];

	 Y[indx].u = ((vec_space*)gpu_space).u[indx];

	 Y[indx].v = ((vec_space*)gpu_space).v[indx];

	 Y[indx].w = ((vec_space*)gpu_space).w[indx];

	 double h  = ((vec_space*)gpu_space).h[indx];
but if I am not wrong this again cross the 128 bit boundary easily as each double is a 64 bit word… So … now I am confused again… on this…

The alignment requirement for the reads to coalesce is that each thread in the half warp reads a successive word from a 16 word boundary aligned segment of linear memory, so in this case the alignment requirement is 128 bytes. If you choose N to be an even multiple of 16 then the reads from that array of structures should coalesce.

nitin.life · June 27, 2009, 12:49pm

hmm yes that makes more sense (bytes) . Thanks very much for your help on this. It helped me very much.

I have one more noob External Image question… what if I have struct for a global variable like this

typedef struct __align__(16){	

		double x[N];

		}BIGspace[42];

Now in this case if each thread needs to access all 42 elements and all N threads access different 42 elements. N is a even multiple of 16 and number of threads in a block is 64 or 128.

So will this access be coalesced ? I am trying to learn this coalescing concept to the core and this doubt came to my mind when… I was thinking about the previous problem.

avidday · June 27, 2009, 2:54pm

I am sorry, but I don’t understand what access pattern you are trying to describe. The rules for coalescing are defined in detail in Chapter 5 of the programming guide that ships in the SDK.

nitin.life · June 27, 2009, 6:36pm

Hmm Yes I read that today thanks.

Sorry for the confusion. The access pattern I was talking about is something like this:

EACH THREAD does this

typedef struct __align__(16){	

		double x[N];

		}BIGspace[42];

unsigned int indx = blockIdx.x*blockDim.x+threadIdx.x;

for(int i=0;i<42;i++){

((BIGspace*)ff)[i].x[indx] = read some value from local memory or shared memory;

}

The problem is am unsure how the stuct BIGspace is laid out in memory…

avidday · June 27, 2009, 9:14pm

Those stores should coalesce. BIGspace should be laid out as 42 linear segments of N doubles, and each thread in a given half warp is storing one of 16 double words in one of those aligned segments. You should be able to put together a little prototype and confirm it using the profiler.

nitin.life · June 27, 2009, 9:33pm

Sure will do that… today…

thanks… for all the help … really appreciate it

External Image

Sarnath · June 28, 2009, 2:36pm

N I T I N,

just treat that array as a big array of integers & just copy out in 32-bit coalesced way using integer pointers…

be sure to __threadfence and __syncthreads b4 starting to use them…

after all, mem is array of bytes… u can load it anyway u want…

hth

nitin.life · June 28, 2009, 4:48pm

But my data is double type… how would integers help ? Do you mean reading each double word – int value by value ? and then later forming doubles from them ?

Thanks…

Sarnath · June 29, 2009, 5:11am

N I T I N,

After all all arrays (be it structures of doubles or arrays or whatever…) are nothing but streams of bytes…So, if u replicate that stream of bytes locally then you can still access them as structures or doubles or whatever.

For copying alone, you can copy them as integers (or charcters – but integers give better coalescing…)…

But as I can see, your data-structure in smem is NOT a replicate of the one in global memory. It has 1 double less… Grr…

If u can modify the global memory structure, the same as smem structure then you can just copy it all like integers and then use the correct type pointers to access them.

nitin.life · June 29, 2009, 10:10am

Thanks Sarnath… I understand now what you mean.

I can modify the smem structure to hold one more double…

thanks all… for the help :)

Topic		Replies	Views
How to coalesce memory access array of STRUCT CUDA Programming and Performance	17	1551	November 3, 2010
Coalesced Memory Access to Structs CUDA Programming and Performance	11	4713	September 19, 2009
Practical rules for coalesced memory access ? CUDA Programming and Performance	4	5587	September 13, 2008
Getting coalesced reads and stores using a struct CUDA Programming and Performance	3	3460	April 20, 2010
Coalesced structures possible? CUDA Programming and Performance	3	2157	February 1, 2016
Memory coalescing and multiple arrays CUDA Programming and Performance	23	11849	March 20, 2009
Loading structured data efficiently using CUDA can this be right? CUDA Programming and Performance	8	24373	November 9, 2009
Coalesced Memory access related doubt CUDA Programming and Performance	13	2172	December 9, 2010
coalesced access to global memory CUDA Programming and Performance	6	1225	May 8, 2014
Additional requirements for coalesced reads/writes structs must map to built-in type CUDA Programming and Performance	7	3194	December 31, 2007

coalesced access of a struct of double's is this rite?

Related topics