coalesced access of a struct of double's is this rite?

I have struct like this

typedef struct __align__(16){		

		   double x;

		   double y;

		   double z;

		   double u;

		   double v;

		   double w;

		   double h;

		   }vec_space;

each threads access the whole struct like this…

unsigned int indx = blockIdx.x*blockDim.x+threadIdx.x; // 1 register

	 Y[indx].x = ((vec_space*)gpu_space)[indx].x;

	 Y[indx].y = ((vec_space*)gpu_space)[indx].y;

	 Y[indx].z = ((vec_space*)gpu_space)[indx].z;

	 Y[indx].u = ((vec_space*)gpu_space)[indx].u;

	 Y[indx].v = ((vec_space*)gpu_space)[indx].v;

	 Y[indx].w = ((vec_space*)gpu_space)[indx].w;

	 double h  = ((vec_space*)gpu_space)[indx].h;

where Y is a struct in shared memory… like this

typedef struct __align__(16){		

		 double x;

		 double y;

		 double z;

		 double u;

		 double v;

		 double w;

		 }shspace;

__shared__ shspace Y[8*6*Thread_Block_size];

no two threads access the same elements…

are these reads coalesced :unsure: ? ( they are double reads to shared memory I am aware about 2 wa abnk conflicts but I have 100 + flops per thread so they wont matter much…)… or can anyone help me in coming witha better way to do this ?

thansk for all the help :)

also the variable " gpu_space " is in global memory and is passed as a void pointer to the global kernel.

nbdy who can help me ? thanks in advance

That sort of array of structures cannot coalesce. It will be laid out in linear memory something like

{{x0,y0,z0,u0,v0,w0,pad,pad},

 {x1,y1,z1,u1,v1,w1,pad,pad},...

 ....}

To coalesce, each thread in a half-warp need to read a different element from a contiguous segment of suitably aligned linear memory (should be 128 bytes on a 128 byte boundary in this case). For that to happen, the data would have to be something like

{{x0,x1,x2,x3,x4,x4,x6,x7,.......,xn,pad,..},

{y0,y1,y2,y3,y4,y5,y6,y7,...,yn,pad,..},

 ...}

So that each read in the half warp (ie first x0-x7, then y0-y7,z0-z7,u0-z7,v0-v7,w0-w7, etc) comes from a contiguous block and will coalesce. Which probably means you should use a structure of arrays, rather than an array of structures.

I SEE… thanks… very much.

I am embarrassed at my lack of applicability of C . Actually am basically a Fortran person, hence I don’t know C that much ( array of structure’s? :"> ) . If you don’t mind can you give me an example of the above stated, for a small case ? It would be very helpful , also I would reading on structure of arrays rite now.

Thanks again… :)

Hmm I just read somthn on AOS…

okay so you mean something like this ?

typedef struct __align__(16){		

		double x[N];

		double y[N];

		double z[N];

		double u[N];

		double v[N];

		double w[N];

		double h[N];

		}vec_space;

And then read them as this

Y[indx].x = ((vec_space*)gpu_space).x[indx];

	 Y[indx].y = ((vec_space*)gpu_space).y[indx];

	 Y[indx].z = ((vec_space*)gpu_space).z[indx];

	 Y[indx].u = ((vec_space*)gpu_space).u[indx];

	 Y[indx].v = ((vec_space*)gpu_space).v[indx];

	 Y[indx].w = ((vec_space*)gpu_space).w[indx];

	 double h  = ((vec_space*)gpu_space).h[indx];

but if I am not wrong this again cross the 128 bit boundary easily as each double is a 64 bit word… So … now I am confused again… on this… :(

The solution I think , which maybe wrong (please do tell me ? )… is to use array of these “structures of arrays” but then I would have further 6 different – arrays of structs of arrays External Image . But am not sure if will solve the problem ? And how will you allocate that complicated piece of data type on gpu from host ?

Thanks for your input…

The alignment requirement for the reads to coalesce is that each thread in the half warp reads a successive word from a 16 word boundary aligned segment of linear memory, so in this case the alignment requirement is 128 bytes. If you choose N to be an even multiple of 16 then the reads from that array of structures should coalesce.

hmm yes that makes more sense (bytes) . Thanks very much for your help on this. It helped me very much.

I have one more noob External Image question… what if I have struct for a global variable like this

typedef struct __align__(16){	

		double x[N];

		}BIGspace[42];

Now in this case if each thread needs to access all 42 elements and all N threads access different 42 elements. N is a even multiple of 16 and number of threads in a block is 64 or 128.

So will this access be coalesced ? I am trying to learn this coalescing concept to the core and this doubt came to my mind when… I was thinking about the previous problem.

I am sorry, but I don’t understand what access pattern you are trying to describe. The rules for coalescing are defined in detail in Chapter 5 of the programming guide that ships in the SDK.

Hmm Yes I read that today thanks.

Sorry for the confusion. The access pattern I was talking about is something like this:

EACH THREAD does this

typedef struct __align__(16){	

		double x[N];

		}BIGspace[42];

unsigned int indx = blockIdx.x*blockDim.x+threadIdx.x;

for(int i=0;i<42;i++){

((BIGspace*)ff)[i].x[indx] = read some value from local memory or shared memory;

}

The problem is am unsure how the stuct BIGspace is laid out in memory…

Those stores should coalesce. BIGspace should be laid out as 42 linear segments of N doubles, and each thread in a given half warp is storing one of 16 double words in one of those aligned segments. You should be able to put together a little prototype and confirm it using the profiler.

Sure will do that… today…

thanks… for all the help … really appreciate it

External Image

N I T I N,

just treat that array as a big array of integers & just copy out in 32-bit coalesced way using integer pointers…

be sure to __threadfence and __syncthreads b4 starting to use them…

after all, mem is array of bytes… u can load it anyway u want…

hth

But my data is double type… how would integers help ? Do you mean reading each double word – int value by value ? and then later forming doubles from them ?

Thanks…

N I T I N,

After all all arrays (be it structures of doubles or arrays or whatever…) are nothing but streams of bytes…So, if u replicate that stream of bytes locally then you can still access them as structures or doubles or whatever.

For copying alone, you can copy them as integers (or charcters – but integers give better coalescing…)…

But as I can see, your data-structure in smem is NOT a replicate of the one in global memory. It has 1 double less… Grr…

If u can modify the global memory structure, the same as smem structure then you can just copy it all like integers and then use the correct type pointers to access them.

Thanks Sarnath… I understand now what you mean.

I can modify the smem structure to hold one more double…

thanks all… for the help :)