How to coalesce memory access array of STRUCT

Hi,

We have the following STRUCT defined. As the number of properties is large, the total size of MyStruct1 can be more than 256 bytes.

typedef struct MyStruct1 {   

  double Value1;

  double Value2;

  int Value3;

  double Value5[Value5Size];	

  int Value6Length;

  double* Value6;

  int Value7Length;

  int* Value7;

  int Struct2Length;   

  MyStruct2* Struct2;   

  double TotalValue;

  bool IsGood;

};

typedef struct MyStruct2 {

	double ValueA;

	double ValueB;

	double ValueC;

	double ValueD;

	double ValueE;

};

As shown in the code, the MyStruct2 has all of its members of Double. The MyStruct1 not only have member of type Double, it also has array of Double, array of Int, array of MyStruct2. These array members are represented by Pointer and another filed specifid the length.

The original data are marshalled from C#, and constructed as an ARRAY of MyStruct1 and each instance of MySTruct1 have array of int, double and MyStruct2 as the members.

The kernel is called from the host as shown below

for(int i=0; i<NumberOfSimualtion; i++)

{

	for(int j=0; j<NumberOfSteps; j++)

	{

		 kernel1<<<60,32>>>(mystruct1, ArrayofMyStruct1Length, ArrayofRandomNumbers);

	}

}

and the kernel code is shown below.

__global__ void RunMonteCarlo(MyStruct1 *mystruct1, long ArrayofMyStruct1Length, double *randomnumbers)

{

		  int i = threadIdx.x + blockIdx.x * blockDim.x;	 	

		  int stride = blockDim.x * gridDim.x;

		   while (i < ArrayofMyStruct1Length) 

		   {

					   //work out the Value6 Return					   

					   for (int index = 0; index < mystruct1[i].Value6ength; index++)

					   {

							Value6Return += mystruct1[i].Value6[index]*mystruct1[i].Value1*randomnumbers[1];

					   }

					   

					   //work out the Value7 Return					   

					   for (int index = 0; index < mystruct1[i].Value7ength; index++)

					   {

							Value7Return += mystruct1[i].Value7[index]*mystruct1[i].Value2*randomnumbers[2];

					   }

																			 

					  //work out the Struct2 Return

					  for(int p = 0; p < mystruct1[i].Struct2Length; p++)

					  {

							MyStruct2 struct2 = mystruct1[i].Struct2[p];

							struct2Return += struct2.ValueA * struct2.ValueB * randomnumbers[3];

					   }

					   mystruct1[i].TotalValue = Value6Return + Value7Return + struct2Return;

					   i += stride;

		 }

}

The kernel is currently having not very good running performance due to the way the data is structued.

In order to coalesce memroy access to global memory, should I break the ARRAY of STRUCT into STRUCT of ARRAY. As the MyStruct1’s members are not just simple double or integer, it also haa array of double, array of MyStruct2 and the length of them are different from instance to instance, how could I handle them?

Thanks

Hi,

We have the following STRUCT defined. As the number of properties is large, the total size of MyStruct1 can be more than 256 bytes.

typedef struct MyStruct1 {   

  double Value1;

  double Value2;

  int Value3;

  double Value5[Value5Size];	

  int Value6Length;

  double* Value6;

  int Value7Length;

  int* Value7;

  int Struct2Length;   

  MyStruct2* Struct2;   

  double TotalValue;

  bool IsGood;

};

typedef struct MyStruct2 {

	double ValueA;

	double ValueB;

	double ValueC;

	double ValueD;

	double ValueE;

};

As shown in the code, the MyStruct2 has all of its members of Double. The MyStruct1 not only have member of type Double, it also has array of Double, array of Int, array of MyStruct2. These array members are represented by Pointer and another filed specifid the length.

The original data are marshalled from C#, and constructed as an ARRAY of MyStruct1 and each instance of MySTruct1 have array of int, double and MyStruct2 as the members.

The kernel is called from the host as shown below

for(int i=0; i<NumberOfSimualtion; i++)

{

	for(int j=0; j<NumberOfSteps; j++)

	{

		 kernel1<<<60,32>>>(mystruct1, ArrayofMyStruct1Length, ArrayofRandomNumbers);

	}

}

and the kernel code is shown below.

__global__ void RunMonteCarlo(MyStruct1 *mystruct1, long ArrayofMyStruct1Length, double *randomnumbers)

{

		  int i = threadIdx.x + blockIdx.x * blockDim.x;	 	

		  int stride = blockDim.x * gridDim.x;

		   while (i < ArrayofMyStruct1Length) 

		   {

					   //work out the Value6 Return					   

					   for (int index = 0; index < mystruct1[i].Value6ength; index++)

					   {

							Value6Return += mystruct1[i].Value6[index]*mystruct1[i].Value1*randomnumbers[1];

					   }

					   

					   //work out the Value7 Return					   

					   for (int index = 0; index < mystruct1[i].Value7ength; index++)

					   {

							Value7Return += mystruct1[i].Value7[index]*mystruct1[i].Value2*randomnumbers[2];

					   }

																			 

					  //work out the Struct2 Return

					  for(int p = 0; p < mystruct1[i].Struct2Length; p++)

					  {

							MyStruct2 struct2 = mystruct1[i].Struct2[p];

							struct2Return += struct2.ValueA * struct2.ValueB * randomnumbers[3];

					   }

					   mystruct1[i].TotalValue = Value6Return + Value7Return + struct2Return;

					   i += stride;

		 }

}

The kernel is currently having not very good running performance due to the way the data is structued.

In order to coalesce memroy access to global memory, should I break the ARRAY of STRUCT into STRUCT of ARRAY. As the MyStruct1’s members are not just simple double or integer, it also haa array of double, array of MyStruct2 and the length of them are different from instance to instance, how could I handle them?

Thanks

if you actually are accessing all the members of the struct at about the same time then it would benefit bandwidth usage and caching to have them all around the same memory location.
e…g. if you have 8 separate arrays of 32-bit values, and a 256-bit memory bus and cache-line size, that’s going to amount to 8 separate 256-bit memory accesses that each read/cache 256/32 = 8 elements at a time. putting the values consecutive in memory and aligned on a 256-bit boundary will change that to 1 256-bit memory that reads/caches 256/(32*8)=1 element at a time. this way you only take what you really need at the time.

the first thing i would do to improve alignment is move your biggest data elements up to the top. and then where possible make sure there’s an even number of them or even better a larger power of 2.

then you can split it up to fix any problems. for instance, your struct2 has 5 doubles. you can split it up into a struct of 4 doubles and another of 1 double.

but then there’s the problem of coalescing.

the question becomes what order are you going to read them in? if each thread is going to handle a consecutive struct, then yes, it would be much better to “stripe” your structs, so to speak. and by that i mean yes, use a struct of arrays instead. such that thread0 and thread1 are always accessing neighboring memory locations, and so on for all the threads so there’s never any bank conflicts.

and then since you are actually running at least 32 structs at a time, then you should be reading in 8 - nay, 32 elements at a time, as described in my first paragraph. thus the memory accesses feed the whole warp in parrallel i.e. are “coalesced”.

if you actually are accessing all the members of the struct at about the same time then it would benefit bandwidth usage and caching to have them all around the same memory location.
e…g. if you have 8 separate arrays of 32-bit values, and a 256-bit memory bus and cache-line size, that’s going to amount to 8 separate 256-bit memory accesses that each read/cache 256/32 = 8 elements at a time. putting the values consecutive in memory and aligned on a 256-bit boundary will change that to 1 256-bit memory that reads/caches 256/(32*8)=1 element at a time. this way you only take what you really need at the time.

the first thing i would do to improve alignment is move your biggest data elements up to the top. and then where possible make sure there’s an even number of them or even better a larger power of 2.

then you can split it up to fix any problems. for instance, your struct2 has 5 doubles. you can split it up into a struct of 4 doubles and another of 1 double.

but then there’s the problem of coalescing.

the question becomes what order are you going to read them in? if each thread is going to handle a consecutive struct, then yes, it would be much better to “stripe” your structs, so to speak. and by that i mean yes, use a struct of arrays instead. such that thread0 and thread1 are always accessing neighboring memory locations, and so on for all the threads so there’s never any bank conflicts.

and then since you are actually running at least 32 structs at a time, then you should be reading in 8 - nay, 32 elements at a time, as described in my first paragraph. thus the memory accesses feed the whole warp in parrallel i.e. are “coalesced”.

Thanks for the reply.

As each MyStruct1 can have different number of MyStruct2 in the Struct2 member, how could I translate that into an array to allow memory coalesce?

Thanks for the reply.

As each MyStruct1 can have different number of MyStruct2 in the Struct2 member, how could I translate that into an array to allow memory coalesce?

what i would do first is figure out how you’re going to be distributing the processing among the threads. if each thread has a different struct1 and they’re operating on the struct2s, well then you’re going to have tons of warp divergence and that’s no good. so i’d figure how you want to handle that first. once you get the processing order down, the memory layout should follow.

what i would do first is figure out how you’re going to be distributing the processing among the threads. if each thread has a different struct1 and they’re operating on the struct2s, well then you’re going to have tons of warp divergence and that’s no good. so i’d figure how you want to handle that first. once you get the processing order down, the memory layout should follow.

i’d say use the associative/communative property of addition, esp. if there are many struct2s for each struct1.

do two passes. first, go through all of struct2 linearly and keep a running total.

struct2[i].running_total = running_total;

   running_total += struct2[i].ValueA * struct2[i].ValueB; // * randomnumbers[3]; defer this last multiplication till later (distributive property of multiplication)

then on the second pass you deal just with struct1, and all you have to know is the running total for it’s first struct2 and it’s last, and then subtract the two.

struct2return = (struct2[struct1.last_struct2].running_total - struct2[struct1.first_struct2].running_total) * randomnumbers[3];

this eliminates any warp divergence. now you still have make the first pass data-parallel. and the second pass memory access is random which could be a latency issue (but ordered from low to high). but it’s a step in the right direction, i believe.

i’d say use the associative/communative property of addition, esp. if there are many struct2s for each struct1.

do two passes. first, go through all of struct2 linearly and keep a running total.

struct2[i].running_total = running_total;

   running_total += struct2[i].ValueA * struct2[i].ValueB; // * randomnumbers[3]; defer this last multiplication till later (distributive property of multiplication)

then on the second pass you deal just with struct1, and all you have to know is the running total for it’s first struct2 and it’s last, and then subtract the two.

struct2return = (struct2[struct1.last_struct2].running_total - struct2[struct1.first_struct2].running_total) * randomnumbers[3];

this eliminates any warp divergence. now you still have make the first pass data-parallel. and the second pass memory access is random which could be a latency issue (but ordered from low to high). but it’s a step in the right direction, i believe.

Hi,

Thanks for the reply.

The progressing logic is actually like below

__global__ void RunMonteCarlo(MyStruct1 *mystruct1, long ArrayofMyStruct1Length, double *randomnumbers)

{

		  int i = threadIdx.x + blockIdx.x * blockDim.x;		 

		  int stride = blockDim.x * gridDim.x;

		   while (i < ArrayofMyStruct1Length) 

		   {

					   //work out the Value6 Return					   

					   for (int index = 0; index < mystruct1[i].Value6ength; index++)

					   {

							Value6Return += mystruct1[i].Value6[index]*mystruct1[i].Value1*randomnumbers[1];

					   }

					   

					   //work out the Value7 Return					   

					   for (int index = 0; index < mystruct1[i].Value7ength; index++)

					   {

							Value7Return += mystruct1[i].Value7[index]*mystruct1[i].Value2*randomnumbers[2];

					   }

						   

					  mystruct1[i].TotalValue = Value6Return + Value7Return;

					  if(mystruct1[i].TotalValue > mystruct1[i].Value1)

					  {

							//work out the Struct2 Return

						   for(int p = 0; p < mystruct1[i].Struct2Length; p++)

						   {

								MyStruct2 struct2 = mystruct1[i].Struct2[p];

								struct2Return += struct2.ValueA * struct2.ValueB * randomnumbers[3];

						   }

					  }

					   

					   i += stride;

		 }

}

As shown in the code, the if statment if(mystruct1[i].TotalValue > mystruct1[i].Value1) will cause different path within the warp.

How could I make the memroy access coalesce?

Hi,

Thanks for the reply.

The progressing logic is actually like below

__global__ void RunMonteCarlo(MyStruct1 *mystruct1, long ArrayofMyStruct1Length, double *randomnumbers)

{

		  int i = threadIdx.x + blockIdx.x * blockDim.x;		 

		  int stride = blockDim.x * gridDim.x;

		   while (i < ArrayofMyStruct1Length) 

		   {

					   //work out the Value6 Return					   

					   for (int index = 0; index < mystruct1[i].Value6ength; index++)

					   {

							Value6Return += mystruct1[i].Value6[index]*mystruct1[i].Value1*randomnumbers[1];

					   }

					   

					   //work out the Value7 Return					   

					   for (int index = 0; index < mystruct1[i].Value7ength; index++)

					   {

							Value7Return += mystruct1[i].Value7[index]*mystruct1[i].Value2*randomnumbers[2];

					   }

						   

					  mystruct1[i].TotalValue = Value6Return + Value7Return;

					  if(mystruct1[i].TotalValue > mystruct1[i].Value1)

					  {

							//work out the Struct2 Return

						   for(int p = 0; p < mystruct1[i].Struct2Length; p++)

						   {

								MyStruct2 struct2 = mystruct1[i].Struct2[p];

								struct2Return += struct2.ValueA * struct2.ValueB * randomnumbers[3];

						   }

					  }

					   

					   i += stride;

		 }

}

As shown in the code, the if statment if(mystruct1[i].TotalValue > mystruct1[i].Value1) will cause different path within the warp.

How could I make the memroy access coalesce?

unless Value6ength & 7ength is the same for all struct1’s, you’re getting warp divergence there, too. and if it is the same then you shouldn’t be repeating it all over the place as that wastes bandwidth and space.

that’s why i think you’ll really have to do a more radical transformation to your memory layout, including breaking your structs up into multiple arrays.

and possibly breaking the kernel up into multiple passes.

unless Value6ength & 7ength is the same for all struct1’s, you’re getting warp divergence there, too. and if it is the same then you shouldn’t be repeating it all over the place as that wastes bandwidth and space.

that’s why i think you’ll really have to do a more radical transformation to your memory layout, including breaking your structs up into multiple arrays.

and possibly breaking the kernel up into multiple passes.

put all the struct2’s in a single linear array, then have struct1 point to an index in the array on where to start, and either how many elements from there or the last index in the array it uses. i.e. have struct1 store “pointers” into a linear array.

put all the struct2’s in a single linear array, then have struct1 point to an index in the array on where to start, and either how many elements from there or the last index in the array it uses. i.e. have struct1 store “pointers” into a linear array.