global memory update overhead

Hi, we have an array of struct PurchaseOrder in global memory.

typedef struct PurchaseOrder
{
	double value1;
	double value2;
	double value3;
	double total;
} PurchaseOrder;

PurchaseOrder *purchaseorder;

We passed this array of structs into our kernel function by reference (pointer).

__global__ void kernel(PurchaseOrder *purchaseOrder)

In the kernel, we calculate the sum of the struct's members and write the total back into another member of the same struct.

purchaseOrder[0].total += purchaseOrder[0].value1 + purchaseOrder[0].value2 + purchaseOrder[0].value3;

We made the array contain only one PurchaseOrder to avoid multiple threads accessing global memory at the same time.

It runs okay with the line that assigns the total, but it takes 250 ms.

If we comment out that line, it only takes 0.2 ms.

Where is that overhead coming from?

Thanks

The difference between the two execution times is due to compiler optimization. Once you remove the write to global memory, the compiler is smart enough to optimize away all of the preceding calculations leading to that result ("dead code elimination").

It is impossible to say more without seeing some actual code, but it is likely that the underlying poor performance is caused by uncoalesced memory access.
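For illustration, here is a minimal sketch (with made-up kernel names, not your actual code) of what happens: once the result is never stored anywhere observable, the compiler is free to delete the loads and arithmetic that feed it.

__global__ void with_store(const PurchaseOrder *po, double *out)
{
	double sum = po[0].value1 + po[0].value2 + po[0].value3;
	out[0] = sum;   // the store keeps the computation alive
}

__global__ void without_store(const PurchaseOrder *po, double *out)
{
	double sum = po[0].value1 + po[0].value2 + po[0].value3;
	(void)sum;      // no store: 'sum' is dead, so the loads and adds above are
	                // eliminated and the kernel compiles to almost nothing,
	                // which is why the timing collapses to ~0.2 ms
}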

Hi avidday,

Please see the code below. The line with the block comment is causing the problem. I will take the Simulation and Step loops out later (per the advice you gave in another post), but they are both set to 1 at the moment.

__global__ void RunMonteCarlo(PurchaseOrder *purchaseorders, long devPurchaseOrderLength, SimulationOptions hostSimulationOptions)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	int stride = blockDim.x * gridDim.x;

	while (i < devPurchaseOrderLength)
	{
		// Per-order return accumulators
		double Value1Return = 0.0;
		double Value2Return = 0.0;
		double Value3Return = 0.0;

		// The loop for each simulation
		for (int j = 0; j < hostSimulationOptions.NumberofSimulation; j++)
		{
			// The loop for each step
			for (int k = 0; k < hostSimulationOptions.NumberofStep; k++)
			{
				// Work out the Value1 return
				for (int index = 0; index < 15; index++)
				{
					Value1Return += purchaseorders[i].PartialValue1[index] * 1;
				}

				// Work out the Value2 return
				for (int index = 0; index < purchaseorders[i].PartialValue2Length; index++)
				{
					Value2Return += purchaseorders[i].PartialValue2[index] * 2;
				}

				// Work out the Value3 return
				for (int index = 0; index < purchaseorders[i].PartialValue3Length; index++)
				{
					Value3Return += purchaseorders[i].PartialValue3[index] * 3;
				}

				/* This line is causing huge overhead */
				purchaseorders[i].TotalReturn += Value1Return + Value2Return + Value3Return;
			}
		}

		i += stride;
	}
}

As I said, taking out this line:

purchaseorders[i].TotalReturn += Value1Return + Value2Return + Value3Return;

makes all of the other code in the kernel redundant. The compiler will probably strip the entire kernel out to be empty without it.

As for the rest of the code, all the reads are uncoalesced. That will have a major impact on performance. Depending on what hardware you are using, you are looking at up to 16x slower memory access just from that.
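To see why, consider the access pattern an array of structures produces: thread i reads purchaseorders[i], so adjacent threads read addresses that are sizeof(PurchaseOrder) bytes apart rather than adjacent words. A rough sketch (the struct layout here is only a guess for illustration):

struct PurchaseOrder            // hypothetical layout, for illustration only
{
	double PartialValue1[15];
	double PartialValue2[15];
	double PartialValue3[15];
	int    PartialValue2Length;
	int    PartialValue3Length;
	double TotalReturn;
};

__global__ void strided_reads(const PurchaseOrder *po, double *out, int n)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	if (i < n)
		out[i] = po[i].PartialValue1[0];   // thread i and thread i+1 read addresses
		                                   // ~sizeof(PurchaseOrder) bytes apart, so a
		                                   // half-warp touches many memory segments
		                                   // instead of one contiguous segment
}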

I see.

What is the best practice for making memory reads and writes coalesced?

Should I not use an array of structures held in global memory? Should I combine the members of each struct instance into separate arrays instead?

That which is discussed in both the programming guide and the best practices guide. Ideally you want half-warps to be reading from contiguous 64, 128 or 256 byte segments of global memory. Anything “worse” than that will trigger extra sequential reads. How many and how much that hurts performance depends on the hardware version you are using.

Arrays of structures are generally hard to use optimally in CUDA. Flat arrays are easier to get coalesced memory access with. The thrust library has a very useful zip operator that can automagically present a host-side structure of arrays as if it were an array of structures. That doesn't mean you must unconditionally use flat arrays, but it does take careful thought and analysis to use arrays of structures well in device code.
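As a sketch of what a structure-of-arrays layout looks like (member names borrowed from your first post; the restructuring itself is an assumption about your data), each member is stored in its own flat array indexed by purchase order, so consecutive threads in a half-warp read consecutive doubles and the accesses coalesce:

struct PurchaseOrdersSoA        // structure of arrays: one flat array per member
{
	double *value1;             // value1[i] belongs to purchase order i
	double *value2;
	double *value3;
	double *total;
};

__global__ void sum_totals(PurchaseOrdersSoA po, int n)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	if (i < n)
		po.total[i] += po.value1[i] + po.value2[i] + po.value3[i];   // coalesced reads and writes
}

On the host side, thrust's zip iterator (thrust::make_zip_iterator over a thrust::make_tuple of the per-member iterators) lets you treat such separate arrays as a single sequence of tuples, which is the "automagic" flattening mentioned above.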
