global memory update overhead

Hi, we have an array of struct PurchaseOrder in global memory.

typedef struct PurchaseOrder
{
	double value1;
	double value2;
	double value3;
	double total;
} PurchaseOrder;

PurchaseOrder *purchaseorder;

We passed this array of structs into our kernel function by reference (pointer).

__global__ void kernel(PurchaseOrder *purchaseOrder)

In the kernel, we calculate the sum of the struct's members and write the total back into another member of the same struct.

purchaseOrder[0].total += purchaseOrder[0].value1 + purchaseOrder[0].value2 + purchaseOrder[0].value3;

We made the array contain only one PurchaseOrder to avoid multiple threads accessing global memory at the same time.

It runs okay with the line that assigns the total, but it takes 250 ms.

If we comment out that line, it only takes 0.2 ms.

Where is that overhead coming from?

Thanks

The difference between the two execution times is due to compiler optimization. Once you remove the write to global memory, the compiler is smart enough to optimize away all of the preceding calculations leading to that result ("dead code elimination").

It is impossible to say more without seeing some actual code, but it is likely that the underlying poor performance is caused by uncoalesced memory access.
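For illustration, here is a minimal sketch (with made-up kernel names, not your actual code) of what happens: once the result is never stored anywhere observable, the compiler is free to delete the loads and arithmetic that feed it.

__global__ void with_store(const PurchaseOrder *po, double *out)
{
	double sum = po[0].value1 + po[0].value2 + po[0].value3;
	out[0] = sum;   // the store keeps the computation alive
}

__global__ void without_store(const PurchaseOrder *po, double *out)
{
	double sum = po[0].value1 + po[0].value2 + po[0].value3;
	(void)sum;      // no store: 'sum' is dead, so the loads and adds above are
	                // eliminated and the kernel compiles to almost nothing,
	                // which is why the timing collapses to ~0.2 ms
}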

Hi avidday,

Please see the code below. The line with the block comment is causing the problem. I will take the Simulation and Step loops out later (per the advice you gave in another post), but they are both set to 1 at the moment.

__global__ void RunMonteCarlo(PurchaseOrder *purchaseorders, long devPurchaseOrderLength, SimulationOptions hostSimulationOptions)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	int stride = blockDim.x * gridDim.x;

	while (i < devPurchaseOrderLength)
	{
		// Per-order return accumulators
		double Value1Return = 0.0;
		double Value2Return = 0.0;
		double Value3Return = 0.0;

		// The loop for each simulation
		for (int j = 0; j < hostSimulationOptions.NumberofSimulation; j++)
		{
			// The loop for each step
			for (int k = 0; k < hostSimulationOptions.NumberofStep; k++)
			{
				// Work out the Value1 return
				for (int index = 0; index < 15; index++)
				{
					Value1Return += purchaseorders[i].PartialValue1[index] * 1;
				}

				// Work out the Value2 return
				for (int index = 0; index < purchaseorders[i].PartialValue2Length; index++)
				{
					Value2Return += purchaseorders[i].PartialValue2[index] * 2;
				}

				// Work out the Value3 return
				for (int index = 0; index < purchaseorders[i].PartialValue3Length; index++)
				{
					Value3Return += purchaseorders[i].PartialValue3[index] * 3;
				}

				/* This line is causing huge overhead */
				purchaseorders[i].TotalReturn += Value1Return + Value2Return + Value3Return;
			}
		}

		i += stride;
	}
}

As I said, taking out this line:

purchaseorders[i].TotalReturn += Value1Return + Value2Return + Value3Return;

makes all of the other code in the kernel redundant. The compiler will probably strip the entire kernel out to be empty without it.

As for the rest of the code, all the reads are uncoalesced. That will have a major impact on performance. Depending on what hardware you are using, you are looking at up to 16x slower memory access just from that.
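To see why, consider the access pattern an array of structures produces: thread i reads purchaseorders[i], so adjacent threads read addresses that are sizeof(PurchaseOrder) bytes apart rather than adjacent words. A rough sketch (the struct layout here is only a guess for illustration):

struct PurchaseOrder            // hypothetical layout, for illustration only
{
	double PartialValue1[15];
	double PartialValue2[15];
	double PartialValue3[15];
	int    PartialValue2Length;
	int    PartialValue3Length;
	double TotalReturn;
};

__global__ void strided_reads(const PurchaseOrder *po, double *out, int n)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	if (i < n)
		out[i] = po[i].PartialValue1[0];   // thread i and thread i+1 read addresses
		                                   // ~sizeof(PurchaseOrder) bytes apart, so a
		                                   // half-warp touches many memory segments
		                                   // instead of one contiguous segment
}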

I see.

What is the best practice for making memory reads and writes coalesced?

Should I not use an array of structures held in global memory? Should I combine the members of each struct instance into separate arrays instead?

That which is discussed in both the programming guide and the best practices guide. Ideally you want half-warps to be reading from contiguous 64, 128 or 256 byte segments of global memory. Anything “worse” than that will trigger extra sequential reads. How many and how much that hurts performance depends on the hardware version you are using.

Arrays of structures are generally hard to use optimally in CUDA. Flat arrays are easier to get coalesced memory access with. The thrust library has a very useful zip operator that can automagically present a host-side structure of arrays as if it were an array of structures. That doesn't mean you must unconditionally use flat arrays, but it does take careful thought and analysis to use arrays of structures well in device code.
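As a sketch of what a structure-of-arrays layout looks like (member names borrowed from your first post; the restructuring itself is an assumption about your data), each member is stored in its own flat array indexed by purchase order, so consecutive threads in a half-warp read consecutive doubles and the accesses coalesce:

struct PurchaseOrdersSoA        // structure of arrays: one flat array per member
{
	double *value1;             // value1[i] belongs to purchase order i
	double *value2;
	double *value3;
	double *total;
};

__global__ void sum_totals(PurchaseOrdersSoA po, int n)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	if (i < n)
		po.total[i] += po.value1[i] + po.value2[i] + po.value3[i];   // coalesced reads and writes
}

On the host side, thrust's zip iterator (thrust::make_zip_iterator over a thrust::make_tuple of the per-member iterators) lets you treat such separate arrays as a single sequence of tuples, which is the "automagic" flattening mentioned above.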
