The difference in the two execution times is due to compiler optimization. Once the write to global memory is removed from the code, the compiler is smart enough to optimize away all of the preceding calculations that lead to the result (“dead code elimination”).
It is impossible to say more without seeing some actual code, but the underlying poor performance is most likely caused by uncoalesced memory access.
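To illustrate the effect (a minimal sketch with made-up names, not your actual kernel):

```cuda
// Hypothetical kernel, for illustration only.
__global__ void SumKernel(const float *in, float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
    {
        float acc = 0.0f;
        for (int k = 0; k < 100; k++)
            acc += in[i] * k;   // expensive-looking work...

        out[i] = acc;           // ...is only kept because of this store.
        // If this store is commented out, acc is never observable, so the
        // compiler is free to delete the whole loop as dead code. The kernel
        // then appears dramatically faster, but it is computing nothing.
    }
}
```

So the “slow” line is not slow in itself; it is simply the only line that forces the rest of the kernel to actually execute.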
Please see the code below. The line with the block comment is causing the problem. I will take the Simulation and Steps loops out later (per the advice you gave in another post), but both counts are set to 1 at the moment.
__global__ void RunMonteCarlo(PurchaseOrder *purchaseorders, long devPurchaseOrderLength, SimulationOptions hostSimulationOptions)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < devPurchaseOrderLength)
    {
        float Value1Return = 0.0f;
        float Value2Return = 0.0f;
        float Value3Return = 0.0f;
        // The loop for each simulation
        for (int j = 0; j < hostSimulationOptions.NumberofSimulation; j++)
        {
            // The loop for each step
            for (int k = 0; k < hostSimulationOptions.NumberofStep; k++)
            {
                // work out the Value1 return
                for (int index = 0; index < 15; index++)
                {
                    Value1Return += purchaseorders[i].PartialValue1[index] * 1;
                }
                // work out the Value2 return
                for (int index = 0; index < purchaseorders[i].PartialValue2Length; index++)
                {
                    Value2Return += purchaseorders[i].PartialValue2[index] * 2;
                }
                // work out the Value3 return
                for (int index = 0; index < purchaseorders[i].PartialValue3Length; index++)
                {
                    Value3Return += purchaseorders[i].PartialValue3[index] * 3;
                }
                /* This line is causing huge overhead */
                purchaseorders[i].TotalReturn += Value1Return + Value2Return + Value3Return;
            }
        }
        i += stride;
    }
}
That write to global memory makes all of the other code in the kernel redundant. Without it, the compiler will probably strip the entire kernel out to be empty.
As for the rest of the code, all the reads are uncoalesced. That will have a major impact on performance. Depending on what hardware you are using, you are looking at up to 16x slower memory access just from that.
What is the best practice for making read and write access to memory coalesced?
Should I keep an array of structures in global memory, or shall I combine the members of the individual struct instances into separate arrays?
Coalescing is discussed in both the programming guide and the best practices guide. Ideally you want half-warps to be reading from contiguous 64, 128 or 256 byte segments of global memory. Anything “worse” than that will trigger extra sequential reads. How many, and how much that hurts performance, depends on the hardware version you are using.
Arrays of structures are generally hard to use optimally in CUDA. Flat arrays are easier to get coalesced memory access with. The thrust library has a very useful zip iterator that can automagically present host-side structure-of-arrays storage as if it were an array of structures. That doesn’t mean you must unconditionally use flat arrays, but using arrays of structures well in device code requires careful thought and analysis.
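As a sketch of what the structure-of-arrays layout looks like (the names mirror your kernel, but this restructuring is hypothetical and only shows one member of each kind):

```cuda
// Hypothetical SoA layout: each member of PurchaseOrder becomes its own flat array.
struct PurchaseOrdersSoA
{
    float *TotalReturn;    // one entry per order
    float *PartialValue1;  // 15 entries per order, stored index-major (see below)
};

__global__ void RunMonteCarloSoA(PurchaseOrdersSoA orders, long n)
{
    long i = threadIdx.x + blockIdx.x * blockDim.x;
    long stride = blockDim.x * gridDim.x;
    for (; i < n; i += stride)
    {
        float acc = 0.0f;
        for (int index = 0; index < 15; index++)
        {
            // For a fixed index, consecutive threads read consecutive
            // elements, so the loads within a half-warp can be coalesced.
            acc += orders.PartialValue1[index * n + i];
        }
        // Coalesced write: thread i touches element i.
        orders.TotalReturn[i] += acc;
    }
}
```

The key point is the `index * n + i` indexing: the per-order values are interleaved so that, at each step of the inner loop, neighbouring threads hit neighbouring addresses, whereas with an array of structures each thread strides through memory by `sizeof(PurchaseOrder)`.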