Why is the program only 5 times faster? Please help!!!!

Dear All,

I’m working on a CFD simulation program, and using global memory it is only about 5 times faster (on the first GPU of a GTX 295) than on the CPU (3.4 GHz). What could be the bottleneck of the program? For reference, the program does not send more than 24 MB/s of data over the CPU-GPU link. A typical kernel’s header looks like this:

[codebox]__global__ void SolveNavierStokes_K_D(GPU_INT *KsiMax_D, GPU_INT *EtaMax_D,
	GPU_FLOAT *G_Ksi_D, GPU_FLOAT *F_Eta_D,
	GPU_FLOAT *Velocity_0XN_D, GPU_FLOAT *Velocity_0X_D,
	GPU_FLOAT *Velocity_0YN_D, GPU_FLOAT *Velocity_0Y_D,
	GPU_FLOAT *X_Coordinates_D, GPU_FLOAT *Y_Coordinates_D,
	GPU_FLOAT *Pressure_D, GPU_FLOAT *Jacobian_D,
	GPU_FLOAT *Alpha_D, GPU_FLOAT *Gamma_D, GPU_FLOAT *Sigma_D,
	GPU_FLOAT *Velocity_U0_D, GPU_FLOAT *Velocity_U_D,
	GPU_FLOAT *Velocity_V0_D, GPU_FLOAT *Velocity_V_D,
	GPU_FLOAT *DimensionlessTemperature_D, GPU_FLOAT *DimensionlessTemperature0_D,
	GPU_FLOAT *ReynoldsNumber_D, GPU_FLOAT *PrandtNumber_D,
	GPU_FLOAT *dt_TimeStep_D, GPU_FLOAT *RichardsonNumber_D);[/codebox]

And here is an example of a (really short) kernel:

[codebox]__global__ void ComputeTemperature_K_D(GPU_INT *KsiMax_D, GPU_INT *EtaMax_D,
	GPU_FLOAT *Temperature_D, GPU_FLOAT *DimensionlessTemperature_D,
	GPU_FLOAT *TemperatureWall_D, GPU_FLOAT *TemperatureInfinity_D)
{
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	GPU_INT KsiMax = KsiMax_D[0];
	GPU_INT EtaMax = EtaMax_D[0];
	GPU_FLOAT TemperatureInfinity = TemperatureInfinity_D[0];
	if (row < EtaMax)
		Temperature_D[row*KsiMax+col] = DimensionlessTemperature_D[row*KsiMax+col]
			* (TemperatureWall_D[0] - TemperatureInfinity) + TemperatureInfinity;
}[/codebox]
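One thing worth noting about this kernel: every thread re-reads KsiMax_D[0], EtaMax_D[0], and the temperature scalars from global memory. A possible rewrite (just a sketch, keeping the original names, and assuming these values are constant for the duration of a launch as they appear to be) passes them by value as kernel arguments and also guards col against KsiMax:

[codebox]// Sketch only: per-launch constants passed by value instead of through
// device pointers, so no thread reads global memory just to obtain KsiMax
// or TemperatureWall. GPU_INT/GPU_FLOAT are assumed to be the usual typedefs.
__global__ void ComputeTemperature_K_D(GPU_INT KsiMax, GPU_INT EtaMax,
	GPU_FLOAT *Temperature_D,
	const GPU_FLOAT *DimensionlessTemperature_D,
	GPU_FLOAT TemperatureWall, GPU_FLOAT TemperatureInfinity)
{
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	if (row < EtaMax && col < KsiMax)   // note the extra col guard
		Temperature_D[row * KsiMax + col] =
			DimensionlessTemperature_D[row * KsiMax + col]
			* (TemperatureWall - TemperatureInfinity) + TemperatureInfinity;
}[/codebox]

Kernel arguments are delivered to every thread at launch without per-thread global loads, so this removes several global reads per thread at no cost.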

The program works with a lot of second-order derivatives, and sometimes uses conditions like this:

[codebox]if (row > 1 && row < (EtaMax-2))
{
	U_Eta = ...
}
else
{[/codebox]

Are the if-else conditions the biggest problem, or is it the huge number of accesses to global memory?

Thanks in advance,

                                               Laszlo Daroczy

Try profiling the code. That will give you a much better idea of the memory access and throughput you are getting and how much serialization is occurring because of branching. If both implementations are truly memory bandwidth bound, then you are probably looking at an upper bound of a 6-10 times speedup (depending on host and device memory performance). For what it is worth, I can get something like a 10-15 times speedup solving transient convection-diffusion equations in 3D using a fully second order explicit SSP Runge-Kutta method over a reasonably good OpenMP implementation built with Intel’s compiler.
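If the profiler does show the second-order difference kernels are dominated by global memory traffic, the usual remedy is to stage each block's stencil neighbourhood in shared memory so each value is loaded from global memory once per block instead of once per thread that touches it. A rough sketch (hypothetical names and a 16x16 block; a derivative in the eta direction only, not code from the original program):

[codebox]// Hypothetical shared-memory tile for a second-order central difference
// along eta; U_D, U_EtaEta_D and dEta are illustrative names.
#define TILE 16
__global__ void SecondDerivative_Eta(const float *U_D, float *U_EtaEta_D,
	int KsiMax, int EtaMax, float dEta)
{
	__shared__ float tile[TILE + 2][TILE];  // TILE rows plus one halo row each side
	int col = blockIdx.x * TILE + threadIdx.x;
	int row = blockIdx.y * TILE + threadIdx.y;
	bool inside = (col < KsiMax && row < EtaMax);

	// each thread loads its own element; edge threads also load the halo rows
	if (inside) {
		tile[threadIdx.y + 1][threadIdx.x] = U_D[row * KsiMax + col];
		if (threadIdx.y == 0 && row > 0)
			tile[0][threadIdx.x] = U_D[(row - 1) * KsiMax + col];
		if (threadIdx.y == TILE - 1 && row + 1 < EtaMax)
			tile[TILE + 1][threadIdx.x] = U_D[(row + 1) * KsiMax + col];
	}
	__syncthreads();  // every thread in the block reaches this barrier

	if (inside && row > 0 && row + 1 < EtaMax)
		U_EtaEta_D[row * KsiMax + col] =
			(tile[threadIdx.y + 2][threadIdx.x]
			 - 2.0f * tile[threadIdx.y + 1][threadIdx.x]
			 + tile[threadIdx.y][threadIdx.x]) / (dEta * dEta);
}[/codebox]

Also note that branch divergence only costs anything within a 32-thread warp; with a condition on row and a blockDim.x of 16 or more, most warps take the same branch uniformly, so the global memory accesses are far more likely to be your bottleneck than the if-else conditions.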