Dear All,
I’m working on a CFD simulation program, and using only global memory it runs just 5 times faster on the GPU (GTX 295, first card) than on the CPU (3.4 GHz). I have been wondering what the bottleneck of the program could be. For reference, the program transfers no more than 24 MB/s between the CPU and the GPU. A typical kernel header looks like this:
[codebox]__global__ void SolveNavierStokes_K_D(GPU_INT *KsiMax_D, GPU_INT *EtaMax_D,
    GPU_FLOAT *G_Ksi_D, GPU_FLOAT *F_Eta_D,
    GPU_FLOAT *Velocity_0XN_D, GPU_FLOAT *Velocity_0X_D,
    GPU_FLOAT *Velocity_0YN_D, GPU_FLOAT *Velocity_0Y_D,
    GPU_FLOAT *X_Coordinates_D, GPU_FLOAT *Y_Coordinates_D,
    GPU_FLOAT *Pressure_D, GPU_FLOAT *Jacobian_D,
    GPU_FLOAT *Alpha_D, GPU_FLOAT *Gamma_D, GPU_FLOAT *Sigma_D,
    GPU_FLOAT *Velocity_U0_D, GPU_FLOAT *Velocity_U_D,
    GPU_FLOAT *Velocity_V0_D, GPU_FLOAT *Velocity_V_D,
    GPU_FLOAT *DimensionlessTemperature_D, GPU_FLOAT *DimensionlessTemperature0_D,
    GPU_FLOAT *ReynoldsNumber_D, GPU_FLOAT *PrandtNumber_D,
    GPU_FLOAT *dt_TimeStep_D, GPU_FLOAT *RichardsonNumber_D);
[/codebox]
And here is an example of a (really short) kernel:
[codebox]__global__ void ComputeTemperature_K_D(GPU_INT *KsiMax_D, GPU_INT *EtaMax_D,
    GPU_FLOAT *Temperature_D, GPU_FLOAT *DimensionlessTemperature_D,
    GPU_FLOAT *TemperatureWall_D, GPU_FLOAT *TemperatureInfinity_D)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Every thread loads these scalars from global memory.
    GPU_INT KsiMax = KsiMax_D[0];
    GPU_INT EtaMax = EtaMax_D[0];
    GPU_FLOAT TemperatureInfinity = TemperatureInfinity_D[0];

    // Guard both directions so threads outside the grid do not write out of bounds.
    if (row < EtaMax && col < KsiMax)
        Temperature_D[row*KsiMax + col] = DimensionlessTemperature_D[row*KsiMax + col]
            * (TemperatureWall_D[0] - TemperatureInfinity) + TemperatureInfinity;
}[/codebox]
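One thing that may be related: all the scalar parameters (KsiMax_D, EtaMax_D, TemperatureWall_D, TemperatureInfinity_D, ...) are read from global memory by every thread. Just as a sketch of what I mean (the _ByValue name is only for illustration), the same kernel could take them by value instead, so they arrive with the launch and need no per-thread global loads:
[codebox]// Sketch only: same computation, but the scalars are passed by value,
// so no thread has to load them from global memory.
__global__ void ComputeTemperature_ByValue_K_D(GPU_INT KsiMax, GPU_INT EtaMax,
    GPU_FLOAT TemperatureWall, GPU_FLOAT TemperatureInfinity,
    GPU_FLOAT *Temperature_D, const GPU_FLOAT *DimensionlessTemperature_D)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < EtaMax && col < KsiMax)
        Temperature_D[row*KsiMax + col] = DimensionlessTemperature_D[row*KsiMax + col]
            * (TemperatureWall - TemperatureInfinity) + TemperatureInfinity;
}[/codebox]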
The program works with a lot of second-order derivatives and sometimes uses conditions like this:
[codebox]if(row>1 && row<(EtaMax-2))
{
    U_Eta = ...
}
else
{
    ...
}[/codebox]
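If I understand the warp layout correctly, with blockDim.x = 32 every thread of a warp shares the same row, so a condition that depends only on row should not split the warp. Roughly this is the launch shape I have in mind (the block size and the host-side KsiMax/EtaMax copies are just an example):
[codebox]// Example launch shape: blockDim.x = 32 keeps threadIdx.y (and therefore row)
// constant within a warp, so row-only conditions do not diverge inside a warp.
// KsiMax and EtaMax here are host-side copies of the grid dimensions.
dim3 block(32, 8);
dim3 grid((KsiMax + block.x - 1) / block.x,
          (EtaMax + block.y - 1) / block.y);
SolveNavierStokes_K_D<<<grid, block>>>( /* ... device pointers ... */ );[/codebox]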
Are the if-else conditions the biggest problem, or the huge number of global memory accesses?
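To narrow this down on my side, something like the sketch below (placeholder function and launch configuration, the real kernel call commented out) should let me compare the time of a single kernel against the CPU-GPU copies:
[codebox]// Sketch: cudaEvent timers around one kernel launch, so kernel time can be
// compared with the time spent in cudaMemcpy. Grid/block are placeholders and
// the actual kernel call is commented out.
#include <cstdio>
#include <cuda_runtime.h>

void TimeOneSolverStep(dim3 grid, dim3 block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // SolveNavierStokes_K_D<<<grid, block>>>( /* ... device pointers ... */ );
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);     // wait until the kernel has finished

    float kernel_ms = 0.0f;
    cudaEventElapsedTime(&kernel_ms, start, stop);
    printf("kernel time: %.3f ms\n", kernel_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}[/codebox]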
Thanks in advance,
Laszlo Daroczy