I’d guess Christian wanted to suggest not doing that. How does the GTX570 perform with code compiled for sm_20?

Maybe the GTX570 performs worse because of higher precision routines used on compute capability 2.0. How does your code do with -use_fast_math?

I have used sm_20 for the GTX570 and sm_13 for the GTX285. I’m not using fast math, but the interesting thing is that only some of my functions behave this way.

For example, the following functions take 28 seconds on the GTX285 (sm_13) and 16 seconds on the GTX570. I’m calling these functions consecutively 1024 times. Notice that I use the sqrt() and exp() functions, so I don’t think use_fast_math would help.

extern "C" __global__ void EM_Update_Mean_Var_Weight2_PartOne(int T, int M, int D, int k, double* det_sigmaArray, int det_sigmaArrayLen0, double* model_var, int model_varLen0, double* model_parm, int model_parmLen0, double* model_mean, int model_meanLen0, double* model_weight, int model_weightLen0, double* sum_p, int sum_pLen0, double sumPI, double* Sum_Mean, int Sum_MeanLen0, double* Sum_Var, int Sum_VarLen0, double* SUM_P_T, int SUM_P_TLen0)
{
    // One thread per sample.
    int num = blockIdx.x * blockDim.x + threadIdx.x;
    if (num < T)
    {
        // Variance-scaled squared distance between sample num and the mean of mixture k.
        double num2 = 0.0;
        for (int i = 0; i < D; i++)
        {
            double num3 = model_parm[num * D + i] - model_mean[k * D + i];
            num2 += num3 * num3 / model_var[k * D + i];
        }
        // Weighted Gaussian term of mixture k for this sample.
        double num4 = model_weight[k] * exp(-0.5 * num2) / (sumPI * sqrt(det_sigmaArray[k]));
        // Every thread adds into the same global accumulators.
        SUM_P_T[0] += num4 / sum_p[num];
        for (int i = 0; i < D; i++)
        {
            Sum_Mean[k * D + i] += num4 / sum_p[num] * model_parm[num * D + i];
            Sum_Var[k * D + i] += num4 / sum_p[num] * (model_parm[num * D + i] * model_parm[num * D + i]);
        }
    }
}
extern "C" __global__ void EM_Update_Mean_Var_Weight2_PartTwo(int T, int M, int D, int k, double* det_sigmaArray, int det_sigmaArrayLen0, double* model_var, int model_varLen0, double* model_parm, int model_parmLen0, double* model_mean, int model_meanLen0, double* model_weight, int model_weightLen0, double* sum_p, int sum_pLen0, double sumPI, double* Sum_Mean, int Sum_MeanLen0, double* Sum_Var, int Sum_VarLen0, double* SUM_P_T, int SUM_P_TLen0)
{
    // One thread per dimension.
    int num = blockIdx.x * blockDim.x + threadIdx.x;
    if (num < D)
    {
        model_mean[k * D + num] = Sum_Mean[num] / SUM_P_T[0];
        model_var[k * D + num] = Sum_Var[num] / SUM_P_T[0] - model_mean[k * D + num] * model_mean[k * D + num];
        // Clamp the variance to a floor of roughly 0.01.
        if (model_var[k * D + num] <= 0.0099999997764825821)
        {
            model_var[k * D + num] = 0.0099999997764825821;
        }
    }
}
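A side note on PartOne as posted: every thread executes `SUM_P_T[0] += ...` on the same address with no synchronization, so concurrent updates can overwrite each other. Below is a minimal sketch of one common way to accumulate such a sum safely: a shared-memory reduction per block followed by one atomic add per block. The names `atomicAddDouble`, `sumRatios` and `sumRatiosOnDevice` are hypothetical, the input arrays are stand-ins for `num4` and `sum_p`, and the atomicCAS loop is the usual workaround for GPUs without a native double-precision atomicAdd. It is an illustration under those assumptions, not a drop-in replacement for the kernels above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // must be a power of two for the tree reduction below

// Double-precision atomic add emulated with atomicCAS.
// Named atomicAddDouble (hypothetical) to avoid clashing with the built-in on newer GPUs.
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

// Sums p[i] / sum_p[i] over all samples into *total.
__global__ void sumRatios(const double* p, const double* sum_p, int T, double* total)
{
    __shared__ double partial[BLOCK_SIZE];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    partial[tid] = (idx < T) ? p[idx] / sum_p[idx] : 0.0;
    __syncthreads();
    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    // One atomic update per block instead of one racy update per thread.
    if (tid == 0)
        atomicAddDouble(total, partial[0]);
}

// Hypothetical host wrapper: copies the inputs, launches the kernel, returns the sum.
double sumRatiosOnDevice(const std::vector<double>& p, const std::vector<double>& sum_p)
{
    int T = (int)p.size();
    size_t bytes = T * sizeof(double);
    double *d_p, *d_sum_p, *d_total;
    cudaMalloc(&d_p, bytes);
    cudaMalloc(&d_sum_p, bytes);
    cudaMalloc(&d_total, sizeof(double));
    cudaMemcpy(d_p, p.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_sum_p, sum_p.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(double));  // all-zero bytes represent 0.0

    int blocks = (T + BLOCK_SIZE - 1) / BLOCK_SIZE;
    sumRatios<<<blocks, BLOCK_SIZE>>>(d_p, d_sum_p, T, d_total);

    double total = 0.0;
    cudaMemcpy(&total, d_total, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_p);
    cudaFree(d_sum_p);
    cudaFree(d_total);
    return total;
}

int main()
{
    std::vector<double> p(1000, 1.0), sum_p(1000, 2.0);
    printf("total = %f\n", sumRatiosOnDevice(p, sum_p));
    return 0;
}
```

The same pattern would apply to the `Sum_Mean` and `Sum_Var` accumulators, with one reduction per dimension.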

And what do you mean by higher precision routines? Are you talking about ECC?

Here at work we often run our code on Fermi devices, compiled with CUDA SDK 2.3 and targeting sm_10 (with PTX code embedded in the binary). This is because we don’t need any of the new features provided by CUDA SDKs 3.x and 4.0.

And we find that performance scales as expected with the number of CUDA cores (the exception being sm_21 devices, where the final third of the CUDA cores is not always used for architectural reasons).

No need for us to upgrade the SDK at the moment. We’re not seeing any performance degradation, and hence my suggestion that he should try targeting sm_13.
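For anyone who wants to check the core counts being compared, here is a small sketch that queries each device and estimates its CUDA core count. The helper `coresPerSM` is hypothetical and only covers the architectures discussed in this thread (8 cores per multiprocessor on 1.x, 32 on 2.0, 48 on 2.1); treat the result as a rough figure, not a performance prediction.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: CUDA cores per multiprocessor for compute capability 1.x and 2.x.
// Returns -1 for architectures this sketch does not know about.
int coresPerSM(int major, int minor)
{
    if (major == 1) return 8;
    if (major == 2 && minor == 0) return 32;
    if (major == 2 && minor == 1) return 48;
    return -1;
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 0;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        int perSM = coresPerSM(prop.major, prop.minor);
        printf("%s: compute capability %d.%d, %d multiprocessors",
               prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        if (perSM > 0)
            printf(", about %d CUDA cores\n", perSM * prop.multiProcessorCount);
        else
            printf(", cores per multiprocessor unknown to this sketch\n");
    }
    return 0;
}
```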

This is basically the structure I’m trying to access:

For each value of T (samples)
    For each value of M (mixtures in the sample)
        For each value of D (dimensions (features) of that sample)
            Do something.
    Calculate Sum_p for that sample
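The loop structure above could be mapped to a kernel roughly as follows: one thread per sample, with the loops over M and D running inside the thread. This is only a sketch. It assumes that "do something" is the same distance and Gaussian term used in PartOne, that Sum_p is the sum of those terms over all mixtures, and that the arrays are laid out row-major as in the posted code. The names `computeSumP` and `computeSumPOnDevice` are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per sample t; loops over mixtures m and dimensions d run inside the thread.
__global__ void computeSumP(int T, int M, int D,
                            const double* model_parm,     // T x D samples
                            const double* model_mean,     // M x D means
                            const double* model_var,      // M x D variances
                            const double* model_weight,   // M weights
                            const double* det_sigmaArray, // M determinants
                            double sumPI, double* sum_p)  // T outputs
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= T) return;
    double total = 0.0;
    for (int m = 0; m < M; ++m) {
        double dist = 0.0;
        for (int d = 0; d < D; ++d) {
            double diff = model_parm[t * D + d] - model_mean[m * D + d];
            dist += diff * diff / model_var[m * D + d];
        }
        total += model_weight[m] * exp(-0.5 * dist) / (sumPI * sqrt(det_sigmaArray[m]));
    }
    sum_p[t] = total;  // each thread writes only its own element, so no race
}

// Hypothetical helper: allocates a device copy of a host vector.
static double* toDevice(const std::vector<double>& v)
{
    double* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(double));
    cudaMemcpy(d, v.data(), v.size() * sizeof(double), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper around the kernel.
std::vector<double> computeSumPOnDevice(int T, int M, int D,
        const std::vector<double>& parm, const std::vector<double>& mean,
        const std::vector<double>& var, const std::vector<double>& weight,
        const std::vector<double>& det, double sumPI)
{
    double *d_parm = toDevice(parm), *d_mean = toDevice(mean), *d_var = toDevice(var);
    double *d_weight = toDevice(weight), *d_det = toDevice(det), *d_sum_p = nullptr;
    cudaMalloc(&d_sum_p, T * sizeof(double));

    int threads = 128;
    computeSumP<<<(T + threads - 1) / threads, threads>>>(
        T, M, D, d_parm, d_mean, d_var, d_weight, d_det, sumPI, d_sum_p);

    std::vector<double> sum_p(T);
    cudaMemcpy(sum_p.data(), d_sum_p, T * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_parm); cudaFree(d_mean); cudaFree(d_var);
    cudaFree(d_weight); cudaFree(d_det); cudaFree(d_sum_p);
    return sum_p;
}

int main()
{
    const int T = 4, M = 2, D = 3;
    // Samples sit exactly on the means, so each mixture contributes its weight.
    std::vector<double> parm(T * D, 0.0), mean(M * D, 0.0), var(M * D, 1.0);
    std::vector<double> weight(M, 0.5), det(M, 1.0);
    std::vector<double> sum_p = computeSumPOnDevice(T, M, D, parm, mean, var, weight, det, 1.0);
    for (int t = 0; t < T; ++t)
        printf("sum_p[%d] = %f\n", t, sum_p[t]);
    return 0;
}
```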

The reason for decreasing the number of threads was this seminar: “Better Performance at Lower Occupancy” (GTC 2010).

I tried it and it worked. I’m currently under a time constraint, so maybe after August I will change my implementation to use shared memory.
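As I understand that seminar, one of its ideas is to run fewer threads and give each thread several independent pieces of work, so that latency is hidden by instruction-level parallelism instead of by occupancy. A minimal sketch of that pattern on a trivial scaling kernel follows. The names `scaleILP`, `scaleOnDevice` and `ITEMS_PER_THREAD` are hypothetical, and whether this helps a given kernel has to be measured.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define ITEMS_PER_THREAD 4  // independent elements handled by each thread

// Each thread processes ITEMS_PER_THREAD elements spaced blockDim.x apart,
// so neighbouring threads still touch neighbouring addresses.
__global__ void scaleILP(const double* in, double* out, int n, double factor)
{
    int base = blockIdx.x * blockDim.x * ITEMS_PER_THREAD + threadIdx.x;
    #pragma unroll
    for (int j = 0; j < ITEMS_PER_THREAD; ++j) {
        int idx = base + j * blockDim.x;
        if (idx < n)
            out[idx] = factor * in[idx];
    }
}

// Hypothetical host wrapper: launches with small blocks of 64 threads.
std::vector<double> scaleOnDevice(const std::vector<double>& in, double factor)
{
    int n = (int)in.size();
    size_t bytes = n * sizeof(double);
    double *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 64;
    int perBlock = threads * ITEMS_PER_THREAD;
    int blocks = (n + perBlock - 1) / perBlock;
    scaleILP<<<blocks, threads>>>(d_in, d_out, n, factor);

    std::vector<double> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}

int main()
{
    std::vector<double> in(1000);
    for (int i = 0; i < 1000; ++i) in[i] = i;
    std::vector<double> out = scaleOnDevice(in, 2.0);
    printf("out[0] = %f, out[999] = %f\n", out[0], out[999]);
    return 0;
}
```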

Thanks. But does this mean that with the lower number of threads per block I’m currently using, I can get higher speeds by solving those issues? Or do you mean I can get the same speedup with other approaches (apart from shared memory, which I know gives much faster results)?