Why does it run faster on GTX 285 than GTX 570?

Hi.

I have the following function. When I run it on a GTX 285 it takes 2 seconds, but when I run it on a GTX 570 it takes 5 seconds!

I use 512 threads per block on both, but I also tried 1024 and 256 on the GTX 570, and 256 on the GTX 285. The results don't change!

Most arrays I pass are of type double.

The parameters in the problem are:

T = 350000

M = 1024

D = 16

I know this has something to do with the architecture, but I cannot figure out what. Can anyone tell me what is causing this?

Thanks,

extern "C" __global__ void CREATE_SUM_P_CUDA(int T, int M, int D, double* det_sigmaArray, int det_sigmaArrayLen0, double* model_var, int model_varLen0, double* model_parm, int model_parmLen0, double* model_mean, int model_meanLen0, double* model_weight, int model_weightLen0, double* sum_p, int sum_pLen0, double sumPI)

{

	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if (i < T)

	{

		sum_p[i] = 0.0;

		for (int k = 0; k < M; k++)

		{

			double sum = 0.0;

			for (int j = 0; j < D; j++)

			{

				double temp = model_parm[i * D + j] - model_mean[k * D + j];

				sum += temp * temp / model_var[k * D + j];

			}

			sum_p[i] += model_weight[k] * exp(-0.5 * sum) / (sumPI * sqrt(det_sigmaArray[k]));

		}

	}

}

Have you tried targeting the GTX 570 with SM_13 (compute capability 1.3 code) as well?

I just did. Still the same. 5 seconds.

I'd guess Christian wanted to suggest not doing that. How does the GTX 570 perform with code compiled for sm_20?

Maybe the GTX 570 performs worse because of the higher-precision routines used on compute capability 2.0. How does your code do with --use_fast_math?
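
For reference, that's just a compiler flag; assuming your source file is kernels.cu, something like:

nvcc -arch=sm_20 --use_fast_math -o kernels kernels.cu

It mainly affects single-precision arithmetic, so it's cheap to try.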

I have used sm_20 for the GTX 570 and sm_13 for the GTX 285. I'm not using fast math, but the interesting thing is that only some of my functions behave this way.

For example, the following functions take 28 seconds on the GTX 285 (sm_13) and 16 seconds on the GTX 570. I'm calling these functions consecutively 1024 times. Notice that my sqrt() and exp() calls operate on doubles, which use_fast_math does not speed up, so I don't think it would help.

extern "C" __global__ void EM_Update_Mean_Var_Weight2_PartOne(int T, int M, int D, int k, double* det_sigmaArray, int det_sigmaArrayLen0, double* model_var, int model_varLen0, double* model_parm, int model_parmLen0, double* model_mean, int model_meanLen0, double* model_weight, int model_weightLen0, double* sum_p, int sum_pLen0, double sumPI, double* Sum_Mean, int Sum_MeanLen0, double* Sum_Var, int Sum_VarLen0, double* SUM_P_T, int SUM_P_TLen0)

{

	int num = blockIdx.x * blockDim.x + threadIdx.x;

	if (num < T)

	{

		double num2 = 0.0;

		for (int i = 0; i < D; i++)

		{

			double num3 = model_parm[num * D + i] - model_mean[k * D + i];

			num2 += num3 * num3 / model_var[k * D + i];

		}

		double num4 = model_weight[k] * exp(-0.5 * num2) / (sumPI * sqrt(det_sigmaArray[k]));

		SUM_P_T[0] += num4 / sum_p[num];

		for (int i = 0; i < D; i++)

		{

			Sum_Mean[k * D + i] += num4 / sum_p[num] * model_parm[num * D + i];

			Sum_Var[k * D + i] += num4 / sum_p[num] * (model_parm[num * D + i] * model_parm[num * D + i]);

		}

	}

}
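
(Aside on PartOne: the += updates to SUM_P_T[0], Sum_Mean and Sum_Var are performed concurrently by all threads without synchronization, so they race. If these sums matter, one well-known way to make a double-precision add atomic on both sm_13 and sm_20 is the atomicCAS loop from the CUDA C Programming Guide; the helper name below is made up:)

__device__ double atomicAddDouble(double* address, double val)
{
	// emulate a double-precision atomicAdd via 64-bit compare-and-swap
	unsigned long long int* address_as_ull = (unsigned long long int*)address;
	unsigned long long int old = *address_as_ull, assumed;
	do {
		assumed = old;
		old = atomicCAS(address_as_ull, assumed,
		                __double_as_longlong(val + __longlong_as_double(assumed)));
	} while (assumed != old);
	return __longlong_as_double(old);
}

// usage inside PartOne, for example:
//     atomicAddDouble(&SUM_P_T[0], num4 / sum_p[num]);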

extern "C" __global__ void EM_Update_Mean_Var_Weight2_PartTwo(int T, int M, int D, int k, double* det_sigmaArray, int det_sigmaArrayLen0, double* model_var, int model_varLen0, double* model_parm, int model_parmLen0, double* model_mean, int model_meanLen0, double* model_weight, int model_weightLen0, double* sum_p, int sum_pLen0, double sumPI, double* Sum_Mean, int Sum_MeanLen0, double* Sum_Var, int Sum_VarLen0, double* SUM_P_T, int SUM_P_TLen0)

{

	int num = blockIdx.x * blockDim.x + threadIdx.x;

	if (num < D)

	{

		model_mean[k * D + num] = Sum_Mean[num] / SUM_P_T[0];

		model_var[k * D + num] = Sum_Var[num] / SUM_P_T[0] - model_mean[k * D + num] * model_mean[k * D + num];

		if (model_var[k * D + num] <= 0.0099999997764825821)

		{

			model_var[k * D + num] = 0.0099999997764825821;

		}

	}

}

And what do you mean by higher-precision routines? Are you talking about ECC?

No, I meant what I said.

Here at work we often run our code on Fermi devices with binaries compiled on CUDA SDK 2.3, targeting sm_10 (with embedded PTX code in the binary), because we don't need any of the new features provided by CUDA SDKs 3.x and 4.0.

And we find that performance scales as expected with the number of CUDA cores (the exception being SM_21 devices where the final third of the CUDA cores are not always used for architectural reasons).

No need for us to upgrade the SDK at the moment, as we're not seeing any performance degradation; hence my suggestion that he try targeting sm_13.

Christian

You may be right. But as I said, I still get the same timings with both sm_13 and sm_20.

I'm occupying less than 300 MB of device memory with my data, so I know I'm not running out of memory. What else can you guys suggest looking at?

Investigate if you have a case of memory partition camping here.

Guys I figured it out.

It seems that because I have many memory fetches per thread, having fewer threads per block can improve the result.

So I tried various block dimensions; here are the results:

On GTX 570

Threads per block    Time taken
-----------------    ----------
1024                 5 secs
512                  5 secs
256                  5 secs
192                  5 secs
128                  5 secs
64                   5 secs
40                   3 secs
20                   2 secs
16                   3 secs
10                   4 secs (yes, I dared trying)

So you see, this really affects the performance.

My other function also went from 20 seconds with 512 threads to 8 seconds with 20 threads!

Thank you for pointing out the camping, cbuchner1.

Maybe consider changing the memory layout of your data structures (padding, etc.) instead of reducing occupancy - roughly like the sketch below.

Going to 20 threads per block seems weird.
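
For illustration, the padding idea could look something like this (PAD = 2 is a made-up value; the right amount depends on the device's partition width):

#include <cuda_runtime.h>

// Pad each D-element row to D_PAD elements so that consecutive rows start
// in different memory partitions; the extra PAD doubles per row are unused.
enum { D = 16, M = 1024, PAD = 2, D_PAD = D + PAD };

double* alloc_model_var_padded(void)
{
	double* model_var = 0;
	cudaMalloc((void**)&model_var, (size_t)M * D_PAD * sizeof(double));
	return model_var;
}

// the kernels then index with the padded pitch instead of D, e.g.:
//     sum += temp * temp / model_var[k * D_PAD + j];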

Is that intentional or just a typo:

for (int k = 0; k < M; i++)

Didn’t you mean:

for (int k = 0; k < M; k++)

Also, your memory access pattern seems weird - maybe try shared memory, or change the data layout as Christian suggested.
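
For example, every thread in a block reads the same model_mean/model_var row for a given k, so you could stage that row in shared memory once per block. A rough, untested sketch of your first kernel (length parameters dropped for brevity):

extern "C" __global__ void CREATE_SUM_P_CUDA_shared(int T, int M, int D, double* det_sigmaArray, double* model_var, double* model_parm, double* model_mean, double* model_weight, double* sum_p, double sumPI)
{
	extern __shared__ double s[];	// 2 * D doubles, size passed at launch
	double* s_mean = s;
	double* s_var = s + D;
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	double acc = 0.0;
	for (int k = 0; k < M; k++)
	{
		// cooperatively load the k-th mean/var row once per block
		for (int j = threadIdx.x; j < D; j += blockDim.x)
		{
			s_mean[j] = model_mean[k * D + j];
			s_var[j] = model_var[k * D + j];
		}
		__syncthreads();
		if (i < T)
		{
			double sum = 0.0;
			for (int j = 0; j < D; j++)
			{
				double temp = model_parm[i * D + j] - s_mean[j];
				sum += temp * temp / s_var[j];
			}
			acc += model_weight[k] * exp(-0.5 * sum) / (sumPI * sqrt(det_sigmaArray[k]));
		}
		__syncthreads();	// before the next k overwrites s[]
	}
	if (i < T)
		sum_p[i] = acc;
}

You'd launch it with the dynamic shared memory size, e.g. CREATE_SUM_P_CUDA_shared<<<blocks, threads, 2 * D * sizeof(double)>>>(...).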

eyal

That was a typo, thank you.

This is basically the structure I’m trying to access:

For each value of T (samples)

     For each value of M (mixtures in the sample)

         For each value of D (dimensions/features of that sample)

             Do something.

     Calculate sum_p for that sample

The reason for decreasing the number of threads was the GTC 2010 talk "Better Performance at Lower Occupancy".

I tried it and it worked. Currently I'm under a time constraint, so maybe after August I will change my implementation to use shared memory.

But thank you.

"Better Performance at Lower Occupancy" doesn't mean that if you decrease the threads per block you'll get more performance. You still need to do the same amount of work, now with fewer threads. This is what Volkov meant - you decrease the number of threads but give each thread more work, so you end up with the same amount of work being done.
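
Applied to your first kernel, the idea could look roughly like this (untested sketch; each thread walks several samples via a grid-stride loop, so you launch far fewer threads than T):

extern "C" __global__ void CREATE_SUM_P_CUDA_strided(int T, int M, int D, double* det_sigmaArray, double* model_var, double* model_parm, double* model_mean, double* model_weight, double* sum_p, double sumPI)
{
	int stride = blockDim.x * gridDim.x;	// total number of threads in the grid
	for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < T; i += stride)
	{
		double acc = 0.0;	// accumulate in a register, one global store at the end
		for (int k = 0; k < M; k++)
		{
			double sum = 0.0;
			for (int j = 0; j < D; j++)
			{
				double temp = model_parm[i * D + j] - model_mean[k * D + j];
				sum += temp * temp / model_var[k * D + j];
			}
			acc += model_weight[k] * exp(-0.5 * sum) / (sumPI * sqrt(det_sigmaArray[k]));
		}
		sum_p[i] = acc;
	}
}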

What I'm trying to say is that, from what you describe, the kernel probably suffers from other issues which you might want to address: register pressure, L1 issues, lack of shared memory, inefficient global memory access patterns, …

my 1 cent,

eyal

Thanks. But does this mean that at the lower thread count per block (which I'm currently using) I can get even higher speeds by solving those issues? Or do you mean I could get the same speedup with other approaches? (Apart from shared memory, which I know would give much faster results.)

If you're happy with the performance you get, I guess don't optimize more.

However, from what you describe it seems that your kernel is bound by the issues I mentioned above, and reducing the threads per block is just the symptom, not the real problem.

You got down to 2 seconds with 20 threads. If you want to improve your performance further, you should investigate where your bottlenecks are and solve them. Just reducing the number of threads (to 20, which is not even a multiple of 32) and getting more performance seems to me like it's just hiding the real problem.

That's why GPUs are so beautiful - you really have to play with and understand them to get the max out of them, and it's not easy or straightforward - but it is very rewarding :)

eyal

I couldn’t agree more.