Why does it run faster on GTX 285 than GTX 570?

Hi.

I have the following function. When I run it on a GTX 285 it takes 2 seconds, but when I run it on a GTX 570 it takes 5 seconds!

I use 512 threads per block on both, but I also tried 1024 and 256 on the GTX 570, and 256 on the GTX 285. The results don't change!

Most arrays I pass are of type double.

The parameters in the problem are:

T = 350000

M = 1024

D = 16

I know this has something to do with the architecture, but I cannot figure out what. Can anyone tell me what is causing this?

Thanks,

extern "C" __global__ void CREATE_SUM_P_CUDA(int T, int M, int D, double* det_sigmaArray, int det_sigmaArrayLen0, double* model_var, int model_varLen0, double* model_parm, int model_parmLen0, double* model_mean, int model_meanLen0, double* model_weight, int model_weightLen0, double* sum_p, int sum_pLen0, double sumPI)

{

	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if (i < T)

	{

		sum_p[i] = 0.0;

		for (int k = 0; k < M; k++)

		{

			double sum = 0.0;

			for (int j = 0; j < D; j++)

			{

				double temp = model_parm[i * D + j] - model_mean[k * D + j];

				sum += temp * temp / model_var[k * D + j];

			}

			sum_p[i] += model_weight[k] * exp(-0.5 * sum) / (sumPI * sqrt(det_sigmaArray[k]));

		}

	}

}

Have you tried targeting the GTX 570 with SM_13 (compute capability 1.3 code) as well?

I just did. Still the same. 5 seconds.

I'd guess Christian wanted to suggest not doing that. How does the GTX 570 perform with code compiled for sm_20?

Maybe the GTX 570 performs worse because of the higher-precision routines used on compute capability 2.0. How does your code do with --use_fast_math?
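
For reference, that's just a compiler flag; assuming your source file is kernels.cu, something like:

nvcc -arch=sm_20 --use_fast_math -o kernels kernels.cu

It mainly affects single-precision arithmetic, so it's cheap to try.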

I have used sm_20 for the GTX 570 and sm_13 for the GTX 285. I'm not using fast math, but the interesting thing is that only some of my functions behave this way.

For example, the following functions take 28 seconds on the GTX 285 (sm_13) and 16 seconds on the GTX 570. I'm calling these functions consecutively 1024 times. Notice that my sqrt() and exp() calls operate on doubles, which use_fast_math does not speed up, so I don't think it would help.

extern "C" __global__ void EM_Update_Mean_Var_Weight2_PartOne(int T, int M, int D, int k, double* det_sigmaArray, int det_sigmaArrayLen0, double* model_var, int model_varLen0, double* model_parm, int model_parmLen0, double* model_mean, int model_meanLen0, double* model_weight, int model_weightLen0, double* sum_p, int sum_pLen0, double sumPI, double* Sum_Mean, int Sum_MeanLen0, double* Sum_Var, int Sum_VarLen0, double* SUM_P_T, int SUM_P_TLen0)

{

	int num = blockIdx.x * blockDim.x + threadIdx.x;

	if (num < T)

	{

		double num2 = 0.0;

		for (int i = 0; i < D; i++)

		{

			double num3 = model_parm[num * D + i] - model_mean[k * D + i];

			num2 += num3 * num3 / model_var[k * D + i];

		}

		double num4 = model_weight[k] * exp(-0.5 * num2) / (sumPI * sqrt(det_sigmaArray[k]));

		SUM_P_T[0] += num4 / sum_p[num];

		for (int i = 0; i < D; i++)

		{

			Sum_Mean[k * D + i] += num4 / sum_p[num] * model_parm[num * D + i];

			Sum_Var[k * D + i] += num4 / sum_p[num] * (model_parm[num * D + i] * model_parm[num * D + i]);

		}

	}

}
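
(Aside on PartOne: the += updates to SUM_P_T[0], Sum_Mean and Sum_Var are performed concurrently by all threads without synchronization, so they race. If these sums matter, one well-known way to make a double-precision add atomic on both sm_13 and sm_20 is the atomicCAS loop from the CUDA C Programming Guide; the helper name below is made up:)

__device__ double atomicAddDouble(double* address, double val)
{
	// emulate a double-precision atomicAdd via 64-bit compare-and-swap
	unsigned long long int* address_as_ull = (unsigned long long int*)address;
	unsigned long long int old = *address_as_ull, assumed;
	do {
		assumed = old;
		old = atomicCAS(address_as_ull, assumed,
		                __double_as_longlong(val + __longlong_as_double(assumed)));
	} while (assumed != old);
	return __longlong_as_double(old);
}

// usage inside PartOne, for example:
//     atomicAddDouble(&SUM_P_T[0], num4 / sum_p[num]);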

extern "C" __global__ void EM_Update_Mean_Var_Weight2_PartTwo(int T, int M, int D, int k, double* det_sigmaArray, int det_sigmaArrayLen0, double* model_var, int model_varLen0, double* model_parm, int model_parmLen0, double* model_mean, int model_meanLen0, double* model_weight, int model_weightLen0, double* sum_p, int sum_pLen0, double sumPI, double* Sum_Mean, int Sum_MeanLen0, double* Sum_Var, int Sum_VarLen0, double* SUM_P_T, int SUM_P_TLen0)

{

	int num = blockIdx.x * blockDim.x + threadIdx.x;

	if (num < D)

	{

		model_mean[k * D + num] = Sum_Mean[num] / SUM_P_T[0];

		model_var[k * D + num] = Sum_Var[num] / SUM_P_T[0] - model_mean[k * D + num] * model_mean[k * D + num];

		if (model_var[k * D + num] <= 0.0099999997764825821)

		{

			model_var[k * D + num] = 0.0099999997764825821;

		}

	}

}

And what do you mean by higher-precision routines? Are you talking about ECC?

No, I meant what I said.

Here at work we often run our code on Fermi devices with binaries compiled on CUDA SDK 2.3, targeting sm_10 (with embedded PTX code in the binary), because we don't need any of the new features provided by CUDA SDKs 3.x and 4.0.

And we find that performance scales as expected with the number of CUDA cores (the exception being SM_21 devices where the final third of the CUDA cores are not always used for architectural reasons).

No need for us to upgrade the SDK at the moment, as we're not seeing any performance degradation; hence my suggestion that he try targeting sm_13.

Christian

You may be right. But as I said, I still get the same timings with both sm_13 and sm_20.

I'm occupying less than 300 MB of device memory with my data, so I know I'm not running out of memory. What else can you guys suggest looking at?

Investigate if you have a case of memory partition camping here.

Guys I figured it out.

It seems that because I have many memory fetches per thread, having fewer threads per block can improve the result.

So I tried various block dimensions; here are the results:

On GTX 570

Threads per block    Time taken
-----------------    ----------
1024                 5 secs
512                  5 secs
256                  5 secs
192                  5 secs
128                  5 secs
64                   5 secs
40                   3 secs
20                   2 secs
16                   3 secs
10                   4 secs (yes, I dared trying)

So you see, this really affects the performance.

My other function also went from 20 seconds with 512 threads to 8 seconds with 20 threads!

Thank you for pointing out the camping, cbuchner1.

Maybe consider changing the memory layout of your data structures (padding, etc.) instead of reducing occupancy - roughly like the sketch below.

Going to 20 threads per block seems weird.
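
For illustration, the padding idea could look something like this (PAD = 2 is a made-up value; the right amount depends on the device's partition width):

#include <cuda_runtime.h>

// Pad each D-element row to D_PAD elements so that consecutive rows start
// in different memory partitions; the extra PAD doubles per row are unused.
enum { D = 16, M = 1024, PAD = 2, D_PAD = D + PAD };

double* alloc_model_var_padded(void)
{
	double* model_var = 0;
	cudaMalloc((void**)&model_var, (size_t)M * D_PAD * sizeof(double));
	return model_var;
}

// the kernels then index with the padded pitch instead of D, e.g.:
//     sum += temp * temp / model_var[k * D_PAD + j];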

Is that intentional or just a typo:

for (int k = 0; k < M; i++)

Didn’t you mean:

for (int k = 0; k < M; k++)

Also, your memory access pattern seems weird - maybe try shared memory, or change the data layout as Christian suggested.
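
For example, every thread in a block reads the same model_mean/model_var row for a given k, so you could stage that row in shared memory once per block. A rough, untested sketch of your first kernel (length parameters dropped for brevity):

extern "C" __global__ void CREATE_SUM_P_CUDA_shared(int T, int M, int D, double* det_sigmaArray, double* model_var, double* model_parm, double* model_mean, double* model_weight, double* sum_p, double sumPI)
{
	extern __shared__ double s[];	// 2 * D doubles, size passed at launch
	double* s_mean = s;
	double* s_var = s + D;
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	double acc = 0.0;
	for (int k = 0; k < M; k++)
	{
		// cooperatively load the k-th mean/var row once per block
		for (int j = threadIdx.x; j < D; j += blockDim.x)
		{
			s_mean[j] = model_mean[k * D + j];
			s_var[j] = model_var[k * D + j];
		}
		__syncthreads();
		if (i < T)
		{
			double sum = 0.0;
			for (int j = 0; j < D; j++)
			{
				double temp = model_parm[i * D + j] - s_mean[j];
				sum += temp * temp / s_var[j];
			}
			acc += model_weight[k] * exp(-0.5 * sum) / (sumPI * sqrt(det_sigmaArray[k]));
		}
		__syncthreads();	// before the next k overwrites s[]
	}
	if (i < T)
		sum_p[i] = acc;
}

You'd launch it with the dynamic shared memory size, e.g. CREATE_SUM_P_CUDA_shared<<<blocks, threads, 2 * D * sizeof(double)>>>(...).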

eyal

That was a typo, thank you.

This is basically the structure I’m trying to access:

For each value of T (samples)

     For each value of M (mixtures in the sample)

         For each value of D (dimensions/features of that sample)

             Do something.

     Calculate sum_p for that sample

The reason for decreasing the number of threads was the GTC 2010 talk "Better Performance at Lower Occupancy".

I tried it and it worked. Currently I'm under a time constraint, so maybe after August I will change my implementation to use shared memory.

But thank you.

"Better Performance at Lower Occupancy" doesn't mean that if you decrease the threads per block you'll get more performance. You still need to do the same amount of work, now with fewer threads. This is what Volkov meant - you decrease the number of threads but give each thread more work, so you end up with the same amount of work being done.
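
Applied to your first kernel, the idea could look roughly like this (untested sketch; each thread walks several samples via a grid-stride loop, so you launch far fewer threads than T):

extern "C" __global__ void CREATE_SUM_P_CUDA_strided(int T, int M, int D, double* det_sigmaArray, double* model_var, double* model_parm, double* model_mean, double* model_weight, double* sum_p, double sumPI)
{
	int stride = blockDim.x * gridDim.x;	// total number of threads in the grid
	for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < T; i += stride)
	{
		double acc = 0.0;	// accumulate in a register, one global store at the end
		for (int k = 0; k < M; k++)
		{
			double sum = 0.0;
			for (int j = 0; j < D; j++)
			{
				double temp = model_parm[i * D + j] - model_mean[k * D + j];
				sum += temp * temp / model_var[k * D + j];
			}
			acc += model_weight[k] * exp(-0.5 * sum) / (sumPI * sqrt(det_sigmaArray[k]));
		}
		sum_p[i] = acc;
	}
}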

What I'm trying to say is that, from what you describe, the kernel probably suffers from other issues which you might want to address: register pressure, L1 issues, lack of shared memory, inefficient global memory access patterns, …

my 1 cent,

eyal

Thanks. But does this mean that at the lower thread count per block (which I'm currently using) I can get even higher speeds by solving those issues? Or do you mean I could get the same speedup with other approaches? (Apart from shared memory, which I know would give much faster results.)

If you're happy with the performance you get, I guess don't optimize more.

However, from what you describe it seems that your kernel is bound by the issues I mentioned above, and reducing the threads per block is just the symptom, not the real problem.

You got down to 2 seconds with 20 threads. If you want to improve your performance further, you should investigate where your bottlenecks are and solve them. Just reducing the number of threads (to 20, which is not even a multiple of 32) and getting more performance seems to me like it's just hiding the real problem.

That's why GPUs are so beautiful - you really have to play with and understand them to get the max out of them, and it's not easy or straightforward - but it is very rewarding :)

eyal

I couldn’t agree more.