Can't understand this performance hit

Hi,

I am confused by the following problem (I am new to CUDA):

I have the following host code (Windows Xp 32 bit, Nvidia GTX 295, Visual Studio 2008):

dim3 dimBlock(400,1,1);
dim3 dimGrid(10,1,1);

for (i = 2; i < 2*N-1; i++)
{
    if (i <= N)
    {
        low_j  = 1;
        high_j = i-1;
    }
    else
    {
        low_j  = i-N+1;
        high_j = N-1;
    }
    kernel<<<dimGrid,dimBlock>>>(d_Psi, i, low_j, high_j, 2);
}

the kernel is
__global__ void kernel(double *Psi, int i, int low_j, int high_j, int l)
{
    int j = low_j + blockIdx.x*blockDim.x + threadIdx.x;

    if (j <= high_j)
    {
        int oldi = (i-j);
        double u = oldi*Deltau;
        double v = j*Deltav;
        int index1 = oldi*N+j;
        int index2 = oldi*N+j-N;
        int index3 = oldi*N+j-1;
        int index4 = oldi*N+j-1-N;

        double rstar = (v-u)/2.0;
        double rcoord = r(rstar);

        //compute Psi_N
        //find next point in diamond
        Psi[index1] = Psi[index2]+Psi[index3]-Psi[index4]-Deltau*Deltav*(Psi[index3]+Psi[index2])/8.0;
    }
}

This code runs about 5 times slower than a CPU version. However, if I change the kernel variable oldi from i-j to i, it runs 10-100 times faster than the CPU version. Why is that? How can I make it run just as fast while using i-j rather than i?

Thanks

deb

Hi,

The Psi[index1] = Psi[index2]… access is obviously not coalesced - you should either load things into shared memory or use textures.

As for the oldi change: I guess that when you change it to just i, the indices, and thus the memory access pattern, become fixed for all threads within the block.

This has the effect of the entire block doing only one (or a few: index2, index3, index4…) global memory accesses via broadcasting, instead of many uncoalesced accesses.
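For illustration only - a toy kernel, not your code, with N and i just passed in as parameters - the two indexing patterns look like this:

[codebox]
__global__ void access_pattern_demo(double *Psi, int i, int N)
{
    int j = threadIdx.x;

    // oldi = i - j: thread j reads Psi[(i-j)*N + j] while thread j+1 reads
    // Psi[(i-j-1)*N + j+1], i.e. addresses N-1 elements apart, so the
    // half-warp's accesses cannot be combined into one memory transaction.
    double strided = Psi[(i - j)*N + j];

    // oldi = i: the row offset i*N is the same for every thread and only j
    // varies, so the half-warp touches consecutive addresses that the
    // compute 1.3 hardware can coalesce into a single transaction.
    double contiguous = Psi[i*N + j];

    Psi[j] = strided + contiguous;   // use the loads so they are not optimised away
}
[/codebox]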

eyal

Thanks Eyal. Yes, you are most likely right about the need to load the array accesses through shared/texture memory, but I still get the hit when I change the kernel to the following:

__global__ void kernel(double *Psi, int i, int low_j, int high_j, int l)
{
    int j = low_j + blockIdx.x*blockDim.x + threadIdx.x;

    if (j <= high_j)
    {
        int oldi = (i-j);
        double u = oldi*Deltau;
        double v = j*Deltav;
        int index1 = (i-j)*N+j;
        int index2 = (i-j)*N+j-N;
        int index3 = (i-j)*N+j-1;
        int index4 = (i-j)*N+j-1-N;

        double rstar = (v-u)/2.0;
        double rcoord = r(rstar);

        //compute Psi_N
        //find next point in diamond
        Psi[index1] = Psi[index2]+Psi[index3]-Psi[index4]-Deltau*Deltav*(Psi[index3]+Psi[index2])/8.0;
    }
}

Notice how I am really not using oldi other than to compute u. If I keep it as it is, the code is 5 times slower than the CPU version. If, however, I make oldi=i it becomes 10 times faster. This is what I don't get. In this case it doesn't seem to be a problem of not loading into shared memory, since oldi is not being used in the indices.

Thanks again

deb

Check your register allocation.

Hi Lev,

Can you explain what you mean by “check your register allocation”?

thanks

What is r in: “double rcoord = r(rstar);” ??

Also - when you make the change to oldi=i, do you get a valid result? Did you check the kernel's return status after calling it?
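For example, a minimal check after the launch (a sketch only - adapt the names to your code) could be:

[codebox]
kernel<<<dimGrid, dimBlock>>>(d_Psi, i, low_j, high_j, 2);

cudaError_t err = cudaGetLastError();    // catches launch/configuration errors
if (err == cudaSuccess)
    err = cudaThreadSynchronize();       // waits for the kernel, catches execution errors
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
[/codebox]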

I guess that when you make the change, the compiler is able to do some optimization/throw away dead code/… and hence the speed change - you just need to figure out which. :) That's why I asked about the r function.

BTW, Lev, telling the OP to check register usage without any reason/explanation is not much help IMHO…

eyal

r(rstar) is a device function that is being called. No, changing from i-j to i I don't get a valid result. I just noticed this runtime hit when I made the change, and I don't get why the two versions don't perform roughly the same. I also thought the compiler was doing some optimization when I use i, but I figured that if I put i-j in the register oldi, the compiler should be able to do the same optimization. Apparently not.

Can you post the r(rstar) function? The best way to find the reason why the code behaves like that, in my experience, is to try removing pieces of code and compare results (make sure that the compiler didn't optimize out the entire kernel, and check that the kernel did indeed run and didn't crash). By removing/changing code you might be able to find the reason for this speed change.

eyal

Thanks Eyal. Until now I hadn't thought the problem might be in the r(rstar) function, but after your post I will follow your suggestion of removing/changing code in it. The function is this (rstartrans and reps are constants defined in a header file):

__device__ double r(double rstar)
{
    double riter = rstar;
    double rfinal = 0.0;

    if (rstar < rstartrans)
    {
        riter = 0.1;
    }

    while (true)
    {
        if (rstar < rstartrans)
        {   //for small rstar iterate equation r_1=exp(rstar-1-r_1), r_1=r-1
            rfinal = exp(rstar-1-riter);
            if (fabs(rfinal-riter) < reps)
                break;
        }
        else
        {   //for large rstar iterate equation r=rstar-ln(r-1)
            rfinal = rstar-log(riter-1);
            if (fabs(rfinal-riter) < reps)
                break;
        }
        riter = rfinal;
    }

    if (rstar < rstartrans)
        return riter+1.0;
    else
        return riter;
}

With oldi = i-j the compiler needs additional registers. If you set a limit of, for example, 16 registers, you may run out of that number and pay for expensive register spilling to memory. We need to know whether the register count changed (see the compile example below).
Ah, and with rstar you most likely go down the slow path of the mathematical functions (log, exp etc.), so the whole r(rstar) function may take more time. Btw, are you sure you need full double precision?
And where is the rstar result used?
And with oldi = i you may get much less branch divergence.
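For example, assuming you build with nvcc on the command line (the file name is just a placeholder; in Visual Studio the same options go into the CUDA build rule):

[codebox]
nvcc -arch=sm_13 -Xptxas -v -c kernel.cu                     # prints registers / smem / lmem per kernel
nvcc -arch=sm_13 --maxrregcount=16 -Xptxas -v -c kernel.cu   # example of imposing a register limit
[/codebox]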

Have you tried making the same change in the CPU version and measuring the performance? You might be surprised…

And to actually speed up the GPU calculation, you might try to start the iteration using faster single precision arithmetic. Something along these lines (completely untested!):

[codebox]
const float reps = 1e-5f;

__device__ double r(double rstar)
{
    float riter, rfinal, mix;
    float rstar_f = __double2float_rn(rstar);

    if (rstar < 2.0)
    {   //for small rstar iterate equation r_1=exp(rstar-1-r_1), r_1=r-1
        riter = 0.1f;
        mix = 0.125*rstar_f + 0.25;
        if (mix < 0.0f)
            mix = 0.0f;
        while (true)
        {
            rfinal = expf(rstar_f-1-riter);
            if (fabsf(rfinal-riter) < reps)
                break;
            riter = mix*riter + (1-mix)*rfinal;
        }
        return exp(rstar-1-rfinal) + 1.0;
    }
    else
    {   //for large rstar iterate equation r=rstar-ln(r-1)
        riter = rstar_f;
        mix = 1.0f/rstar_f;
        while (true)
        {
            rfinal = rstar_f - __logf(riter-1);
            if (fabsf(rfinal-riter) < reps)
                break;
            riter = mix*riter + (1-mix)*rfinal;
        }
        return rstar - log(rfinal-1);
    }
}
[/codebox]

Or, better yet, just use a fixed number of iterations (the warp will have to wait for the slowest converging thread anyway):

[codebox]
const int iter = 5;

__device__ double r(double rstar)
{
    float riter, rfinal, mix;
    float rstar_f = __double2float_rn(rstar);

    if (rstar < 2.0)
    {   //for small rstar iterate equation r_1=exp(rstar-1-r_1), r_1=r-1
        riter = 0.1f;
        mix = 0.125*rstar_f + 0.25;
        if (mix < 0.0f)
            mix = 0.0f;
#pragma unroll
        for (int i = 0; i < iter; i++)
        {
            rfinal = expf(rstar_f-1-riter);
            riter = mix*riter + (1-mix)*rfinal;
        }
        return exp(rstar-1-riter) + 1.0;
    }
    else
    {   //for large rstar iterate equation r=rstar-ln(r-1)
        riter = rstar_f;
        mix = 1.0f/rstar_f;
#pragma unroll
        for (int i = 0; i < iter; i++)
        {
            rfinal = rstar_f - __logf(riter-1);
            riter = mix*riter + (1-mix)*rfinal;
        }
        return rstar - log(riter-1);
    }
}
[/codebox]

DISCLAIMER: Completely untested. I haven’t checked how __double2float_rn() behaves with doubles that cannot be represented as floats, so another conditional might be needed here. OTOH, your routine apparently doesn’t have to deal with extremely large numbers anyway.

Thanks to all for the many ideas. Here are some things I found/answers:

1. I need to use double precision throughout (otherwise errors will propagate quickly).

2. The issue does not appear to be running out of registers. I have used just two registers and I get roughly the same runtime numbers.

3. Thanks Tera for your optimization of the r(rstar) device function. I will definitely try it out. But the main slowdown does not appear to be coming from it.

4. I found the piece of code that is causing the slowdown. I hadn't paid attention to it until Eyal pointed out that the problem might be in a device function. Below is the complete kernel function. Notice how there is a variable Vl (which I didn't show in previous posts because I didn't think it would cause any issue).

5. It is this Vl that is causing the slowdown. If in the code below I make this single change:

   double u = i*Deltau;

   the code will run about 40 times faster.

6. I don't see how it could be due to thread divergence, since the divergence is the same.

7. If I comment out Vl, then I get the same run times whether I use double u=i*Deltau or double u=(i-j)*Deltau. So it can't be r(rstar). It must be Vl that is slowing it down then. It has several double precision divisions by r(rstar). These divisions differ between the two cases, and somehow this causes the issue.


__global__ void kernel(double *Psi, int i, int low_j, int high_j, int l)
{
    int j = low_j + blockIdx.x*blockDim.x + threadIdx.x;

    if (j <= high_j)
    {
        double u = (i-j)*Deltau;
        double v = j*Deltav;
        int index1 = (i-j)*N+j;
        int index2 = (i-j)*N+j-N;
        int index3 = (i-j)*N+j-1;
        int index4 = (i-j)*N+j-1-N;

        double rstar = (v-u)/2.0;
        double rcoord = r(rstar);
        double Vl = (1.0-2*M/rcoord)*(l*(l+1)/(rcoord*rcoord)+2*M/(rcoord*rcoord*rcoord));

        Psi[index1] = Psi[index2]+Psi[index3]-Psi[index4]-Deltau*Deltav*(Psi[index3]+Psi[index2])/8.0*Vl;
    }
}
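(As an aside, I guess those repeated double precision divisions by rcoord could be collapsed into a single one - something like this, untested:)

[codebox]
double inv_r = 1.0/rcoord;    // one double precision division instead of several
double Vl = (1.0 - 2*M*inv_r)
          * (l*(l+1)*inv_r*inv_r + 2*M*inv_r*inv_r*inv_r);
[/codebox]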

I think you are mixing up CUDA-related issues and algorithm-related ones. Btw, you did not say how many registers and how much local memory the program uses in each case. Maybe you need to increase the register limit. Btw, your program will not run very fast anyway, because you launch too few threads
dim3 dimBlock(400,1,1); dim3 dimGrid(10,1,1);
to load the GPU - you can only load 5-10 of the 30 streaming multiprocessors.
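For example, something like this (untested sketch, names taken from your host loop) would use smaller blocks and size the grid to the work of each outer iteration:

[codebox]
int n_threads = high_j - low_j + 1;                            // j values needed this step
dim3 dimBlock(128, 1, 1);                                      // a multiple of the warp size (32)
dim3 dimGrid((n_threads + dimBlock.x - 1)/dimBlock.x, 1, 1);   // as many blocks as needed
kernel<<<dimGrid, dimBlock>>>(d_Psi, i, low_j, high_j, 2);     // the j<=high_j check handles overshoot
[/codebox]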

Hi Lev,

As you can see from my code I am not using local memory at all (not yet). I doubt it is a problem of running out of registers. In the kernel function above I have eliminated the thread variables (registers) and the runtime is about the same.

I think my puzzlement is that if I run these two loops on the CPU (with i or i-j) I get the same runtime, but if I do it on the GPU I observe this difference in runtime.

Ahh, that’s one thing I haven’t thought of: the CPU is less sensitive to singularities where the iteration takes very long. On the GPU the influence is larger, however, as a whole warp keeps iterating even if only one thread has not yet converged.

I’d still like to see what happens with my optimized routines or if you use a fixed number of iterations.

Hi Tera,

I think you most likely are right. This must be an issue tied to the behavior of the warp in the two cases. I will post what I observe/get after doing the changes you suggested.

"As you can see from my code I am not using local memory at all (not yet). "
You may be using local memory anyway… sin, cos and exp sometimes use it. And if you set a small register limit, you end up using a lot of local memory for register spilling. How many registers does your kernel use in both cases? It can have up to 128 registers available.

Hi Lev,

Here is the info I get when compiling the i-j version (the one with slow performance):

ptxas info : Compiling entry function ‘kernel’

1>ptxas info : Used 23 registers, 20+16 bytes smem, 88 bytes cmem[1]

and when using i (faster):

ptxas info : Compiling entry function ‘kernel’

1>ptxas info : Used 22 registers, 20+16 bytes smem, 88 bytes cmem[1]

I assume smem is shared memory and cmem is constant memory? What does this tell me about performance?

Thanks

deb

Looks like registers are not an issue here. So most likely the algorithm's runtime itself changed. Btw, do you observe a difference in runtime if you switch to floats and fast math functions?
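For example, assuming you can build from the nvcc command line (the file name is just a placeholder), fast math can be switched on globally:

[codebox]
nvcc -arch=sm_13 --use_fast_math -Xptxas -v -c kernel.cu
[/codebox]

Note this only affects single precision functions like expf/logf, not the double precision ones.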