the kernel is

__global__ void kernel(double *Psi, int i, int low_j, int high_j, int l)
{
    int j = low_j + blockIdx.x*blockDim.x + threadIdx.x;
    if (j <= high_j)
    {
        int oldi = (i-j);
        double u = oldi*Deltau;
        double v = j*Deltav;
        int index1 = oldi*N+j;
        int index2 = oldi*N+j-N;
        int index3 = oldi*N+j-1;
        int index4 = oldi*N+j-1-N;
        double rstar = (v-u)/2.0;
        double rcoord = r(rstar);
        //compute Psi_N
        //find next point in diamond
        Psi[index1] = Psi[index2]+Psi[index3]-Psi[index4]-Deltau*Deltav*(Psi[index3]+Psi[index2])/8.0;
    }
}
This code runs about 5 times slower than a CPU version. However, if I change the kernel variable oldi from i-j to i, it runs 10-100 times faster than the CPU version. Why is that? How can I make it run as fast using i-j as it does with i?
The Psi[index1] = Psi[index2]… line is obviously not coalesced - you should either load things into shared memory or use textures.
As for the oldi change, I guess when you change it to i only, this makes the indexes - and thus the memory access pattern - fixed for all threads within the block.
This has the effect of the entire block doing only one (or a few: index2, index3, index4…) global memory accesses via broadcasting, instead of many uncoalesced accesses.
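To make the access-pattern point concrete, here is a host-side sketch (i, j, N as in the kernel above; the example values are made up) of the global-memory index that consecutive threads touch in the two variants:

```c
/* Index computed by thread j in the two variants of the kernel above.
 * With oldi = i, neighbouring threads j and j+1 touch adjacent doubles,
 * so a half-warp reads one coalesced segment; with oldi = i-j they land
 * 1-N elements apart, so every thread costs its own memory transaction. */
int idx_row(int i, int j, int N)  { return i * N + j; }        /* oldi = i   */
int idx_diag(int i, int j, int N) { return (i - j) * N + j; }  /* oldi = i-j */
```

For example, with N = 1000 and i = 500, threads j = 100 and j = 101 read elements 500100 and 500101 in the first variant, but 400100 and 399101 in the second.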
Thanks Eyal. Yes, you are most likely right about the need to load into shared/texture memory as far as the array accesses go, but I still get the hit when I change the kernel to the following:
__global__ void kernel(double *Psi, int i, int low_j, int high_j, int l)
{
int j = low_j + blockIdx.x*blockDim.x + threadIdx.x;
Notice how I am really not using oldi other than to compute v. If I keep it as it is, the code will be 5 times slower than the CPU version. If however I make oldi=i it becomes 10 times faster. This is what I don't get. In this case it doesn't seem to be a problem of not loading into shared memory, since oldi is not being used in the indices.
Also - when you make the change to oldi=i, do you get a valid result? Did you check your kernel's return value after calling it?
I guess when you make the change the compiler is able to make some optimization / throw away dead code / … and thus the speed change - you just need to figure out why. :) That's why
I asked about the r function.
BTW, Lev, telling the OP to check register usage without any reason/explanation is not much help IMHO…
r(rstar) is a device function that is being called. No, changing from i-j to i I don't get a valid result; I just noticed this runtime hit when I made the change, and I don't get why it won't perform roughly the same. I also thought the compiler was doing some optimization when I use i, but I figured that if I put i-j in the register oldi, the compiler should be able to do the same optimization. Apparently not.
Can you post the r(rstar) function? The best way to find the reason why the code behaves like that, in my experience, is to try removing code and comparing results (make sure
that the compiler didn't optimize out the entire kernel, and check that the kernel did indeed run and didn't crash). By removing/changing the code you might be able to find the culprit.
Thanks Eyal. Until now I hadn't thought the problem might be in the r(rstar) function, but after your post I will follow your suggestion of removing/changing code in it. The function is this (rstartrans and reps are constants defined in a header file):
__device__ double r(double rstar)
{
    double riter = rstar;
    double rfinal = 0.0;
    if (rstar < rstartrans)
    {
        riter = 0.1;
    }
    while (true)
    {
        if (rstar < rstartrans)
        {//for small rstar iterate equation r_1=exp(rstar-1-r_1), r_1=r-1
            rfinal = exp(rstar-1-riter);
            if (fabs(rfinal-riter) < reps)
                break;
        }
        else
        {//for large rstar iterate equation r=rstar-ln(r-1)
            rfinal = rstar-log(riter-1);
            if (fabs(rfinal-riter) < reps)
                break;
        }
        riter = rfinal;
    }
    return (rstar < rstartrans) ? rfinal+1.0 : rfinal; //small-rstar branch iterates r_1 = r-1
}
With oldi = i-j the compiler needs additional registers. If you set, for example, a limit of 16 registers, you may run out of them and perform expensive register spilling to memory. We need to know whether the register count changed.
Ah, with rstar you most likely go down the slow path of the mathematical functions, log, exp etc., and the whole function r(rstar) may take more time. BTW, are you sure you need full double precision?
And where is the rstar result used?
And with oldi = i you may get much less branch divergence.
And to actually speed up the GPU calculation, you might try to start the iteration using faster single-precision arithmetic. Something along the lines of (completely untested!):
DISCLAIMER: Completely untested. I haven’t checked how __double2float_rn() behaves with doubles that cannot be represented as floats, so another conditional might be needed here. OTOH, your routine apparently doesn’t have to deal with extremely large numbers anyway.
Thanks to all for the many ideas. Here are some things I found/answers:
I need to use double precision throughout (otherwise errors will propagate quickly)
The issue does not appear to have to do with running out of registers. I have used just two registers and I get the same runtime numbers (roughly)
Thanks Tera for your optimization to the r(rstar) device function. I will definitely try it out. But the main slowdown does not appear to be coming from it.
I found the piece of code that is causing the slowdown. I hadn't paid attention to it until Eyal pointed out that the problem might be in a device function. Here is the complete kernel function. Notice how there is
a variable Vl (that I didn't show in previous posts because I didn't think it would cause any issue ;) ).
It is this Vl that is causing the slowdown. If in the code below I make this single change:
double u = i*Deltau;
the code will run about 40 times faster.
I don't see how it could be due to thread divergence, since the divergence is the same.
If I comment out Vl, then I get the same run times whether I use double u=i*Deltau or double u=(i-j)*Deltau. So it can't be r(rstar).
It must be Vl that is slowing it down then. It has several double-precision divisions by r(rstar). These divisions differ between the two cases, and somehow this causes the issue.
__global__ void kernel(double *Psi, int i, int low_j, int high_j, int l)
{
    int j = low_j + blockIdx.x*blockDim.x + threadIdx.x;
    if (j <= high_j)
    {
        double u = (i-j)*Deltau;
        double v = j*Deltav;
        int index1 = (i-j)*N+j;
        int index2 = (i-j)*N+j-N;
        int index3 = (i-j)*N+j-1;
        int index4 = (i-j)*N+j-1-N;
        double rstar = (v-u)/2.0;
        double rcoord = r(rstar);
        double Vl = (1.0-2*M/rcoord)*(l*(l+1)/(rcoord*rcoord)+2*M/(rcoord*rcoord*rcoord));
        Psi[index1] = Psi[index2]+Psi[index3]-Psi[index4]-Deltau*Deltav*(Psi[index3]+Psi[index2])/8.0*Vl;
    }
}
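One cheap experiment for the Vl line (a sketch, not something anyone in the thread proposed): double-precision division is expensive on the GPU, and Vl performs three of them, so computing a single reciprocal of rcoord and multiplying may already help. The algebra, on the host:

```c
/* The Vl line above does three double divisions by rcoord.  The same
 * value (up to rounding) needs only one division via a reciprocal. */
double Vl_div(double rcoord, double M, int l)
{
    return (1.0 - 2*M/rcoord) *
           (l*(l+1)/(rcoord*rcoord) + 2*M/(rcoord*rcoord*rcoord));
}

double Vl_recip(double rcoord, double M, int l)
{
    double ir  = 1.0 / rcoord;   /* the only division */
    double ir2 = ir * ir;
    return (1.0 - 2*M*ir) * (l*(l+1)*ir2 + 2*M*ir2*ir);
}
```

Whether this explains the i vs. i-j gap is a separate question, but it cuts the per-point cost of Vl either way.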
I think you are confusing CUDA-related issues with algorithm-related ones. BTW, you did not say how many registers and how much local memory the program uses in each case. Maybe you need to increase the register limit. BTW, your program will not run very fast anyway, because you launch too few threads
dim3 dimBlock(400,1,1); dim3 dimGrid(10,1,1);
to load the GPU. You can only load 5-10 of the 30 streaming multiprocessors.
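To put numbers on that: with dimGrid(10,1,1) only ten blocks exist, so at most ten SMs ever get work. Sizing the grid from the j-range with smaller blocks (128 threads per block is an assumed choice, not from the thread) gives the scheduler enough blocks to cover the chip:

```c
/* Number of blocks needed to cover j in [low_j, high_j] with the given
 * block size, rounding up so the tail is covered (the kernel's
 * "if (j <= high_j)" guard masks out the surplus threads). */
int blocks_for_range(int low_j, int high_j, int threads_per_block)
{
    int n = high_j - low_j + 1;
    return (n + threads_per_block - 1) / threads_per_block;
}
```

The same 4000 points that 400x10 covers become 32 blocks of 128 threads, enough to put at least one block on each of the 30 SMs.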
As you can see from my code, I am not using local memory at all (not yet). I doubt it is a problem of running out of registers: in the kernel function above I have eliminated the thread variables (registers) and the runtime is about the same.
I think my puzzlement is that if I run these two loops on the CPU (with i or i-j) I get the same runtime, but if I do it on the GPU I observe this difference in runtime.
Ahh, that's one thing I hadn't thought of: the CPU is less sensitive to singularities where the iteration takes very long. On the GPU the influence is larger, however, as a whole warp keeps iterating even if only one thread has not yet converged.
I'd still like to see what happens with my optimized routines, or if you use a fixed number of iterations.
I think you are most likely right. This must be an issue tied to the behavior of the warp in the two cases. I will post what I observe/get after making the changes you suggested.
"As you can see from my code I am not using local memory at all (not yet)."
You may use local memory… sin, cos and exp sometimes use it. And if you set a small register limit, you end up using a lot of local memory for register spilling. How many registers does your kernel use in both cases? You have up to 128 registers available.
Looks like registers are not an issue here, so most likely the algorithm time itself changed. BTW, do you observe a difference in runtime if you switch to floats and fast math functions?