Hankel transform: 300x speed-up on Titan. Is it possible to get more?

Hello,

I wrote a discrete Hankel transform (Hankel Transform -- from Wolfram MathWorld). Basically it is just a direct port of the formula from here: GNU Scientific Library — GSL 2.7 documentation.

The basic algorithm can be summarized as a matrix-vector multiplication, K = HM*R. The matrix HM is constant, but it takes a lot of memory, so I am calculating each element HM[i][j] on the fly.
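For reference, on the CPU the whole transform is just a double loop over the matrix elements (a minimal sketch, not the actual attached code; I assume here that the dat type is double, j0/j1 are the Bessel functions from math.h, and the variable names mirror the GPU kernel below):

#include <math.h>

/* Sketch of K = HM*R with HM[i][j] computed on the fly; one output element per i. */
void cpu_ht_transf(const double *in, double *out, const double *mone,
                   const double *mtwo, const double *bessolutions, int nr)
{
    double maxbessol = bessolutions[nr];
    for (int i = 0; i < nr; i++) {
        double aa = 0.0;
        for (int j = 0; j < nr; j++) {
            double hm = (2.0/maxbessol) * j0(bessolutions[j]*bessolutions[i]/maxbessol)
                        / (fabs(j1(bessolutions[j])) * fabs(j1(bessolutions[i])));
            aa += hm * in[j] * mtwo[i] / mone[j];
        }
        out[i] = aa;
    }
}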

At this moment the speed-up compared to the single-core CPU version is 300x. Is it possible to get more performance?

My implementation is very similar to the N-body algorithm from GPU Gems 2. This is the relevant code:

#define tpb 1024        // threads per block
#define bpg 384         // blocks per grid
#define minnbpsmp 2     // minimum blocks per multiprocessor for __launch_bounds__

__global__
__launch_bounds__(tpb,minnbpsmp)
void gpu_ht_transf_otpp(double *in,double *out,double *mone,double *mtwo, dat *bessolutions,const int nr)
{
   int i = blockIdx.x*tpb+threadIdx.x;   // output element computed by this thread
   double aa = 0.0;                      // accumulator for out[i]
   int j;
   double hcij;
   __shared__ dat bessol[tpb];           // tile of Bessel-function zeros
   __shared__ double innn[tpb];          // tile of the input vector
   __shared__ double moneee[tpb];        // tile of mone

   double mybessol,mymtwo,maxbessol;
   //if(i<nr)
   {
      mybessol=bessolutions[i];
      mymtwo=mtwo[i];
   }
   maxbessol=bessolutions[nr];

   // Process the row in tiles of tpb columns, as in the GPU Gems N-body kernel.
   for(int p=0;p<bpg;p++)
   {
      j=p*tpb+threadIdx.x;

      // Each thread stages one element of the current tile into shared memory.
      bessol[threadIdx.x]=bessolutions[j];
      innn[threadIdx.x]=in[j];
      moneee[threadIdx.x]=mone[j];

      __syncthreads();
//#pragma unroll     10
      for(int jt=0;jt<tpb; jt++)
      {
         j=p*tpb+jt;
         // HM[i][j]*R[j], with the matrix element HM[i][j] computed on the fly
         hcij=((2.0/maxbessol)*j0(bessol[jt]*mybessol/maxbessol))
              /(fabs(j1(bessol[jt]))*fabs(j1(mybessol)))*innn[jt]*mymtwo/moneee[jt];
         aa+=hcij;
      }
      __syncthreads();
   }

   out[i]=aa;
}
// nr=bpg*tpb

gpu_ht_transf_otpp<<<bpg,tpb>>>(d_rfhat,d_kfhat,mvectors.mone.device,mvectors.mtwo.device, mvectors.bessolutions.device,nr);

I also attached the code, which can be compiled and run. It does a simple test with 2 transforms. The CPU test is commented out because it is very slow.

Here is the whole code.

Thanks for posting code, this looks great!

As far as speed-up goes, does it need to calculate in 64-bit? Obviously you would get a massive speedup going to 32-bit.

From looking at the code, it would seem to be compute limited on account of the Bessel functions j0() and j1() which are very expensive to compute. Have you had a chance to try CUDA 6.5? Many double-precision math functions have been improved in CUDA 6.5, although I don’t know whether this extends to the Bessel functions. The comments in the code indicate that the compilation target is sm_30, although Titan is an sm_35 device. Is there a particular reason for this?

As a micro optimization, you could try pre-computing 1.0/maxbessol in the following:

hcij=((2.0/maxbessol)*j0(bessol[jt]*mybessol/maxbessol))

With the exception of FMA contraction, the CUDA compiler does not re-associate floating-point computation so as to preserve the numerical properties of code. Changing the two divisions to a multiplication with the reciprocal would likely have a negligible effect on this computation, but it’s impossible to tell from just looking at the code.

By the same token, you might try precomputing 1.0/moneee, then multiplying by that instead of dividing.
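As a sketch of what I mean (the helper function below is only for illustration; in your kernel you would simply hoist the two reciprocals out of the innermost loop, inv_maxbessol once per thread and inv_mone once per staged tile element):

// Illustration only: the per-element term with both divisions replaced by
// multiplications with precomputed reciprocals.
__device__ double ht_term(double bessol_j, double mybessol, double inv_maxbessol,
                          double in_j, double mymtwo, double inv_mone_j)
{
    return (2.0*inv_maxbessol) * j0(bessol_j*mybessol*inv_maxbessol)
           / (fabs(j1(bessol_j))*fabs(j1(mybessol)))
           * in_j * mymtwo * inv_mone_j;
}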

Hello,

Thanks for the reply. I am not sure why I used 30; maybe because I wrote the code on my laptop (only a 30x speed-up :( ), but I will try the 35 flag as well.

I will try to precompute the division operations. My first thought was that all these divisions are somehow optimized, and also that they would not matter much compared to the memory loads. My original code was supposed to be as simple as possible, but now that I have seen it works, I think I can make more changes.

I will try float precision as well. My first attempt failed, but after a few hours I found a bug, so maybe single precision will work. On the CPU, using single vs. double precision gives slightly different results, but the thing is so slow I was not able to do a proper comparison. In the programming guide (appendix D) a 1.0e-6 error is indicated for the single-precision j0f and j1f, and 1.0e-12 for the double-precision versions; this is quite significant, but maybe it will work. A little “numerical noise” sometimes helps.

Just a few words about the algorithm:

I have to solve an equation of the form g(r) = exp(-u(r) + c(r) + b(r)) ( Hyper-netted chain page on SklogWiki - a wiki for statistical mechanics and thermodynamics ). The unknowns are g(r) and c(r), but they are connected in k space. So at every iteration I need to do one forward Hankel transform and one backward.
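Schematically, each iteration has the following structure (just a sketch; the helper-function names below are placeholders for illustration, not the actual code):

/* One iteration: forward transform, solve the k-space relation, backward
   transform, then apply the closure g(r) = exp(-u(r)+c(r)+b(r)).          */
void hankel_forward (const double *f_r, double *f_k, int nr);   /* the transform discussed here */
void hankel_backward(const double *f_k, double *f_r, int nr);   /* its inverse                  */
void solve_in_k_space(const double *c_k, double *gamma_k, int nr);
void apply_closure(const double *u_r, const double *gamma_r, double *c_r, double *g_r, int nr);

void hnc_iterate(const double *u_r, double *c_r, double *g_r, double *c_k,
                 double *gamma_k, double *gamma_r, int nr, int max_iter)
{
    for (int it = 0; it < max_iter; it++) {
        hankel_forward(c_r, c_k, nr);          /* forward Hankel transform          */
        solve_in_k_space(c_k, gamma_k, nr);    /* relation connecting the unknowns  */
        hankel_backward(gamma_k, gamma_r, nr); /* backward Hankel transform         */
        apply_closure(u_r, gamma_r, c_r, g_r, nr);
        /* convergence check on c(r) omitted in this sketch */
    }
}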

There are two ways to do the Hankel transform. One is to rewrite the transform as a 1D Fourier transform by changing the coordinate, the so-called logfft. This is very fast, but some funny unphysical oscillations appear in the solution due to the convolutions.
The second way is to do a brute-force integration. This way is important for some problems where we need to conserve the energy.

So even if there is a very fast way to do the Hankel transform, there are still some problems where one has to do this ‘slow’ transform. I would be happy to share the code once I have optimized it. This is a rather common operation and a 300x speed-up is already a lot.

Cristian

I am not familiar with the details of the Bessel function implementation, but from what I understand the accuracy provided by the CUDA implementation is competitive with implementations of j0() and j1() found in other libraries. The CUDA documentation typically states the worst-case error found during testing; the average error is likely quite a bit smaller.

As I understand it, due to the oscillating nature of these functions relative accuracy deteriorates around the zeroes of the functions, and absolute accuracy diminishes as the magnitude of the argument increases. There is no easy solution to this problem as there is for trigonometric functions. A common approach is to factor out a certain number of zeroes to keep results reasonably accurate for smaller arguments, but this technique has scaling issues.

Of course it wouldn’t hurt to have a high-precision (that is, more than double precision) reference function available against which you can check your final results computed with CUDA. Independent of the issue of Bessel function accuracy, the use of single precision may be an issue simply because summing many single-precision numbers does not leave enough “good bits”. An accurate reference function will allow measuring the overall error for both the SP and DP implementations.
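As an example of the second point, even just accumulating a CPU reference in long double helps separate summation error from Bessel-function error (a sketch only; for a genuinely higher-precision reference you would want a multi-precision library, and the formula below simply mirrors the posted kernel):

#include <math.h>

/* Reference for one output element: Bessel terms in double via j0/j1 from
   math.h, accumulation in long double; compare the result against out[i]
   from the GPU kernel.                                                     */
long double ht_reference_element(const double *in, const double *mone,
                                 const double *mtwo, const double *bessolutions,
                                 int i, int nr)
{
    double maxbessol = bessolutions[nr];
    long double aa = 0.0L;
    for (int j = 0; j < nr; j++) {
        double hm = (2.0/maxbessol) * j0(bessolutions[j]*bessolutions[i]/maxbessol)
                    / (fabs(j1(bessolutions[j])) * fabs(j1(bessolutions[i])));
        aa += (long double)hm * in[j] * mtwo[i] / mone[j];
    }
    return aa;
}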

You did not mention what CUDA version you are using. Based on the improvements to math library and tool chain for CUDA 6.5 I would strongly suggest giving that a try, if you are not already using it.

While the kernel looks compute bound, you may also want to try optimizing the memory accesses in the code by declaring the pointer arguments to the kernel as restrict and const restrict.

It will be interesting to learn what additional speedup you were able to achieve once optimization is complete.

Hello,

Thank you for your replies. I am using CUDA 5.5 on both my laptop and my desktop computer.

My original implementation was a port of a CPU (MATLAB/Fortran/C) code which was written as a matrix-vector multiplication. When I increased the system size I had to recalculate the Hankel matrix every iteration, but I failed to notice that I could improve it. After your suggestion I rewrote my equations so that now there is only one Bessel evaluation:

hcij=((2.0)*j0(bessol[jt]*mybessol))*innn[jt]*sh_monetwo[jt];

This does two good things at the same time: better precision (now at the same level as the CPU version) and also a speed-up. By doing this the GPU code got 3.5x faster. Now my GPU code is 1000x faster than my original CPU code, but the last optimization, which gave me the 3.5x improvement, should give a similar improvement on the CPU.

Now I am not really worried about the precision. It should be at the same level as the CPU code.

With this speed-up I do not think it is worth spending more time on optimizations. I am only curious what the effect of restrict and const restrict is. If my kernel definition is like this:

__global__ 
__launch_bounds__(tpb,minnbpsmp) 
void gpu_ht_transf_otpp(double *in,double *out,double *invmonetwo,double2 *monetwo, dat *bessolutions,const int nr)

where should I add the restrict attributes?

I have another question about the code.

The code I just posted has 3 accesses to shared memory with the same index in the same warp. Does this create bank conflicts?

P.S.

Using the j0f function instead of j0 makes the code on the sm_30 card about 9x faster, while on the Titan card the improvement is minimal for the sizes I tried, but the precision loss is quite large. Since all my runs will be on the sm_35 architecture, I am content with the double-precision functions.

Using sm_35 vs. sm_30 on Titan gives a 5-10% decrease in execution time.

Based on a quick perusal of the code snippet above, the function signature with ‘const’ and ‘restrict’ would want to look like so:

void gpu_ht_transf_otpp (const double * __restrict__ in, 
                         double * __restrict__ out, 
                         const double * __restrict__ mone,
                         const double * __restrict__ mtwo,
                         const dat * __restrict__ bessolutions, 
                         const int nr)

restrict is really a contract between the programmer and the compiler, in that the programmer asserts that each pointer points to a separate, non-overlapping data object. This allows the compiler greater flexibility in the ordering of loads and stores due to the absence of aliasing. The combination of ‘const’ and ‘restrict’ also allows the compiler to generate LDG instead of regular loads on newer architectures that support that instruction.
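Applied to the newer signature you posted (the one with invmonetwo and the double2 monetwo), the same pattern would presumably look like this sketch:

__global__
__launch_bounds__(tpb,minnbpsmp)
void gpu_ht_transf_otpp(const double  * __restrict__ in,
                        double        * __restrict__ out,
                        const double  * __restrict__ invmonetwo,
                        const double2 * __restrict__ monetwo,
                        const dat     * __restrict__ bessolutions,
                        const int nr)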

As to why the code is a bit slower when being compiled for sm_35, I can only speculate. Relative immaturity of the sm_35 code generation in CUDA 5.5 perhaps. Friction losses between launch_bounds and sm_35 code generation are also possible. What happens without the use of launch_bounds? In general, for sm_35 compared to sm_30 the compiler tends to tweak code such that it uses fewer instructions but more registers, because sm_35 devices offer more registers per thread. You may want to play with your thread block organization, based on the register usage the compiler chooses in the absence of launch_bounds, as the compiler heuristic for register use is pretty mature these days.

Now may be a good time to run the code with the profiler to see whether it can pinpoint any non-obvious bottlenecks. The fact that after restructuring of the code there is little difference between the single-precision and double-precision versions on Titan suggests that the kernel may no longer be compute bound.

The -arch=sm_35 build is about 5-10% faster than -arch=sm_30 :). So there is improvement. -Xptxas -v reports 8 bytes of spill, so removing the launch bounds will probably not be beneficial.

Thanks for the information about the restrict attributes. I added restrict to my function and made a quick test for a large system, but there was no difference in the execution time. At this moment there is no more obvious bottleneck, so the profiler is next on the list.

The last test I did was with 921,600 grid points. This is a relatively large system and it is what I am aiming for. On Titan, with this system size, I get 230 s for 2 transforms, with or without restrict (double precision).

I tested the code with different values for the threads per block: 256, 512, and 1024. No change in execution time.

Hello,

I have added the code to GitHub. I hope it works.

Cristian

Hello,

A fast check with the profiler indicated a compute-bound kernel with 98% divergence. I suspect that the problem comes from the Bessel function calculation, over which I have no control.

Last update: switching from CUDA 5.5 to CUDA 6.5 changed the execution time for the biggest size I tried from 720 seconds to 673 seconds. That is about a 7% improvement from doing almost nothing :)