Hello,

I wrote a discrete Hankel transform (Hankel Transform -- from Wolfram MathWorld). Basically is just a direct port of the the formula from here GNU Scientific Library — GSL 2.7 documentation.

The basic algorithm can be summarize as a matrixx-vector multiplication. K=HM*R. The matrix HM is constant, but it takes lots of memory, os I am calcalting each element HM[i][j] on the fly.

At this moment the speed-up compared to single-corecpu version is 300. Is it possible to get more performance?

My implementatio is very similar to the N-body algorithm from gpu gems 2. This is the relevant code:

```
#define tpb 1024
#define bpg 384
#define minnbpsmp 2
__global__
__launch_bounds__(tpb,minnbpsmp)
void gpu_ht_transf_otpp(double *in,double *out,double *mone,double *mtwo, dat *bessolutions,const int nr)
{
int i = blockIdx.x*tpb+threadIdx.x;
double aa;
aa=0.0;
int j;
double hcij;
__shared__ dat bessol[tpb];
__shared__ double innn[tpb];
__shared__ double moneee[tpb];
double mybessol,mymtwo,maxbessol;
//if(i<nr)
{
mybessol=bessolutions[i];
mymtwo=mtwo[i];
}
maxbessol=bessolutions[nr];
for(int p=0;p<bpg;p++)
{
j=p*tpb+threadIdx.x;
bessol[threadIdx.x]=bessolutions[j];
innn[threadIdx.x]=in[j];
moneee[threadIdx.x]=mone[j];
__syncthreads();
//#pragma unroll 10
for(int jt=0;jt<tpb; jt++)
{
j=p*tpb+jt;
hcij=((2.0/maxbessol)*j0(bessol[jt]*mybessol/maxbessol))
/(fabs(j1(bessol[jt]))*fabs(j1(mybessol)))*innn[jt]*mymtwo/moneee[jt];
aa+=hcij;
}
__syncthreads();
}
out[i]=aa;
}
// nr=bpg*tpb
gpu_ht_transf_otpp < < < bpg,tpb > > >(d_rfhat,d_kfhat,mvectors.mone.device,mvectors.mtwo.device, mvectors.bessolutions.device,nr);
```

I attached also the code which can be compiled and ran. I does a simple test with 2 transform. The cpu test is commented because is very slow.