cufft questions

WimpyOne · March 20, 2012, 12:06am

What is the most appropriate place to ask questions regarding cufft?

I have been pulling out my hair on some questions that might be extremely simple - “cufft for dummies” type of manual. This has to be user error on my part. I have read thru the docs and I am trying to follow thru the convolution2D example. The theory seems understandable, but I am having some trouble regarding a few items:

the order to populated the convolution kernel and the image data - row major, col major? I have gotten some of the expected values for the resulting pixels, but in the wrong position/order in the resulting array brought back from the GPU.
the amount of padding required in some circumstances, and the best way to avoid “wrap around” effect
the appropriate way to control the offsets of the kernel to position the convolution kernel in the desired place in the image data
suggestions regarding techniques for large dimensions ( possibly greater than 10,000 x 10,000 )
any guidance for how to create a CPU only version that still does FFT rather than “brute force”. It sounded like cufft is based on the FFTW libraries found at FFTW.org. How easy would it be to prepare the data and convolution kernel once and run it thru both cufft and fftw?

pasoleatis · March 20, 2012, 1:44pm

The size is limited by the memory. Here is a code which does a convolution for real matrix , but I have few comments. I use in-place transforms. I can answer to your questions:

#include <cstdio>

#include <cmath>

#include <iostream>

#include "error_checks.h" // Macros CUDA_CHECK and CHECK_ERROR_MSG

#include <cuda.h>

#include <cuda_runtime.h>

#include <cufft.h>

__global__ void kexcess(cufftDoubleComplex *A,const double *B, int totsize_inspa,int totsize,

const double ddB,const double BBx)

{

//    int ind = blockIdx.x * blockDim.x + threadIdx.x;

    int ibl=blockIdx.y+blockIdx.x*gridDim.y;

    int ind=threadIdx.x+blockDim.x*ibl;

    double ccc,qq;

    if(ind<totsize_inspa)

    {

    qq=B[ind];

    ccc=(ddB+BBx+BBx*(2.0*qq+qq*qq))/((double) totsize);

    A[ind].x=A[ind].x*ccc; 

    A[ind].y=A[ind].y*ccc;

}

}

__global__ void kupdt(cufftDoubleComplex *A,cufftDoubleComplex *B,const double *C,

int totsize_inspa,int totsize,const double ddB,const double BBx,double ddt)

{    

//    int ind = blockIdx.x * blockDim.x + threadIdx.x;

    int ibl=blockIdx.y+blockIdx.x*gridDim.y;

    int ind=threadIdx.x+blockDim.x*ibl;

    double qq,fff,ccc;

    if(ind<totsize_inspa)

    {

    qq=C[ind];

    ccc=ddB+BBx+BBx*(2.0*qq+qq*qq);

    fff=(1.0/(1.0-ddt*ccc*qq))/((double) totsize);

    A[ind].x=A[ind].x*fff+B[ind].x*qq*ddt*fff; 

    A[ind].y=A[ind].y*fff+B[ind].y*qq*ddt*fff;

    }   

}    

__global__ void totene(cufftDoubleReal *A,cufftDoubleReal *C, int totsize_pad,

double VVnot,cufftDoubleReal *P)

{

//    int ind = blockIdx.x * blockDim.x + threadIdx.x;

    int ibl=blockIdx.y+blockIdx.x*gridDim.y;

    int ind=threadIdx.x+blockDim.x*ibl;

    double ppp;

    if(ind<totsize_pad)

   {

    ppp=C[ind];

    A[ind]=0.5*A[ind]*ppp+ppp*ppp*ppp*ppp/12.0-ppp*ppp*ppp/6.0+VVnot*P[ind]*ppp;

    }

}

__global__ void nonlinterm(double *A,double *C, int totsize_pad,

double VVnot,cufftDoubleReal *P)

{

//    int ind = blockIdx.x * blockDim.x + threadIdx.x;

    int ibl=blockIdx.y+blockIdx.x*gridDim.y;

    int ind=threadIdx.x+blockDim.x*ibl;

    double ppp;

    if(ind<totsize_pad)

   {

    ppp=C[ind];

    A[ind]=ppp*ppp*ppp/3.0-ppp*ppp/2.0+VVnot*P[ind]; 

    }

}

__host__ double energy(cufftDoubleReal *dbbff,cufftDoubleReal *dppsi, double *ddqq,

cufftDoubleReal *hbbff,cufftDoubleReal *hppsi,

int llx,int lly,int totsize,int totsize_pad,int totsize_invspa,

double ddB,double BBx,cufftHandle pprc,cufftHandle ppcr,dim3 ggrid, dim3 tthreads,

double VVnot,cufftDoubleReal *dppin)

{

CUDA_CHECK( cudaMemcpy(dbbff, dppsi, sizeof(double)*totsize_pad,cudaMemcpyDeviceToDevice) );

    cudaThreadSynchronize();

    cufftExecD2Z(pprc,dbbff,(cufftDoubleComplex*)dbbff);

    cudaThreadSynchronize();    

    kexcess<<<ggrid,tthreads>>>((cufftDoubleComplex*)dbbff,ddqq, totsize_invspa,totsize,ddB,BBx);

    cufftExecZ2D(ppcr,(cufftDoubleComplex*)dbbff, dbbff);

    totene<<<ggrid,tthreads>>>(dbbff,dppsi, totsize_pad,VVnot,dppin); 

    cudaThreadSynchronize();

    CUDA_CHECK( cudaMemcpy(hbbff, dbbff, sizeof(double)*totsize_pad,cudaMemcpyDeviceToHost) );  

    cudaThreadSynchronize();

    CUDA_CHECK( cudaMemcpy(hppsi, dppsi, sizeof(double)*totsize_pad,cudaMemcpyDeviceToHost) ); 

double enne=0;

   double  ppmm=0; 

   int count=0;   

for(int i=0;i<llx;i++)

    { 

    for(int j=0;j<2*(lly/2+1);j++)

    { 

    if(j<lly)

    {

enne=enne+hbbff[count];

    ppmm=ppmm+hppsi[count];

    }

    count=count+1;  

    }}  

//    printf(" %66.60lf %66.60lf\n", enne/(totsize), ppmm/(totsize)); 

//    printf("Average density is %66.60lf\n", ppmm/(totsize)); 

   return enne/(double) totsize;

}

__host__ void update(cufftDoubleReal *dbbff,cufftDoubleReal *dppsi, double *ddqq,

int llx,int lly,int totsize,int totsize_pad,int totsize_invspa,

double ddB,double BBx,double ddt,

cufftHandle pprc,cufftHandle ppcr,dim3 ggrid, dim3 tthreads,

double VVnot,cufftDoubleReal *dppin)

{ 

    CUDA_CHECK( cudaMemcpy(dbbff, dppsi, sizeof(double)*totsize_pad,cudaMemcpyDeviceToDevice) );

    cudaThreadSynchronize(); 

    nonlinterm<<<ggrid,tthreads>>>(dbbff,dppsi, totsize_pad,VVnot,dppin); 

    cudaThreadSynchronize();

    cufftExecD2Z(pprc,dbbff,(cufftDoubleComplex*)dbbff); 

    cudaThreadSynchronize();

    cufftExecD2Z(pprc,dppsi,(cufftDoubleComplex*)dppsi);

    cudaThreadSynchronize();

    kupdt<<<ggrid,tthreads>>>((cufftDoubleComplex*)dppsi,(cufftDoubleComplex*)dbbff,

    ddqq,totsize_invspa,totsize,ddB,BBx,ddt);

    cudaThreadSynchronize();

    cufftExecZ2D(ppcr,(cufftDoubleComplex*)dppsi,dppsi);

    cudaThreadSynchronize();    

}    

int main(void)

{

   cudaSetDevice(1);

   size_t free, total;

   printf("\n");

   cudaMemGetInfo(&free,&total);   

   printf("%d KB free of total %d KB at the beginning\n",free/1024,total/1024);

const int lcellx=16;

   const int ncellx=128; //384; //,ncelly=ncellx,ncellz=ncellx;

   const int lx=ncellx*lcellx,ly=lx;

   const int totsize=lx*ly,totsize_pad=lx*2*(ly/2+1),totsize_invspa=lx*(ly/2+1);

   const double Pi=acos(-1.0);

   const double dt=0.5;

   const double Vmax=0.013,Vmin=0.0;

   const double at=1.0023048*7.255197456936871; //7.255197456936871;

   const double alpha=0.945; //60.0/64.0; // (double)362/(double)384;

   const double ap=at/alpha;

   const double qt=2*acos(-1.0)/at, qp=2*Pi/ap;

   const double dx=ap/((double)lcellx),dy=dx*sqrt(3.0)/2.0;

//   const double dx=at/((double)lcellx),dy=dx*sqrt(3.0)/2.0;

   const double Bx=1.0,delB=0.02;

   const double radi=30.0;

   const int nend=3000,nmes=500;

   const int nsteps=60;

   int ret;

   static double hppp[lx][ly];

   double pos[ncellx*ncellx*4][2];

   char conf[11],part[11],struc[10];

   FILE * pFile;

   double tmpp;

   int iip,iim,jjp,jjm;

   int cpart;

   const double g1x=(2*Pi/ap);

   const double g1y=-(2*Pi/ap)/sqrt(3);

   const double g2x=-(2*Pi/ap);

   const double g2y=-(2*Pi/ap)/sqrt(3);

   const double g3x=0;

   const double g3y=(2*Pi/ap)/(sqrt(3)/2);

double kx,ky;

   double ene[nmes+1],dpp[nmes+1],ppmm,eeii;  

   clock_t start, end;

   double cpu_time_used;

   double Vnot;

static cufftDoubleReal hbff[totsize_pad],hpsi[totsize_pad],hpin[totsize_pad];

   static cufftDoubleComplex hkpsi[lx][ly/2+1];

   static double hqq[totsize_invspa]; 

   double *dqq;

   cufftDoubleReal *dpsi,*dbff,*dpin;

   cufftHandle prc,pcr;   

   start = clock();

printf("\n");

   printf(" %d KB memory which will be allocated direct by the user on the device. lx= %d ly= %d\n",

   (3*totsize_pad+totsize_invspa)*sizeof(double)/1024,lx,ly); 

   printf("\n");

   cudaMemGetInfo(&free,&total);  

   printf("%d KB free of total %d KB before the cufft plans are made .\n",free/1024,total/1024);

cufftPlan2d(&prc,lx,ly,CUFFT_D2Z);

   cufftPlan2d(&pcr,lx,ly,CUFFT_Z2D);  

   printf("\n");

   cudaMemGetInfo(&free,&total);  

   printf("%d KB free of total %d KB after the cufft plans are made.\n",free/1024,total/1024);

dim3 grid,threads;

threads.x=1024;

   threads.y=1;

   threads.z=1;

   grid.x=(int)ceil((double)lx/sqrt((double)threads.x)); 

   grid.y=(int)ceil((double)2*(ly/2+1)/sqrt((double)threads.x));

   grid.z=1;

printf("\n");

   printf("%d %d %d %d\n",grid.x,grid.y,threads.x*grid.x*grid.y,totsize_pad);

   printf("\n");

   cudaMemGetInfo(&free,&total); 

   printf("%d KB free of total %d KB before allocations.\n",free/1024,total/1024);

CUDA_CHECK( cudaMalloc((void**)&dpsi, sizeof(cufftDoubleReal)*totsize_pad) );

   CUDA_CHECK( cudaMalloc((void**)&dpin, sizeof(cufftDoubleReal)*totsize_pad) );    

   CUDA_CHECK( cudaMalloc((void**)&dbff, sizeof(cufftDoubleReal)*totsize_pad) );  

   CUDA_CHECK( cudaMalloc((void**)&dqq, sizeof(double)*totsize_invspa) );  

printf("\n");

   cudaMemGetInfo(&free,&total); 

   printf("%d KB free of total %d KB after allocations.\n",free/1024,total/1024);

int count;

    count=0;

    ppmm=0; 

    for(int i=0;i<lx;i++)

    { 

    for(int j=0;j<2*(ly/2+1);j++)   

    {

    if(j<ly)

    { 

     hpin[count]=2*(cos(g1x*i*dx+g1y*j*dy)+cos(g2x*i*dx+g2y*j*dy)+cos(g3x*i*dx+g3y*j*dy));

//    hpin[count]=4*(cos(qp*i*dx)*cos(qp*j*dy/sqrt(3.0))+0.5*cos(2*qp*j*dy/sqrt(3.0)));

    }

    else

    {

    hpsi[count]=0;

    }

    count=count+1;  

    }}  

//    printf("%66.60lf\n", ppmm/(double)totsize);

count=0;

    for(int i=0;i<lx;i++)

    {   

    if(i<=lx/2)

    {kx=i*2*Pi/((double)lx*dx);}

    else   

    {kx=(i-lx)*2*Pi/((double)lx*dx);}   

for(int j=0;j<=ly/2;j++)

    {

    ky=j*2*Pi/((double)ly*dy);

hqq[count]=-(kx*kx+ky*ky);

count=count+1;

    }}

CUDA_CHECK( cudaMemcpy(dqq, hqq, sizeof(double)*totsize_invspa,cudaMemcpyHostToDevice) );

    CUDA_CHECK( cudaMemcpy(dpin, hpin, sizeof(double)*totsize_pad,cudaMemcpyHostToDevice) );

double ReA;

    double ImA;

    count=0;

    ppmm=0; 

    for(int i=0;i<lx;i++)

    { 

    for(int j=0;j<2*(ly/2+1);j++)   

    {

    if(j<ly)

    { 

if(pow((i-lx/2)*dx,2)+pow((j-ly/2)*dy,2)<=at*at*radi*radi)

    {

    ReA=0.2067*cos(2*Pi/3); // 0.185558839341884 //15 //0.221536478409509

    ImA=0.2067*sin(2*Pi/3);

    }

    else

    {

    ReA=0.2067*cos(2*Pi/3);

    ImA=0.2067*sin(-2*Pi/3);

    }

//    hpsi[count]=-hpin[count];

    hpsi[count]=2*(ReA*(cos(g1x*i*dx+g1y*j*dy)+cos(g2x*i*dx+g2y*j*dy)+cos(g3x*i*dx+g3y*j*dy))

    -ImA*(sin(g1x*i*dx+g1y*j*dy)+sin(g2x*i*dx+g2y*j*dy)+sin(g3x*i*dx+g3y*j*dy)));

ppmm=ppmm+hpsi[count];

    }

    else

    {

    hpsi[count]=0;

    }

    count=count+1;  

    }} 

    ppmm=ppmm/(double)totsize; 

    printf("%66.60lf\n",ppmm);

count=0;

    for(int i=0;i<lx;i++)

    { 

    for(int j=0;j<2*(ly/2+1);j++)   

    {

    if(j<ly)

    { 

hpsi[count]=hpsi[count]-ppmm;

}

    else

    {

    hpsi[count]=0;

    }

    count=count+1;  

    }} 

pFile=fopen("initpsione.txt","w");    

    count=0;

    for(int i=0;i<lx;i++)

    {

    for(int j=0;j<2*(ly/2+1);j++)

    {

    if(j<ly)

    {

    fprintf(pFile,"%61.56lf\n",hpsi[count]);

    }

    count=count+1;

    }}

    fclose(pFile);  

    CUDA_CHECK( cudaMemcpy(dpsi, hpsi, sizeof(double)*totsize_pad,cudaMemcpyHostToDevice) ); 

ret=sprintf(conf,"psi000.txt");

    pFile=fopen(conf,"w"); 

    count=0;

    for(int i=0;i<lx;i++)

    {

    for(int j=0;j<2*(ly/2+1);j++)

    {

    if(j<ly)

    {

    fprintf(pFile,"%61.56lf\n",hpsi[count]);

    }

    count=count+1;

    }}

    fclose(pFile); 

Vnot=Vmax;

    eeii=energy(dbff,dpsi,dqq,hbff,hpsi,lx,ly,totsize,totsize_pad,totsize_invspa,delB,Bx,

    prc,pcr,grid,threads,Vnot,dpin);

printf("At starting  energy is %66.60lf\n",eeii);

ene[0]=energy(dbff,dpsi,dqq,hbff,hpsi,lx,ly,totsize,totsize_pad,totsize_invspa,delB,Bx,

   prc,pcr,grid,threads,Vnot,dpin);

   pFile=fopen("enespeed.txt","a");

   fprintf(pFile,"%66.60lf %66.60lf %66.60lf\n",Vnot,ene[0],dpp[0]); 

   fclose(pFile);  

for(int sss=1;sss<=nmes;sss++)

    {

//    Vnot=Vmin+(double)sss*(Vmax-Vmin)/(double)nsteps;

    for(int n=1;n<=nend;n++)

    {

/*    

    eeii=energy(dbff,dpsi,dqq,hbff,hpsi,lx,ly,totsize,totsize_pad,totsize_invspa,delB,Bx,

    prc,pcr,grid,threads,Vnot,dpin);

printf("Before starting %d energy is %66.60lf\n",sss,eeii); 

*/

// Start update

   update(dbff,dpsi,dqq,lx,ly,totsize,totsize_pad,totsize_invspa,delB,Bx,dt,prc,pcr,grid,threads,

   Vnot,dpin);   

// End update

    }

CUDA_CHECK( cudaMemcpy(hpsi, dpsi, sizeof(double)*totsize_pad,cudaMemcpyDeviceToHost) ); 

dpp[sss]=0;

/*

    for(int n=1;n<=nav;n++)

    {  

    count=0;

    for(int i=0;i<lx;i++)

    {

    for(int j=0;j<2*(ly/2+1);j++)

    {

    if(j<ly)

    {

    hbff[count]=hpsi[count];

    }

    count=count+1;

    }}

update(dbff,dpsi,dqq,lx,ly,totsize,totsize_pad,totsize_invspa,delB,Bx,dt,prc,pcr,grid,threads,

   Vnot,dpin);

cudaThreadSynchronize();

    CUDA_CHECK( cudaMemcpy(hpsi, dpsi, sizeof(double)*totsize_pad,cudaMemcpyDeviceToHost) );     

    cudaThreadSynchronize();

count=0;

    for(int i=0;i<lx;i++)

    {

    for(int j=0;j<2*(ly/2+1);j++)

    {

    if(j<ly)

    {

    dpp[sss]=dpp[sss]+fabs(hbff[count]-hpsi[count]);

    }

    count=count+1;

    }}

}

*/    

// Start of energy function    

   ene[sss]=energy(dbff,dpsi,dqq,hbff,hpsi,lx,ly,totsize,totsize_pad,totsize_invspa,delB,Bx,

   prc,pcr,grid,threads,Vnot,dpin);

//   dpp[sss]=dpp[sss]/(dt*(double)nav*(double)totsize);

   pFile=fopen("enespeed.txt","a");

   fprintf(pFile,"%66.60lf %66.60lf %66.60lf\n",Vnot,ene[sss],dpp[sss]); 

   fclose(pFile);    

// End energy calculation 

//    printf("%d \n",sss); 

    if(sss<10)

    {

    ret=sprintf(part,"ppp00%d.txt",sss);

    ret=sprintf(conf,"psi00%d.txt",sss);

    ret=sprintf(struc,"sk00%d.txt",sss);

    }

    else if(sss<100)

    {

    ret=sprintf(part,"ppp0%d.txt",sss);

    ret=sprintf(conf,"psi0%d.txt",sss);

    ret=sprintf(struc,"sk0%d.txt",sss);

    }

    else if(sss<1000)

    {

    ret=sprintf(part,"ppp%d.txt",sss);

    ret=sprintf(conf,"psi%d.txt",sss);

    ret=sprintf(struc,"sk%d.txt",sss);

    } 

//    printf("%s %s\n", conf,struc); 

    pFile=fopen(conf,"w"); 

    count=0;

    for(int i=0;i<lx;i++)

    {

    for(int j=0;j<2*(ly/2+1);j++)

    {

    if(j<ly)

    {

    fprintf(pFile,"%61.56lf\n",hpsi[count]);

    hppp[i][j]=hpsi[count];

    }

    count=count+1;

    }}

    fclose(pFile);

cpart=0; 

    pFile=fopen(part,"w"); 

    for(int i=0;i<lx;i++)

    {

        iip=i+1;

        if(iip<0)

        {

            iip=iip+lx;

        }

        else if(iip>=lx)

        {

            iip=iip-lx;

        }

        iim=i-1;

        if(iim<0)

        {

            iim=iim+lx;

        }

        else if(iip>=lx)

        {

            iim=iim-lx;

        }    

for(int j=0;j<ly;j++)

    {

            jjp=j+1;

            if(jjp<0)

            {

                jjp=jjp+lx;

            }

            else if(jjp>=ly)

            {

                jjp=jjp-ly;

            }

            jjm=j-1;

            if(jjm<0)

            {

                jjm=jjm+lx;

            }

            else if(jjm>=ly)

            {

                jjm=jjm-ly;

            }

if(hppp[i][j]>=hppp[iip][j] & hppp[i][j]>=hppp[iim][j] & 

    hppp[i][j]>=hppp[i][jjp] & hppp[i][j]>=hppp[i][jjm]& 

    hppp[i][j]>=hppp[iim][jjm] & hppp[i][j]>=hppp[iim][jjp] & 

    hppp[i][j]>=hppp[iip][jjm] & hppp[i][j]>=hppp[iip][jjp] )

    {

                pos[cpart][0]=i*dx;

                pos[cpart][1]=j*dy;

                cpart=cpart+1;

                fprintf(pFile,"%66.60lf %66.60lf\n",i*dx,j*dy);

}

    }

    }

    fclose(pFile); 

    printf("%d %d\n",sss,cpart);     

CUDA_CHECK( cudaMemcpy(dbff, dpsi, sizeof(double)*totsize_pad,cudaMemcpyDeviceToDevice) );

    cudaThreadSynchronize();

    cufftExecD2Z(prc,dbff,(cufftDoubleComplex*)dbff); 

    cudaThreadSynchronize();

    CUDA_CHECK( cudaMemcpy(hkpsi,(cufftDoubleComplex*)dbff, sizeof(cufftDoubleComplex)*totsize_invspa,cudaMemcpyDeviceToHost) );

    /*

    pFile=fopen(struc,"w");    

    count=0;

    for(int i=0;i<lx;i++)

    {

    for(int j=0;j<(ly/2+1);j++)

    {

    fprintf(pFile,"%61.56lf\n",(hkpsi[i][j].x*hkpsi[i][j].x+hkpsi[i][j].y*hkpsi[i][j].y)/((double)totsize*(double)totsize));

count=count+1;

    }}

    fclose(pFile); 

    */

//    printf("%s %s %66.60lf \n", conf,struc);

    printf("%s %s %66.60lf \n", conf,struc,ene[sss]); 

//    printf(" %66.60lf %66.60lf\n", ene[sss],dpp[sss]); 

    }

CUDA_CHECK( cudaFree((void*)dpsi) );

    CUDA_CHECK( cudaFree((void*)dbff) );

    CUDA_CHECK( cudaFree((void*)dqq) );

    cufftDestroy(prc);

    cufftDestroy(pcr);

end = clock();

    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

    printf("%66.60lf \n",cpu_time_used);

printf("\n");

   cudaMemGetInfo(&free,&total); 

   printf("%d KB free of total %d KB at the end.\n",free/1024,total/1024);

/*   

   pFile=fopen("enespeed.txt","w");

   for(int sss=0;sss<=nsteps;sss++)

   {

    Vnot=Vmin+(double)sss*(Vmax-Vmin)/(double)nsteps;

   fprintf(pFile,"%66.60lf %66.60lf %66.60lf\n",Vnot,ene[sss],dpp[sss]/(dt*(double)nav*(double)totsize));

   } 

    fclose(pFile); 

*/    

    return 0;

}

So for inplace transforms: you have

const int lx=ncellx*lcellx,ly=lx;

   const int totsize=lx*ly,totsize_pad=lx*2*(ly/2+1),totsize_invspa=lx*(ly/2+1);

static cufftDoubleReal hbff[totsize_pad],hpsi[totsize_pad],hpin[totsize_pad];

   static cufftDoubleComplex hkpsi[lx][ly/2+1];

   static double hqq[totsize_invspa]; 

   double *dqq;

   cufftDoubleReal *dpsi,*dbff,*dpin;

   cufftHandle prc,pcr;

The allocation is done like this on the gpu:

CUDA_CHECK( cudaMalloc((void**)&dpsi, sizeof(cufftDoubleReal)*totsize_pad) );

   CUDA_CHECK( cudaMalloc((void**)&dpin, sizeof(cufftDoubleReal)*totsize_pad) );    

   CUDA_CHECK( cudaMalloc((void**)&dbff, sizeof(cufftDoubleReal)*totsize_pad) );  

   CUDA_CHECK( cudaMalloc((void**)&dqq, sizeof(double)*totsize_invspa) );

Here is an example to initilize correct with padding:

double ReA;

    double ImA;

    count=0;

    ppmm=0; 

    for(int i=0;i<lx;i++)

    { 

    for(int j=0;j<2*(ly/2+1);j++)   

    {

    if(j<ly)

    { 

if(pow((i-lx/2)*dx,2)+pow((j-ly/2)*dy,2)<=at*at*radi*radi)

    {

    ReA=0.2067*cos(2*Pi/3); // 0.185558839341884 //15 //0.221536478409509

    ImA=0.2067*sin(2*Pi/3);

    }

    else

    {

    ReA=0.2067*cos(2*Pi/3);

    ImA=0.2067*sin(-2*Pi/3);

    }

//    hpsi[count]=-hpin[count];

    hpsi[count]=2*(ReA*(cos(g1x*i*dx+g1y*j*dy)+cos(g2x*i*dx+g2y*j*dy)+cos(g3x*i*dx+g3y*j*dy))

    -ImA*(sin(g1x*i*dx+g1y*j*dy)+sin(g2x*i*dx+g2y*j*dy)+sin(g3x*i*dx+g3y*j*dy)));

ppmm=ppmm+hpsi[count];

    }

    else

    {

    hpsi[count]=0;

    }

    count=count+1;  

    }} 

    ppmm=ppmm/(double)totsize;

and Here is an sxample how to prepare the k filter:

count=0;

    for(int i=0;i<lx;i++)

    {   

    if(i<=lx/2)

    {kx=i*2*Pi/((double)lx*dx);}

    else   

    {kx=(i-lx)*2*Pi/((double)lx*dx);}   

for(int j=0;j<=ly/2;j++)

    {

    ky=j*2*Pi/((double)ly*dy);

hqq[count]=-(kx*kx+ky*ky);

count=count+1;

    }}

jpmig313 · March 22, 2012, 2:27pm

I am trying to use cufft libraries to do a 1D fft along the 2nd dimension of a 4D matrix.

but I am facing a problem when creating a plan for this. I tried with cufftPlan1d and cufftPlanMany routines.

I would like to create it be something like this as given in matlab:

A = fft(B,,2);

where B is a 4D matrix and I would like to do it in 2nd dimension, and fft is a 1D fft.

I think the parameter I am passing are wrong for cufftPlan1d and cufftPlanMany routines:

cufftPlan1d(&plan, N, CUFFT_C2C, MKL); where the dimension is MNK*L

and cufftPlanMany(&plan, Nrank, ndi, NULL, 1, MKL, NULL, 1, MKL, CUFFT_C2C, MKL) ; where Nrank is 1, ndi has N.

Can anyone please let me know how to pass these parameters.

Thanks in advance.

pasoleatis · March 22, 2012, 2:48pm

Hello,

You can define your matrix as arrays of pointers. For 2 Darray I did somehitng like this: double *host_arr[Ny],*dev_arr[Ny];, then allocate the line by line

for (int is = 0; is < N; is++)

    {

    cudaMalloc(&dev_arr[is],sizeof(double)*Nx);

    cudaHostAlloc(&host_arr[is],sizeof(double)*Nx,cudaHostAllocDefault);

    }

No you can use the fft plan with argument just a line of the matrix. It works for 2D arrays, but I never tried for 4D arrays. In that case is maybe something like double ***host_arr[N4] (I think)

jpmig313 · April 2, 2012, 6:02am

Hello,

You can define your matrix as arrays of pointers. For 2 Darray I did somehitng like this: double *host_arr[Ny],*dev_arr[Ny];, then allocate the line by line
for (int is = 0; is < N; is++)

    {

    cudaMalloc(&dev_arr[is],sizeof(double)*Nx);

    cudaHostAlloc(&host_arr[is],sizeof(double)*Nx,cudaHostAllocDefault);

    }
No you can use the fft plan with argument just a line of the matrix. It works for 2D arrays, but I never tried for 4D arrays. In that case is maybe something like double ***host_arr[N4] (I think)

Dear Pasoleatis,

Thanks for your reply. I was able to solve the problem.

Here i sent the 4D matrix transpose as the input and then did FFT on 1st dimension.

I mean i transposed the first 2 dimensions in the 4D matrix. Hence the second dimension

became the first and then I found the fft, after that i transposed the output again.

Hence it worked.

I have one more question: do you know an equivalent of element by element matrix multiplication routine in cuda (I mean BLAS or LAPACK).

so that i can use it as an equivalent in A.*B matrix matrix element by element multiplication, as it is done in Matlab.

I cannot use the cuda kernel, as in Mex Programming cuda kernel cannot be done.

Thanks in advance.

pasoleatis · April 2, 2012, 8:39am

So far I have not seen any fucntoin in BLAS to do that. I wrote my own kernel because it was simple and it was just 1-2 % of the total work.

Topic		Replies	Views
[SOLVED] cuFFT data storage, maybe I'm operating on the wrong elements CUDA Programming and Performance	16	2114	June 11, 2022
cuFFT and fftw CUDA Programming and Performance	10	4145	August 25, 2010
Questions about cuFFT for 3D matrix, arrayFire GPU-Accelerated Libraries	5	1650	October 12, 2021
Cufft 1D transform CUDA Programming and Performance	6	2996	January 8, 2015
cufft question CUDA Programming and Performance	6	8653	March 9, 2009
2D Convolution problem following example from SDK source code included CUDA Programming and Performance	9	11641	June 7, 2011
CUFFT_INTERNAL_ERROR during creation of a 1D Plan in CUFFT GPU-Accelerated Libraries cuda , cufft	11	3728	October 19, 2022
cufftXt batch 1D GPU-Accelerated Libraries	12	2149	October 15, 2019
CUDA_SAFE_CALL CUDA Programming and Performance	1	7971	November 8, 2010
cufftPlan2d fails CUDA Programming and Performance	14	21012	September 17, 2007

cufft questions

Related topics