How to show that cuFFT routines outperform the normal MATLAB fft() in terms of time taken

Hi,
I am new to CUDA programming and am currently working on a project that combines CUDA with MATLAB. In particular, I am trying to develop a mex function that computes the FFT of any input array, and I have succeeded in creating such a mex function using the cuFFT library. The function evaluates the FFT correctly for any input array.
To see the advantage of cuFFT over the normal MATLAB fft(), I ran my cuFFT-enabled mex function over a range of input matrices (from 1000 x 1000 to 10000 x 10000), but each time I observed that my cuFFT-enabled code takes longer than the normal MATLAB fft() (for a 1D FFT). Since I am new to CUDA and GPU computing, I need a setup in which I can actually see the difference in time between the two versions of the code, so that the advantage of the GPU and parallel computing over a normal CPU becomes visible. Could you please suggest how I should implement or measure this cuFFT code to see that difference? Looking forward to your advice. Thank you.

Could someone please help me with the above issue?
Does this mean cuFFT is not fast enough?

  1. The first call of a CUDA mex from MATLAB will be much slower than subsequent calls. Take an average over 100 runs. (This is a MATLAB issue, not a CUDA issue, as there is usually no significant initial overhead when using a standalone application.)

  2. You may have compiled the mex files using an incorrect compute capability or with the -G debug flag. Show your build output. Incorrect compilation will produce poor results.

  3. You do not say which CPU and GPU configuration you are using. If you are using some crappy compute 1.0 GPU and are comparing to an Intel Xeon E5-2687W then that is not a fair comparison.

  4. In general, on a high-end PC with a consumer GPU (GTX 780 Ti) compared to an i7 4.3 GHz CPU, I see about a 40x difference in running times for 32-bit FFT via cuFFT when compared to MATLAB.

  5. Also keep in mind you have to transfer the data to the GPU, work on it, then transfer it back. That time is included, and if you have a slow bus speed then that also may be part of the problem. Post your device-to-host and host-to-device bandwidth test results.
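To make point 1 concrete, here is a minimal sketch of a warm-up-then-average timing loop in plain C++. The helper name `average_ms` and its parameters are my own choices for illustration and are not part of CUDA or MATLAB. When the timed work launches GPU kernels, the callable should end with a synchronization such as cudaDeviceSynchronize(), because kernel launches return before the GPU has finished.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not from CUDA or MATLAB): runs `work` a few times
// untimed, then returns the mean wall-clock time per call in milliseconds.
double average_ms(const std::function<void()>& work, int warmup_runs, int timed_runs) {
    if (timed_runs <= 0) return 0.0;   // nothing to average

    // Untimed calls absorb one-time costs such as context creation.
    for (int i = 0; i < warmup_runs; ++i) work();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_runs; ++i) work();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timed_runs;
}
```

Called as `average_ms(run_fft, 1, 100)`, where `run_fft` stands for whatever invokes the transform, the slow first call is discarded and the remaining 100 calls are averaged.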

Sir,
Thank you very much for your detailed analysis. Here is what I found for each of the possibilities you pointed out.

  1. I took the average over 5-6 runs of the same code, and each time I also saw that the first execution takes a bit longer than the subsequent calls.

  2. I am compiling my mex function using the below command:
    mex compute_fft2d.cu -I/usr/local/cuda-6.0/include -L/usr/local/cuda-6.0/lib64 -lcufft

I didn’t change the compute capability or use any flags. I think I am currently using the compute_10 and sm_10 architectures. If any change of flags or compute capability version is required, could you please tell me the exact compilation command?

  3. CPU config:
    Intel Xeon CPU X5660 @ 2.80 GHz, 12 processors, 6 cores each

    GPU config:
    Nvidia Quadro 6000, 448 CUDA cores, OS: RHEL 5, update 5, Compute Capability 2.0

    Model: HP Z800 workstation

  4. Understood.

  5. Sir, since I am new to CUDA programming, I tried several times but was not able to run the bandwidth test. Could you please tell me the steps to perform the bandwidth test on RHEL 5.5?

Looking forward to your reply.
Thank you

Well, given what you said, the problem is exactly as I thought.

  1. The compile build output is incomplete, but it is clear you are compiling for the lowest possible compute level (which is 1.0). The Quadro 6000 is compute 2.0. Here is an example of an nvcc command line (this one targets sm_35; for the Quadro 6000 you would use -arch=sm_20 instead):
    nvcc -arch=sm_35 -rdc=true hello_world.cu -o hello -lcudadevrt

  2. The Quadro 6000 (which is completely different from the K6000) is not a good compute GPU. To give you an idea of the difference, a $700 GTX 780 Ti has an upper bound on memory bandwidth of about 335 GB/s, while the Quadro 6000 has an upper bound of 144 GB/s. The 780 Ti has an upper bound of ~5000 GFLOPS for 32-bit while the Quadro 6000 has an upper bound of ~1000 GFLOPS for 32-bit.

  3. I cannot walk you through the entire compilation process for your OS, you can Google this.

  4. As I suspected you are comparing a very powerful multi-processor CPU setup to a single GPU which is older and not really designed for heavy computation. If you compare a single high-end CPU to a single high-end GPU you will see a massive difference for most tasks.

  5. The Quadro 6000 uses PCI-e 2.0 for host-device and device-host memory transfers, which (at best) will be about half what you would get with a more recent GPU which uses PCI-e 3.0. And this is assuming that you are able to use the full x16 lanes; if you are not (for example, if the card only gets x8), then the transfers run at about 1/4 the speed of a newer GPU. Since you are operating on large data sets, this makes a huge difference. (MATLAB does not have to do this because it is using the CPUs.)
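As a rough worked example for point 5, the sketch below estimates a lower bound on the one-way copy time for a single-precision complex matrix. It assumes the theoretical x16 peak rates of 8 GB/s for PCI-e 2.0 and about 15.75 GB/s for PCI-e 3.0; measured rates will be lower, so treat the results as best cases. The function names are hypothetical.

```cpp
#include <cstddef>

// Bytes occupied by a rows x cols matrix of single-precision complex values
// (two floats per element, the layout of cufftComplex).
std::size_t complex_matrix_bytes(std::size_t rows, std::size_t cols) {
    return rows * cols * 2 * sizeof(float);
}

// Best-case one-way transfer time in milliseconds for a bus that sustains
// `gigabytes_per_second` (1 GB = 1e9 bytes). Latency and overhead are ignored.
double transfer_ms(std::size_t bytes, double gigabytes_per_second) {
    return static_cast<double>(bytes) / (gigabytes_per_second * 1e9) * 1e3;
}
```

For an 8192 x 8192 matrix (512 MiB) this gives roughly 67 ms per direction at 8 GB/s and roughly 34 ms at 15.75 GB/s, and that cost is paid before and after the FFT itself.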

Most of your questions can be answered here:

http://docs.nvidia.com/cuda/cuda-c-programming-guide#axzz36iZUBkgF

In addition, I would like to add that on your first call to the MEX function you want to take care of all the allocations required by the processing (cuFFT plans, other cudaMalloc(…) calls, etc.), so that later calls can reuse them.

This way you’ll be measuring:

Host2DeviceTransfer time +
FFT compute time +
Device2HostTransfer time

Instead of additional malloc and free time.
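One way to arrange this is to keep the plan and buffers in an object that outlives a single call, since static variables in a MEX file persist for as long as the file stays loaded. The sketch below shows only the caching logic in plain C++; the class name `SizeKeyedCache` is hypothetical, and in a real MEX file the create callback would wrap calls such as cufftPlan2d or cudaMalloc and the destroy callback cufftDestroy or cudaFree.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical helper: keeps one resource alive across calls and rebuilds it
// only when the requested matrix size changes.
template <typename Resource>
class SizeKeyedCache {
public:
    using Create  = std::function<Resource(std::size_t rows, std::size_t cols)>;
    using Destroy = std::function<void(Resource&)>;

    SizeKeyedCache(Create create, Destroy destroy)
        : create_(std::move(create)), destroy_(std::move(destroy)) {}

    ~SizeKeyedCache() { release(); }

    SizeKeyedCache(const SizeKeyedCache&) = delete;
    SizeKeyedCache& operator=(const SizeKeyedCache&) = delete;

    // Returns the cached resource; creates it on first use or on a size change.
    Resource& get(std::size_t rows, std::size_t cols) {
        if (!valid_ || rows != rows_ || cols != cols_) {
            release();
            resource_ = create_(rows, cols);
            rows_ = rows;
            cols_ = cols;
            valid_ = true;
        }
        return resource_;
    }

    // Frees the resource if one is held; safe to call more than once.
    void release() {
        if (valid_) {
            destroy_(resource_);
            valid_ = false;
        }
    }

private:
    Create create_;
    Destroy destroy_;
    Resource resource_{};
    std::size_t rows_ = 0;
    std::size_t cols_ = 0;
    bool valid_ = false;
};
```

This only pays off if the device is not reset on every invocation: cudaDeviceReset() destroys the allocations, so with this pattern the reset would happen once at exit (for example from a mexAtExit handler that first calls release()).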

The safest way to get accurate timing of the FFT would be to run the profiler on it. Here’s how you would do that with MATLAB:

  1. Create an m-file (FFT_Test.m) that calls your mex-file.
  2. Start the NVIDIA Visual Profiler -> start a new session.
  3. Set MATLAB as the executable to run.
  4. In the input arguments, set -nojvm -nosplash -r FFT_Test
  5. Make sure to add a “cudaDeviceReset()” somewhere at the end of your code; this is needed to make sure the profiler collects data smoothly.

In the Visual Profiler you’ll be provided with exact timings for mallocs, frees, data transfers, and the FFT compute kernels.

The FFT call will (usually) show up as several kernel calls, and their names may appear somewhat mangled.

Dear CudaaduC,
Thank you very much for your clear and logical explanation of the poor timings of my cuFFT run. After your earlier reply I had already started to suspect that it was caused by comparing a poor compute GPU with a high-end CPU. Thanks again.

I have now written a pure CUDA C program that computes the FFT using the cuFFT library. When I compare its execution time for an input array of size 8192 x 8192 with the execution time of the MATLAB fft() for a similar input, I get roughly an 11x difference between the two applications: the CUDA C application takes 1.28 sec, while the MATLAB application takes around 13.90 sec.
Is it fair to compare the MATLAB execution time with that of the pure CUDA C application? Is there any other time overhead in MATLAB that is not present in the CUDA C application and that would make the two timings not comparable?
Please enlighten me a little about this.
Thank you.

Dear Jimmy,
Thank you very much for your detailed information on the mex call and especially on the profiler. I had read about it several times while googling my problem but could not find this explained anywhere. Now I understand it completely and will post the exact timings after running this on my system.
Thanks again.

Hi Jimmy,
I tried profiling my mex function the way you described, but a few seconds after it starts running a dialog box pops up which says:

Unable to profile application
"The application being profiled received a signal"

Here is my fft_check2d.m file:

nx=128;
ny=128;
c=1;
x=ones(nx,ny)+1i*ones(nx,ny);
for i=1:nx
    for j=1:ny
        x(i,j)=x(i,j)*c;
        c=c+1;
    end
end
y=compute_fft2d(x);
exit;

where compute_fft2d is my mex function for computing 2d FFT. Below is the code for compute_fft2d:

#include <cstdlib>
#include "cufft.h"
#include "complex"
#include "mex.h"
#include "gpu/mxGPUArray.h"
#define THREADS_PER_2D_BLOCK 512

// Called when MATLAB unloads the MEX file.
void cleanUp()
{
    cudaDeviceReset();
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (MX_GPU_SUCCESS != mxInitGPU())
    {
        mexPrintf("InitGPU failed\n");
    }
    mexAtExit(cleanUp);

    // Getting dimensions of input matrix
    long long m = mxGetM(prhs[0]);
    long long n = mxGetN(prhs[0]);

    plhs[0] = mxCreateDoubleMatrix((mwSize)m, (mwSize)n, mxCOMPLEX);

    // Getting matrices (MATLAB keeps real and imaginary parts separately)
    double* Ar = mxGetPr(prhs[0]);
    double* Ai = mxGetPi(prhs[0]);   // NULL when the input is purely real
    double* Or = mxGetPr(plhs[0]);
    double* Oi = mxGetPi(plhs[0]);

    cufftComplex* d_in;
    cufftComplex* d_out;

    // The transform is single precision (CUFFT_C2C), so every buffer and
    // every copy is sized with sizeof(cufftComplex), not cufftDoubleComplex.
    long long num_el = mxGetNumberOfElements(prhs[0]);
    size_t size_el = num_el * sizeof(cufftComplex);
    if (cudaSuccess != cudaMalloc((void**)&d_in, size_el))
    {
        mexPrintf("malloc of d_in failed\n");
    }
    if (cudaSuccess != cudaMalloc((void**)&d_out, size_el))
    {
        mexPrintf("malloc of d_out failed\n");
    }

    // Pack the split double arrays into interleaved single-precision values.
    cufftComplex* hostPtr = (cufftComplex*)malloc(size_el);
    for (long long i = 0; i < num_el; i++)
    {
        hostPtr[i].x = (float)Ar[i];
        hostPtr[i].y = (Ai != NULL) ? (float)Ai[i] : 0.0f;
    }
    if (cudaSuccess != cudaMemcpy(d_in, hostPtr, size_el, cudaMemcpyHostToDevice))
    {
        mexPrintf("cuda memcpy hostPtr to d_in failed\n");
    }

    // MATLAB stores the m-by-n matrix column-major, which cuFFT sees as a
    // row-major n-by-m array, hence the (n, m) argument order.
    cufftHandle plan;
    cufftResult _cp = cufftPlan2d(&plan, (int)n, (int)m, CUFFT_C2C);
    if (CUFFT_SUCCESS != _cp)
    {
        mexPrintf("CUFFT error: Plan creation failed, error code=%d\n", _cp);
    }
    cufftResult _cr = cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);
    if (CUFFT_SUCCESS != _cr)
    {
        mexPrintf("CUFFT error: Plan execution failed, error code=%d\n", _cr);
    }
    // cudaDeviceSynchronize replaces the deprecated cudaThreadSynchronize.
    if (cudaSuccess != cudaDeviceSynchronize())
    {
        mexPrintf("Cuda error: Failed to synchronize\n");
    }
    cufftDestroy(plan);

    // Copy back exactly as many bytes as hostPtr can hold.
    if (cudaSuccess != cudaMemcpy(hostPtr, d_out, size_el, cudaMemcpyDeviceToHost))
    {
        mexPrintf("cuda memcpy d_out to hostPtr failed\n");
    }
    for (long long i = 0; i < num_el; i++)
    {
        Or[i] = hostPtr[i].x;
        Oi[i] = hostPtr[i].y;
    }

    free(hostPtr);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaDeviceReset();   // lets the profiler flush its data
    return;
}

I googled this error and tried some changes, such as putting quit or exit at the end of the .m file, but the error still persists. Please help me with the changes needed to get rid of it.
Thank you.

Yes, I forgot to mention that there should be an exit at the end.

I’ve experienced problems similar to the ones you described.

Could you

a) Verify that MATLAB launches and runs by, for example, calling

matlab.exe -nojvm -nosplash -r FFT_Test

from the command line. Perhaps add some printouts or other things to make sure the application does execute and shut down.

b) Experiment with some additional flags:

-This worked for some people:

matlab -nojvm -nodesktop -wait -r FFT_Test

Thanks
J

This works (it is basically the same as above):

http://www.orangeowlsolutions.com/archives/570

Re: the MATLAB running time comparison -> Yes, the same code accessed via a mex in MATLAB will take slightly longer than if it had been a standard compiled app.

Having said that, I worked with a rather complex real-time MATLAB application which called my CUDA mex functions repeatedly without a hitch over long periods of time. This was mostly cuBLAS and cuSPARSE related code, but the total mex running time was just under 5 ms (including a moderate-sized memory transfer in both directions) for each call, while the optimized equivalent MATLAB code (which did engage the multi-core capabilities of the CPU setup) took about 60 ms for each call.

Dear Jimmy and CudaaduC,

Thank you for sharing the documents and your valuable knowledge with me. It now works perfectly: I am able to run the profiler on my mex function and get separate timings for the different calls. I noted down the timings for the different calls and executions, and their sum comes to around 332 msec. The profiler also points out several aspects in which cuFFT is not able to utilize the GPU fully. Many things are marked as low, such as low compute/memcpy efficiency, low multiprocessor occupancy, and low compute utilization.

I have also tried to measure the time for the MATLAB fft() using the following changes in my .m file:

nx=8192;
ny=8192;
c=1;
x=ones(nx,ny)+1i*ones(nx,ny);
for i=1:nx
    for j=1:ny
        x(i,j)=x(i,j)*c;
        c=c+1;
    end
end
tic
y=fft(x);
t1=toc;
fprintf('Time taken:%f',t1);

and this time is around 256-258 msec.

Is there any way to increase the GPU utilization and get a lower computation time than the MATLAB version, or can the GPU computation time not be improved further, so that this is the minimum time I can get with this GPU compared to my CPU?

If you could give me your e-mail address, I would like to send you the Visual Profiler output so that you can take a look at it and give me some concrete ideas.

Eagerly waiting for your advice.
Thanks.

I have noticed that cuFFT calls, when profiled, are not always running optimally. Nvprof is rather brutal in its reports, but it is still a very good tool. The code which seems to profile best is anything in the area of large brute-force algorithms.

You could write your own FFT implementation with optimizations made with your problem set in mind. Then you could make more direct changes based on the output of nvprof.

If there is any way to use the Thrust library for your problems, that is a rather good library. The sort is scorchingly fast.
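If you do go down the road of writing your own FFT, a small CPU reference is a useful starting point, mainly to check the output of each GPU version as you optimize it. Below is a minimal sketch of a textbook iterative radix-2 Cooley-Tukey transform in plain C++. It is not how cuFFT works internally, it assumes the length is a power of two, and the function name `fft_radix2` is made up for this example.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cfloat = std::complex<float>;

// In-place forward FFT (sign convention exp(-2*pi*i*k*n/N), as in MATLAB fft).
// Assumption: data.size() is a power of two.
void fft_radix2(std::vector<cfloat>& data) {
    const std::size_t n = data.size();
    if (n < 2) return;

    // Reorder the input into bit-reversed index order.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(data[i], data[j]);
    }

    // Combine sub-transforms of length len/2 into transforms of length len.
    const double pi = std::acos(-1.0);
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double angle = -2.0 * pi / static_cast<double>(len);
        for (std::size_t start = 0; start < n; start += len) {
            for (std::size_t k = 0; k < len / 2; ++k) {
                const double phase = angle * static_cast<double>(k);
                const cfloat w(static_cast<float>(std::cos(phase)),
                               static_cast<float>(std::sin(phase)));
                const cfloat u = data[start + k];
                const cfloat v = data[start + k + len / 2] * w;
                data[start + k] = u + v;            // butterfly, upper output
                data[start + k + len / 2] = u - v;  // butterfly, lower output
            }
        }
    }
}
```

Applying it to every row and then every column of a matrix gives a 2D transform to compare against, up to single-precision rounding.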

I have already written a CUDA C program for calculating the 2D FFT of any input array, so I will now try to optimize that code using the profiler. I will post here again if I face any difficulty in doing that.
Thank you very much for all your help so far.
Thanks again.

Regards,
Niraj