Problem probably in cuFFT

Dear All

I profiled (see pictures below) a program with code like the one below. There is a time gap between tudo1 and cuFFT that I cannot explain. Any clue? Is it a cuFFT bug?

I am getting correct results; only the runtime is too high.

In the code below, SYMB = 32.

#pragma omp parallel num_threads(16)

tudo1<<<SYMB/32, 32, 0, stream>>>(symbols, comp1, framematch + (z5 << 1)*512, framematch + ((z5 << 1) + 1)*512, NRSAMPLES, NSC, MAXSPR, (u - t - 1)*qk[g], z5);
cufftPlan1d(&plan[0], SYMB, CUFFT_C2C, 1);
cufftExecC2C(plan[0], (cufftComplex *)symbols, (cufftComplex *)symbols1, CUFFT_INVERSE);



Luis Gonçalves

It seems that you are still puzzled by the pesky cudaFree you cannot explain or place.

You should have posted an update under your prior post, so that one can more easily interpret the latest results and make new suggestions.

I think this is what is known thus far:
a) The cudaFree call is successful; otherwise the program should crash. If you feel this is too presumptuous, you could always step through the program with the debugger; the debugger should at least take note if the cudaFree fails.
b) There is the hypothesis that cudaFree is called indirectly by your program, rather than by you calling it directly: you are calling some API that requires a device memory allocation, and that subsequently needs to clean up after itself.

What now catches my eye is that the cudaFree call occurs for a single thread only, as opposed to each thread.
This might imply that it originates from an API call made before your numerous OpenMP threads are started, rather than from an API call within them, and it may very well relate to your usage of OpenMP.

Therefore, these would be my suggestions:

a) As a test case, launch only a single thread instead of the numerous threads you launch now, and make it an ordinary thread rather than an OpenMP thread; profile and see if the cudaFree is still present. In other words, launch only a single task, so that you do not need mechanisms like OpenMP or streams, and use it as a test case.
b) Alternatively, instead of OpenMP, use a single host thread, create a number of streams, and issue all your work into those streams rather than from within OpenMP threads; profile and see if the cudaFree is still present.
c) Or use the debugger and step through your program; try to step into each API call you make before launching the OpenMP threads, and note whether it contains a memory allocation. I have not yet attempted to step into CUDA APIs, so I do not know for certain whether it is possible.
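The streams-based alternative could be sketched roughly as below; this is a minimal sketch only, and NSTREAMS, buf, and dummyKernel are placeholder names standing in for your kernel and buffers, not taken from your code:

```cuda
#include <cuda_runtime.h>

#define NSTREAMS 4

// Stand-in for a kernel such as tudo1; the real arguments are omitted here.
__global__ void dummyKernel(float *data) {
    data[threadIdx.x] += 1.0f;
}

int main(void) {
    cudaStream_t streams[NSTREAMS];
    float *buf[NSTREAMS];

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], 32 * sizeof(float));
    }

    // All launches come from one host thread; the streams provide the overlap
    // that the OpenMP threads were providing before.
    for (int i = 0; i < NSTREAMS; ++i)
        dummyKernel<<<1, 32, 0, streams[i]>>>(buf[i]);

    cudaDeviceSynchronize();   // profile this version and check whether the cudaFree remains

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(buf[i]);
    }
    return 0;
}
```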

Do you see the time gap between tudo1 and cuFFT (Radix) in the profile (first figure)?

cudaFree is called from cuFFT.

You may want to move the plan creation up:

cufftPlan1d(&plan[0], SYMB, CUFFT_C2C, 1);

tudo1<<<SYMB/32, 32, 0, stream>>>(symbols, comp1, framematch + (z5 << 1)*512, framematch + ((z5 << 1) + 1)*512, NRSAMPLES, NSC, MAXSPR, (u - t - 1)*qk[g], z5);
cufftExecC2C(plan[0], (cufftComplex *)symbols, (cufftComplex *)symbols1, CUFFT_INVERSE);

You may also want to consider using the batch capabilities of cuFFT and getting rid of the OpenMP; your code is not very efficient as it is.
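The batched approach could look roughly like this; a minimal sketch, assuming placeholder names NUM_BATCHES, d_in, and d_out that are not in the original code:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

#define SYMB 32          // FFT size, as in the original post
#define NUM_BATCHES 16   // e.g. one batch per former OpenMP thread (assumption)

int main(void) {
    cufftComplex *d_in, *d_out;
    cudaMalloc(&d_in,  sizeof(cufftComplex) * SYMB * NUM_BATCHES);
    cudaMalloc(&d_out, sizeof(cufftComplex) * SYMB * NUM_BATCHES);

    // One plan for all transforms: the batch parameter tells cuFFT to run
    // NUM_BATCHES independent SYMB-point FFTs from a single call.
    cufftHandle plan;
    cufftPlan1d(&plan, SYMB, CUFFT_C2C, NUM_BATCHES);

    cufftExecC2C(plan, d_in, d_out, CUFFT_INVERSE);
    cudaDeviceSynchronize();

    cufftDestroy(plan);   // the plan's workspace is allocated/freed once, not per transform
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

With the plan created once for the whole batch, the plan's internal allocation and cleanup happen a single time instead of once per thread.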

Thanks for the help. I moved everything related to the plan into the initialization code. Now the core code takes 17 ms.

But why does cuFFT call cudaFree? Why does it take so long? And why is it called only once, regardless of the number of transforms?