FFT on CUDA: newbie has lots of problems!

Hi all,
I’m new to this forum, so please forgive any mistakes I make… and my bad English too!
I’m a Ph.D. student in computer science and I work mainly in the audio field (3D audio and things like that).

I just started a project related to my thesis and I’ve got a HUGE number of problems.
I developed a very simple program based on the CUFFT example. But… I think the example itself has some problems. By definition, if you have X and Y as input signals whose lengths are, let’s say, Nx and Ny, the convolution will have length Nx+Ny-1, while in the example it is something “strange”: the length of the longest signal plus half of the shortest. Maybe I’m missing something, but… this is not as important as the fact that if I do an FFT and then an IFFT on X with that example, I don’t get the same signal back! Which is even worse!!!
So basically what I understood is that there is a problem with a “scaling factor” that “seems” related to the size of the window. But I can’t figure out what the problem is.
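A scaling factor tied to the transform size is what an unnormalized FFT produces: cuFFT does not apply a 1/N factor, so a forward transform followed by an inverse one returns the input multiplied by the number of elements N. Below is a minimal sketch of that effect in plain Python, using a naive O(N²) DFT instead of CUFFT itself. The `dft` helper and the sample values are made up for illustration only.

```python
import cmath

def dft(x, sign):
    """Naive unnormalized DFT. sign=-1 is the forward transform,
    sign=+1 is the inverse WITHOUT the 1/N factor."""
    n = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for k in range(n))
            for j in range(n)]

# Hypothetical test signal, any values would do.
signal = [1.0, -0.5, 0.25, 0.0, 0.75, -1.0, 0.5, 0.125]
n = len(signal)

spectrum = dft(signal, -1)
round_trip = dft(spectrum, +1)      # equals n * signal, not signal

# Dividing by n undoes the scaling and recovers the input.
recovered = [v.real / n for v in round_trip]

max_error = max(abs(a - b) for a, b in zip(signal, recovered))
scale = round_trip[0].real / signal[0]
```

Here `scale` comes out equal to the transform length, which would explain a factor that “seems” related to the window size.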
I attach the source code that gives me trouble. I’m using libsndfile to handle audio files, but the core can easily be extracted and applied to any array. I rewrote the “size calculation”, so now it is Xn+Yn-1.
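To make the size calculation concrete, here is a small sketch (plain Python, not the attached CUDA code) of linear convolution through the frequency domain: both signals are zero-padded to Nx+Ny-1, transformed, multiplied point by point, transformed back and divided by the transform length. The helper names and input values are hypothetical. A real implementation would probably pad further, up to a size the FFT library handles efficiently.

```python
import cmath

def dft(x, sign):
    """Naive unnormalized DFT (sign=-1 forward, sign=+1 inverse without 1/N)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for k in range(n))
            for j in range(n)]

def direct_convolution(x, y):
    """Reference time-domain convolution, output length len(x) + len(y) - 1."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            out[i + j] += a * b
    return out

def fft_convolution(x, y):
    """Linear convolution via zero-padding to len(x) + len(y) - 1."""
    n = len(x) + len(y) - 1
    x_padded = list(x) + [0.0] * (n - len(x))
    y_padded = list(y) + [0.0] * (n - len(y))
    product = [a * b for a, b in zip(dft(x_padded, -1), dft(y_padded, -1))]
    # The inverse is unnormalized, so divide by n here.
    return [v.real / n for v in dft(product, +1)]

x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, -1.0, 0.25]
expected = direct_convolution(x, y)
result = fft_convolution(x, y)
```

Without the zero-padding the product of the spectra would give a circular convolution, where the tail of the result wraps around onto its beginning.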
I would really appreciate any kind of help.
I work on Mac OS X 10.6.8 and I’d really like to work on Windows 7 too, but I can’t get it to work there…

Thank you very much for your kind cooperation.
Best regards, Murivan!

Can no one give me some advice?
I solved the issues related to VS by uninstalling and then reinstalling everything…
Now… I can compile the code on both machines, but on my MacBook Pro (equipped with a 9400GM) it gives correct results, while on Windows 7 with a Quadro FX 4600 it gives wrong results. I know what the correct result is, since I use as input two signals that MUST give the sinc signal as output. Sinc is sin(x)/x, for a band-limited signal.

Any help appreciated!
GPUconv.cu (13 KB)

I solved most of the issues but some points still remain unclear.
Since I use CUFFT, how can I redistribute my code? Do I need to install the CUDA toolkit on each machine? How can I make a DLL (on Windows) with my code? (I obviously know what it means in theory, but… since I’m mostly a Mac user this poses some problems for me. I have VS 2010, Parallel Nsight and CUDA toolkit 4.)
Then there are some problems of “fine tuning” and optimization.
Since I convolve two signals in the FFT domain and then go back to the time domain, I have to “rescale” the result. What I do now is find the maximum and then divide by it (since I need to normalize between -1 and 1). The search is sequential; I know this can be posed as a “reduction” problem, but I still have to get comfortable with parallel programming on GPUs.
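As a rough picture of the “reduction” formulation, the sketch below (plain Python, sequential, with hypothetical function names) finds the peak absolute value by pairwise comparison: each pass halves the number of candidates, and all comparisons within a pass are independent, which is the part a GPU kernel would hand out to separate threads. It assumes that normalizing between -1 and 1 means dividing by the largest absolute value.

```python
def max_abs_reduction(samples):
    """Tree-style reduction of |x|. Each pass compares element i with
    element i + half, so the candidate list shrinks by about half."""
    values = [abs(s) for s in samples]
    if not values:
        return 0.0
    while len(values) > 1:
        half = (len(values) + 1) // 2
        # On a GPU, every i in range(half) could be one thread.
        values = [
            max(values[i], values[i + half]) if i + half < len(values) else values[i]
            for i in range(half)
        ]
    return values[0]

def normalize_peak(samples):
    """Scale samples so the largest absolute value becomes 1."""
    peak = max_abs_reduction(samples)
    if peak == 0.0:
        return list(samples)        # all-zero input: nothing to scale
    return [s / peak for s in samples]

# Hypothetical convolution output.
samples = [0.2, -3.0, 1.5, 0.7, -0.1]
peak = max_abs_reduction(samples)
normalized = normalize_peak(samples)
```

An existing library routine may already cover this step. For example, cuBLAS provides `cublasIsamax`, which returns the index of the element with the largest absolute value, so a hand-written kernel might not be needed.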
Based on the input size, I choose between doing only one ENORMOUS FFT or splitting it into a number of smaller ones. In the latter case I’m wasting resources, since I process them sequentially; is it possible to parallelize this task? Please note that it is not always possible to load the entire input onto the device, since it might not fit (this is basically why I need to split). Is there anything else I can do to take advantage of parallel programming?
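One common way to split a long convolution is overlap-add, sketched below in plain Python under the assumption that the shorter signal (the kernel) fits on the device while the long one is cut into blocks. Each block is convolved with the kernel on its own and the partial results are summed at the right offsets. The function names and test data are hypothetical, and the per-block convolution is done directly here where a GPU version would use an FFT.

```python
def convolve(x, h):
    """Direct linear convolution, output length len(x) + len(h) - 1."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(h):
            out[i + j] += a * b
    return out

def overlap_add(signal, kernel, block_size):
    """Convolve signal with kernel block by block (overlap-add)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        # Each block is independent of the others, so this call is the
        # part that could run in parallel or as one batched transform.
        partial = convolve(block, kernel)
        # Tails of neighbouring blocks overlap and are added together.
        for i, v in enumerate(partial):
            out[start + i] += v
    return out

# Hypothetical data: 23 samples, 4-tap kernel, blocks of 8 samples.
signal = [float(i % 7 - 3) for i in range(23)]
kernel = [1.0, -2.0, 0.5, 0.25]
blockwise = overlap_add(signal, kernel, block_size=8)
whole = convolve(signal, kernel)
```

Since the blocks do not depend on each other, several of them could be transformed together. cuFFT plans take a batch count (for example the last argument of `cufftPlan1d`), which may be a starting point, with the batch size chosen so that the blocks fit in device memory.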

Thank you for your collaboration.

Best.