# Window function for FFT

Hello all,

I would like to implement a window function on the graphics card.

I have a large array in global memory (1024*1000 data points: 1000 waveforms with 1024 sampling points each). Before I calculate the FFT, each signal must be multiplied by a Hann window. The Hann window has 1024 floating-point coefficients, and I would like to multiply every waveform element-wise by these 1024 coefficients.

First, waveform 1:

--> waveform_1[0] = waveform[0] * coefficient[0];

--> waveform_1[1] = waveform[1] * coefficient[1];

--> waveform_1[1024] = waveform[1024] * coefficient[1024];

Then waveform 2:

--> waveform_2[0] = waveform[0] * coefficient[0];

--> waveform_2[1] = waveform[1] * coefficient[1];

--> waveform_2[1024] = waveform[1024] * coefficient[1024];

…

Then waveform 1000:

--> waveform_1000[0] = waveform[0] * coefficient[0];

--> waveform_1000[1] = waveform[1] * coefficient[1];

--> waveform_1000[1024] = waveform[1024] * coefficient[1024];

You will receive a segmentation fault! With 1024 elements the last valid index is 1023, so index 1024 is out of bounds.

I don’t like to solve your riddles.

Let me guess …

You haven’t read any book yet, right?

And you don't have the faintest clue how to solve your problem on your own, even after many people have been nice to you and given you so many hints and solutions to your previous problems.

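```
__global__ void window(float **waveformOut, float **waveformIn, float *coeff, int nwav, int len)
{
    // tidx: sample index within a waveform, tidy: waveform index
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;

    if( tidx >= len )
        return;
    if( tidy >= nwav )
        return;

    waveformOut[tidy][tidx] = waveformIn[tidy][tidx] * coeff[tidx];

    /*
    Dear Pimbolie, learn a little yourself. It does no harm.
    */
}
```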

Launch the kernel with appropriate grid and block dimensions. The Programming guide will tell you how to do that.

And I hope I won’t see any questions like “Implement x+y in CUDA”.

Who speaks German here?

Probably many people. You can detect some of them (and me too) by their German-style English.

Maybe it is easier for me to explain the whole thing in German.

1. My input is a one-dimensional array. It contains the 1000 signals. Each signal consists of 1024 samples.
2. Now I want to multiply each individual signal by the filter. I want to multiply the 1st value of the signal by the 1st filter value and, in parallel, the 2nd value of the signal by the 2nd filter value. (You already did this in your example, only with a two-dimensional array.)
3. The result can be stored in the same array, unless it is easier to create a new array.
4. After that I want to compute the FFT. I have already tried this and it works.

`__global__ void window(float *waveform, float *coeff, int nwav, int len) { ... }`

I know that in C this can be done with a nested for loop. However, at the moment I still have trouble picturing the whole thing in terms of the graphics card's structure.

float data[1024 * 1000];
float coeff[1024];
int i, j;

for (i = 0; i < 1024 * 1000; i = i + 1024)
{
for (j = 0; j < 1024; j++)
{
data[i + j] = data[i + j] * coeff[j];
}
}

Also German seems not to be your native language. ;-)

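```
__global__ void window(float *waveform, float *coeff, int nwav, int len)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;  // sample index
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;  // waveform index

    if( tidx >= len )
        return;
    if( tidy >= nwav )
        return;

    // flat 1D layout: waveform number tidy starts at offset tidy * len
    waveform[tidy * len + tidx] = waveform[tidy * len + tidx] * coeff[tidx];

    /*
    or: waveform[tidy * len + tidx] *= coeff[tidx];
    */

    /*
    German or English: your questions show that you have 0 (in words: zero) experience in C.
    Why on earth are you messing around with CUDA???
    */
}
```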

I have been working with LabVIEW for several years. I want to build a DLL from the functions and then use them in LabVIEW. I also programmed FPGAs in VHDL for a very long time, which is why my C knowledge is somewhat rusty. I know that `**` is a pointer to a pointer, but I first have to look up again what exactly that meant. I am trying to get back into it, but I have so many other things to do on the side that it is always very difficult.

I allocate a one-dimensional array on the GPU and on the PC (I can manage that), then I copy the data from the PC to the graphics card (I can manage that too).

Now I have to run the kernel. How do I call the function again when I have a one-dimensional array?

window(? ,coeff, 1000, 1024)

Straightforward:

```
float data[1024*1000];
float coeff[1024];

// ...

window(data, coeff, 1000, 1024);
```

That simple after all ^^

I will look things up in a C tutorial again today ^^

1. Is your first solution faster (better), or the last one?

Sorry for the mistakes, but the phone keeps ringing or a colleague wants something from me. I saw that you were born in 1979. That seems to have been a good year ;)

2. With how many threads can I run this optimally? window<<<1024*1000/512,512>>>(data, coeff, 1000, 1024) --> Will this give me optimal performance?

If I want to measure the kernel's execution time in ms, there are the timer functions.

First you have to create a timer. Then you start the timer before launching the kernel. Afterwards you stop the timer and read out its value.

For example:

start timer
window<<<1024*1000/512,512>>>(data, coeff, 1000, 1024)
stop timer
read timer --> I know these are not the right commands (I do not have them in my head right now). But with this program structure, do I really measure only the execution time of the kernel function?

On the CPU I would prefer the second version, because the chance of cache misses is slightly lower when working on one contiguous block of linear memory. But on the GPU???

I don't know. <<<1024*1000/512,512>>> seems quite sensible. There is a section about that in the Programming Guide.

I use cudaEvent (see the Programming Guide) for time measurement.
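As a rough illustration of the one-dimensional launch discussed above, the following plain C sketch emulates what each thread would do if one thread handled exactly one array element: thread `idx` multiplies `data[idx]` by `coeff[idx % len]`. The functions `window_element` and `window_1d` are hypothetical CPU stand-ins, not CUDA code. On the GPU, `idx` would be computed as `blockIdx.x * blockDim.x + threadIdx.x`.

```c
#include <stddef.h>

/* Work done by one hypothetical GPU thread with global index idx. */
void window_element(float *data, const float *coeff,
                    size_t total, size_t len, size_t idx)
{
    if (idx >= total) {
        return;                      /* guard for a partially filled last block */
    }
    data[idx] *= coeff[idx % len];   /* idx % len is the sample position */
}

/* CPU stand-in for a 1D launch of `blocks` blocks with `threads` threads each. */
void window_1d(float *data, const float *coeff, size_t nwav, size_t len,
               size_t blocks, size_t threads)
{
    size_t total = nwav * len;
    for (size_t b = 0; b < blocks; ++b) {
        for (size_t t = 0; t < threads; ++t) {
            window_element(data, coeff, total, len, b * threads + t);
        }
    }
}
```

With this mapping the kernel needs no second index, so a launch configured only in x covers the whole array as long as blocks times threads is at least nwav times len.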

In any case, many thanks for your effort and the great hints. I will try it out at home this evening. That will get me into trouble with my wife again :D but what wouldn't one do for work.

ditto >.<

ditto

I have 3 errors in my code.

First error: argument of type “float *” is incompatible with parameter of type “float **”

Second error: identifier “y” is undefined

Third error: identifier “x” is undefined

Can anybody help me?

Here is my code:

```
// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cufft.h>
#include <cutil_inline.h>

// Complex data type
typedef float2 Complex;

static __global__ void window(float **, float *, int, int);

unsigned int timer = 0;
float elapsedTimeInMs = 0.0f;

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest(int argc, char** argv);

// The filter size is assumed to be a number smaller than the signal size
#define SIGNAL_SIZE 1024*1000
#define FILTER_SIZE 1024

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
    runTest(argc, argv);
    cutilExit(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void runTest(int argc, char** argv)
{
    if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
        cutilDeviceInit(argc, argv);
    else
        cudaSetDevice( cutGetMaxGflopsDeviceId() );

    // Allocate host memory for the waveform
    float* h_waveform = (float*)malloc(sizeof(float) * SIGNAL_SIZE);
    float* h_coeff = (float*)malloc(sizeof(float) * FILTER_SIZE);

    // Initialize the memory for the signal
    for(unsigned int i = 0; i < SIGNAL_SIZE; ++i)
    {
        h_waveform[i] = 2.4374328223;
    }

    // Initialize the memory for the coefficients
    for(unsigned int i = 0; i < FILTER_SIZE; ++i)
    {
        h_coeff[i] = 4.312312312;
    }

    for(unsigned int i = 0; i < 10; ++i)
    {
        printf("Waveform: %f \n", h_waveform[i]);
    }

    for(unsigned int i = 0; i < 10; ++i)
    {
        printf("Coefficiens: %f \n", h_coeff[i]);
    }

    cutilCheckError( cutCreateTimer( &timer ) );

    int mem_size_waveform = sizeof(float) * SIGNAL_SIZE;
    int mem_size_coeff = sizeof(float) * FILTER_SIZE;

    // Allocate device memory for signal
    float* d_waveform;
    cutilSafeCall(cudaMalloc((void**)&d_waveform, mem_size_waveform));

    float* d_coeff;
    cutilSafeCall(cudaMalloc((void**)&d_coeff, mem_size_coeff));

    // Copy waveform from CPU memory to GPU memory
    cutilSafeCall(cudaMemcpy(d_waveform, h_waveform, mem_size_waveform, cudaMemcpyHostToDevice));

    // Copy coeff from CPU memory to GPU memory
    cutilSafeCall(cudaMemcpy(d_coeff, h_coeff, mem_size_coeff, cudaMemcpyHostToDevice));

    // start the timer
    cutilCheckError( cutStartTimer( timer));

    window<<<SIGNAL_SIZE/512, 512>>>(d_waveform,d_coeff,1000,1024); // --> argument of type "float *" is incompatible with parameter of type "float **"

    // stop the timer
    cutilCheckError( cutStopTimer( timer));
    elapsedTimeInMs = cutGetTimerValue( timer);

    // Check if kernel execution generated an error
    cutilCheckMsg("Kernel execution failed [ ComplexPointwiseMulAndScale ]");

    // Allocate host memory for the result
    float* h_result = (float*)malloc(sizeof(float) * SIGNAL_SIZE);

    // Copy device memory to host
    cutilSafeCall(cudaMemcpy(h_result, d_waveform, mem_size_waveform, cudaMemcpyDeviceToHost));

    // show the result
    for (unsigned int i = 0; i < 10; ++i)
    {
        printf(" Ergebnis: %2.20f \n", h_result[i]);
    }

    printf("\nZeit: %f \n", elapsedTimeInMs);

    // cleanup memory
    free(h_waveform);
    free(h_coeff);
    cutilSafeCall(cudaFree(d_waveform));
    cutilSafeCall(cudaFree(d_coeff));
}

// compute the magnitude
__global__ void window(float **waveform, float *coeff, int nwav, int len)
{
    int tidx = threadIdx.x + blockIdx.x*blockDim.x;
    int tidy = threadIdx.y + blockIdx.y*blockDim.y;

    if( tidx >= len )
    {
        return;
    }

    if( tidy >= nwav )
    {
        return;
    }

    waveform[y*len + x] = waveform[y*len + x] * coeff[x]; // --> identifier "y" is undefined --> identifier "x" is undefined
}
```

I have fixed errors 2 and 3, but now I have a new error.

Here is my code:

`waveform[tidy * number_of_coefficients + tidx] = waveform[tidy * number_of_coefficients + tidx] * coeff[tidx];` --> here is my error message: expression must have arithmetic or enum type

This was my old code:

`__global__ void window(float **waveform, float* coeff, int n_waveform, int number_of_coefficients)` --> Here was the error

My new code is:

```
__global__ void window(float *waveform, float *coeff, int n_waveform, int number_of_coefficients)
{
    int tidx = threadIdx.x + blockIdx.x*blockDim.x;
    int tidy = threadIdx.y + blockIdx.y*blockDim.y;

    if( tidx >= number_of_coefficients )
    {
        return;
    }

    if( tidy >= n_waveform )
    {
        return;
    }

    waveform[tidy * number_of_coefficients + tidx] = waveform[tidy * number_of_coefficients + tidx] * coeff[tidx];
}
```

Now I can compile the project without errors, but the result is correct only for the first waveform.

I create a coeff array: coefficient[0] = 0, coefficient[1] = 1 … coefficient[1023] = 1023

I create a waveform array: waveform[0] = 0, waveform[1] = 1 … waveform[1023999] = 1023999

My result is: --> Result[0] = 0, Result[1] = 1, Result[2] = 4, Result[3] = 9 … Result[1023] = 1046529 --> These results are right

But Result[1024] = 1024 --> This result is wrong. The right result must be 0

But Result[1025] = 1025 --> This element is unchanged as well; the expected value 1025 * 1 = 1025 only happens to match

But Result[1026] = 1026 --> This result is wrong. The right result must be 2052

But Result[1027] = 1027 --> This result is wrong. The right result must be 3081

Result [1023999]

Here is my new complete code:

```
// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cufft.h>
#include <cutil_inline.h>

// Complex data type
typedef float2 Complex;

static __global__ void window(float *, float *, int, int);

unsigned int timer = 0;
float elapsedTimeInMs = 0.0f;

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest(int argc, char** argv);

// The filter size is assumed to be a number smaller than the signal size
#define SIGNAL_SIZE 1024*1000
#define FILTER_SIZE 1024

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
    runTest(argc, argv);
    cutilExit(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void runTest(int argc, char** argv)
{
    if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
        cutilDeviceInit(argc, argv);
    else
        cudaSetDevice( cutGetMaxGflopsDeviceId() );

    // Allocate host memory for the waveform
    float* h_waveform = (float*)malloc(sizeof(float) * SIGNAL_SIZE);
    float* h_coeff = (float*)malloc(sizeof(float) * FILTER_SIZE);

    // Initialize the memory for the signal
    for(unsigned int i = 0; i < SIGNAL_SIZE; ++i)
    {
        //h_waveform[i] = 2.4374328223;
        h_waveform[i] = i;
    }

    // Initialize the memory for the coefficients
    for(unsigned int i = 0; i < FILTER_SIZE; ++i)
    {
        //h_coeff[i] = 4.312312312;
        h_coeff[i] = i;
    }

    for(unsigned int i = 1019; i < 1019 + 10; ++i)
    {
        printf("Waveform: %f \n", h_waveform[i]);
    }

    for(unsigned int i = 0; i < 10; ++i)
    {
        printf("Coefficiens: %f \n", h_coeff[i]);
    }

    cutilCheckError( cutCreateTimer( &timer ) );

    int mem_size_waveform = sizeof(float) * SIGNAL_SIZE;
    int mem_size_coeff = sizeof(float) * FILTER_SIZE;

    // Allocate device memory for signal
    float* d_waveform;
    cutilSafeCall(cudaMalloc((void**)&d_waveform, mem_size_waveform));

    float* d_coeff;
    cutilSafeCall(cudaMalloc((void**)&d_coeff, mem_size_coeff));

    // Copy waveform from CPU memory to GPU memory
    cutilSafeCall(cudaMemcpy(d_waveform, h_waveform, mem_size_waveform, cudaMemcpyHostToDevice));

    // Copy coeff from CPU memory to GPU memory
    cutilSafeCall(cudaMemcpy(d_coeff, h_coeff, mem_size_coeff, cudaMemcpyHostToDevice));

    // start the timer
    cutilCheckError( cutStartTimer( timer));

    window<<<SIGNAL_SIZE/512, 512>>>(d_waveform, d_coeff, 1000, 1024);

    // stop the timer
    cutilCheckError( cutStopTimer( timer));
    elapsedTimeInMs = cutGetTimerValue( timer);

    // Check if kernel execution generated an error
    cutilCheckMsg("Kernel execution failed [ ComplexPointwiseMulAndScale ]");

    // Allocate host memory for the result
    float* h_result = (float*)malloc(sizeof(float) * SIGNAL_SIZE);

    // Copy device memory to host
    cutilSafeCall(cudaMemcpy(h_result, d_waveform, mem_size_waveform, cudaMemcpyDeviceToHost));

    // show the result
    for (unsigned int i = 1019; i < 1019+10; ++i)
    {
        printf(" Ergebnis: %2.20f \n", h_result[i]);
    }

    printf("\nZeit: %f \n", elapsedTimeInMs);

    // cleanup memory
    free(h_waveform);
    free(h_coeff);
    cutilSafeCall(cudaFree(d_waveform));
    cutilSafeCall(cudaFree(d_coeff));
}

// compute the magnitude
__global__ void window(float *waveform, float *coeff, int n_waveform, int number_of_coefficients)
{
    int tidx = threadIdx.x + blockIdx.x*blockDim.x;
    int tidy = threadIdx.y + blockIdx.y*blockDim.y;

    if( tidx >= number_of_coefficients )
    {
        return;
    }

    if( tidy >= n_waveform )
    {
        return;
    }

    waveform[tidy * number_of_coefficients + tidx] = waveform[tidy * number_of_coefficients + tidx] * coeff[tidx];
}
```

Hi,

you have to launch the kernel in a 2-dimensional grid, as I already mentioned. It is something like:

```
dim3 dimGrid;
dim3 dimBlock;

dimBlock.x = 512;
dimBlock.y = 500;

dimGrid.x = 1024/dimBlock.x;
dimGrid.y = 1000/dimBlock.y;

window<<<dimGrid, dimBlock>>>(...)
```

Refer to the Programming Guide (Section 2.2).
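To illustrate why the one-dimensional launch `window<<<SIGNAL_SIZE/512, 512>>>` changed only the first waveform, here is a small CPU emulation in plain C. It relies on the fact that `threadIdx.y` and `blockIdx.y` are 0 when no y dimension is given, so `tidy` is always 0 and every thread with `tidx >= len` returns early. The function names are hypothetical and the loops only stand in for the GPU threads.

```c
/* CPU stand-in for one thread of the 2D-indexed kernel.
   Returns 1 if the thread wrote an element, 0 if it returned early. */
static int window_2d_thread(float *waveform, const float *coeff,
                            int nwav, int len, int tidx, int tidy)
{
    if (tidx >= len)  return 0;
    if (tidy >= nwav) return 0;
    waveform[tidy * len + tidx] *= coeff[tidx];
    return 1;
}

/* Emulates window<<<grid_x, block_x>>>(...): only an x dimension exists,
   so the y coordinate of every thread is 0. Returns the number of
   elements that were actually modified. */
int simulate_1d_launch(float *waveform, const float *coeff,
                       int nwav, int len, int grid_x, int block_x)
{
    int touched = 0;
    for (int bx = 0; bx < grid_x; ++bx) {
        for (int tx = 0; tx < block_x; ++tx) {
            int tidx = tx + bx * block_x;
            int tidy = 0; /* threadIdx.y + blockIdx.y * blockDim.y with no y dimension */
            touched += window_2d_thread(waveform, coeff, nwav, len, tidx, tidy);
        }
    }
    return touched;
}
```

In this emulation only `len` elements are ever written, however many blocks are launched, which matches the observation that results from index 1024 onward keep their input values.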

dim3 dimGrid;
dim3 dimBlock;

dimBlock.x = 512;
dimBlock.y = 500;

dimGrid.x = 1024/dimBlock.x;
dimGrid.y = 1000/dimBlock.y;

window<<<dimGrid, dimBlock>>>(d_waveform, d_coeff, 1000, 1024);

No error was reported when compiling, but the program crashes. I suspect a number is mixed up somewhere, so that it writes to the wrong memory region.

Here is the error message:

First-chance exception at 0x7c812afb in simpleCUFFT.exe: Microsoft C++ exception: cudaError at memory location 0x0012fe84…
The program "[1680] simpleCUFFT.exe: Native" has exited with code -1 (0xffffffff).
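A likely explanation for the crash, offered as an assumption since the thread does not confirm it: a block of 512 × 500 threads contains 256000 threads, far more than CUDA allows per block (512 on early devices, 1024 on later ones), so the launch is rejected. The plain C sketch below checks a block shape against such a limit and derives the grid size by rounding up. `LaunchDims` and `make_launch_dims` are hypothetical helpers, not part of the CUDA API, and the limit is passed in as a parameter rather than queried from a device.

```c
/* Hypothetical helper type, not part of the CUDA API. */
typedef struct {
    int block_x, block_y;   /* threads per block in x and y */
    int grid_x, grid_y;     /* blocks per grid in x and y   */
} LaunchDims;

/* Round-up division so that a partially filled block at the edge is still launched. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Returns 1 and fills *out if block_x * block_y fits within
   max_threads_per_block, otherwise returns 0 and leaves *out untouched.
   len is the number of samples per waveform, nwav the number of waveforms. */
int make_launch_dims(int len, int nwav, int block_x, int block_y,
                     int max_threads_per_block, LaunchDims *out)
{
    if (block_x <= 0 || block_y <= 0) {
        return 0;
    }
    if ((long)block_x * (long)block_y > (long)max_threads_per_block) {
        return 0;   /* too many threads in one block */
    }
    out->block_x = block_x;
    out->block_y = block_y;
    out->grid_x = ceil_div(len, block_x);
    out->grid_y = ceil_div(nwav, block_y);
    return 1;
}
```

With a limit of 512, a 512 × 500 block is rejected, while a 256 × 1 block is accepted and yields a 4 × 1000 grid, which covers all 1024 × 1000 samples.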