Window function for FFT

Hallo @ all

I would like to implement a window function on the graphic card.

I have a great array (1024*1000 datapoints --> These are 1000 waveforms. Each Waveform have 1024 sampling points) in the global memory. Bevor I calculate the FFT, the signal must be filtered with a “Hann Window”. The Hann Window have 1024 floating point coefficents. I would like to multiply 1024 floating point coefficents with each waveform.

First waveform 1

–> waveform_1[0] = waveform[0] *coefficent[0];

–> waveform_1[1] = waveform[1] *coefficent[1];

–> waveform_1[1024] = waveform[1024] *coefficent[1024];

Then waveform 2

–> waveform_2[0] = waveform[0] *coefficent[0];

–> waveform_2[1] = waveform[1] *coefficent[1];

–> waveform_2[1024] = waveform[1024] *coefficent[1024];

.

.

.

Then waveform 1000

–> waveform_1000[0] = waveform[0] *coefficent[0];

–> waveform_1000[1] = waveform[1] *coefficent[1];

–> waveform_1000[1024] = waveform[1024] *coefficent[1024];

You will receive a segmentation fault!

So, what’s your question?

Please, take more care in asking questions here?

I don’t like to solve your riddles.

Let me guess …

You haven’t read any book yet, right?

And you’ve not the faintest clou, to solve your problem on your own – even after many people being nice to you have given so many hints and solutions to your previous problems.

Please make an effort!

Let’s adapt one of your existing kernels:

__global__ window(float **waveformOut, float **waveformIn, float* coeff, int nwav, int len)

{

   tidx = threadIdx.x + blockIdx.x*blockDim.x;

   tidy = threadIdx.y + blockIdx.y*blockDim.y;

if( tidx >= len )

	 return;

   if( tidy >= nwav )

	 return;

waveformOut[y][x] = waveformIn[y][x] * coeff[x];

/*

Lieber Pimbolie, lerne selber ein wenig. Es schadet nicht.

*/

}

Launch the kernel with appropriate grid and block dimensions. The Programming guide will tell you how to do that.

And I hope I won’t see any questions like “Implement x+y in CUDA”.

Wer spricht hier deutsch?

Probably many people. You can detect some of them (and me too) by their german-style-english.

Vielleicht ist es einfach für mich das ganze in deutsch zu erklären.

1 .Ich habe als Eingang ein endimensionales Array. Dieses enthält die 1000 Signal. Jedes Signal besteht aus 1024 Abtastwerten
2. Nun möchte ich jedes einzelne Signal mit den Filter multiplizieren. Ich will den 1 Wert des Signals mit dem 1 Filterwert multiplizieren. Parallel dazu den 2 Wert des Signals mit dem 2. Filterwert. (Dies hast Du auch schon in Deinem Beispiel gemacht nur mit einem 2 dimensionalen Array)
3. Das Ergebnis kann im gleich Array abgespeichert werden, es sein den es geht einfacher, indem man ein neus Array erstellt.
4. Danach möchte ich die FFT berechnen. Dies habe ich bereits schon ausprobiert und es funktioniert.

global window(float waveform, float coeff, int nwav, int len)
{
tidx = threadIdx.x + blockIdx.x*blockDim.x;

Ich weiß halt das es in C mit einer verschachtelten For-Schleife geht. Jedoch habe ich momentan noch Probleme mir das ganze in der Grafikkartenstruktur vorzustellen.

float data[1024*1000];
float coeff[1024];

for(i=0; i<(1024*1000); i=i+1024)
{
for(j=0; j<1024; j++)
(
data[i] = data[i] * coeff[j];
)
}

Also German seems not to be your native language. ;-)

__global__ window(float **waveform, float* coeff, int nwav, int len)

{

   tidx = threadIdx.x + blockIdx.x*blockDim.x;

   tidy = threadIdx.y + blockIdx.y*blockDim.y;

if( tidx >= len )

	 return;

   if( tidy >= nwav )

	 return;

waveform[y*len + x] = waveform[y*len + x] * coeff[x];

/*

  oder: waveform[y*len + x] *= coeff[x];

*/

/*

Deutsch oder englisch - Deine Fragen zeigen, dass Du 0 (in Worten Null) Erfahrung in C hast.

Warum um alles in der Welt machst Du mit CUDA rum???

*/

}

Ich arbeit seit einigen Jahren mit Labview. Ich will aus den funktionen eine DLL machen und die Funktionen dann in Labview implementieren. Habe auch sehr lange FPGA in VHDL programmiert. Deshalb sind die Kenntnisse schon etwas eingerostet. Ich weiß das **iein Pointer auf Pointer ist. Jedoch muss ich erst mal wieder nachschlagen was dies genau noch mal war. Ich versuche ja wieder reinzukommen. Aber ich habe halt so viele andere Sachen nebenbei zu tun, das es immer sehr schwierig ist.

Ich reserviere ein eindimensionales Array auf der GPU und auf dem PC (das bekomme ich auch hin) dann kopiere ich die Daten vom PC auf die Grafikkarte (das bekomme ich auch noch hin)

Jetzt muss ich den Kernel ausführen. Wie muss ich denn noch mal die Funktion aufrufen, wenn ich ein eindimensionales Array habe?

window(? ,coeff, 1000, 1024)

Straight forward:

float data[1024*1000];

float coeff[1024];

.

.

.

window(data ,coeff, 1000, 1024)

Doch so einfach ^^

Ich werde heute noch mal in C-Tutorial nachschlagen ^^

  1. Ist jetzt Deine 1. Lösung schneller (besser) oder die letzte?

Sorry für die Fehler aber ständig klingelt das Telefon oder eine Arbeitskollege will etwas von mir. Ich habe gesehn Du bist 1979 gebohren. Das scheint ein guter Jahrgang gewesen zu sein ;)

2Auf wie vielen Threads kann ich das ganze den optimal laufen lassen? window<<<1024*1000/512,512>>>(data, coeff, 1000, 1024) –>Bekomme ich so die optimale Perfomance hin?

Wenn ich die Ausführungzeit des Kernels in ms messen möchte gibt es ja die Timerfunktionen.

Zuerst muss man ja einen Timer erzuegen. Dann startet man den Timer bevor man den Kernel ausführt. Anschließend hält man den Timer an und liest den Timerwert aus.

z.B.

start Timer
window<<<1024*1000/512,512>>>(data, coeff, 1000, 1024)
stop timer
timer auslesen --> ich weiß es sind jetzt nicht die richtigen Befehle (habe diese jetzt nicht im Kopf) Aber messe ich mit diesen Programmaufbau wirklich nur die Ausführungszeit der Kernelfunktion?

:thumbup: :thumbup: :thumbup:

Ad 1)

I would prefer the second version on cpu, 'cause the chance of cache misses is slightly lower when using linear memory en bloc. But on GPU ???

Ad 2)

Don’t know. <<<1024*1000/512,512>>> seems quite sensible. There’s a section about that in the programming guide.

I use cudaEvent (see Programming guide) for time measurement.

Auf jedenfall vielen Dank für Deine Mühe und super Hinweise. Ich werde es heute Abend mal bei mir zu hause ausprobieren. Gibt zwar wieder ärger mit der Frau :D aber was tut man nicht alles für die Arbeit.

dito >.<

dito

I have 3 errors in my code.

First error: argument of type “float *” is incompatible with parameter of type “float **”

Second error: identifier “y” is undefined

Third error: identifier “x” is undefined

[b]

Can anybody help me?[/b]

Here ist my code:

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

// includes, project

#include <cufft.h>

#include <cutil_inline.h>

// Complex data type

typedef float2 Complex;

static global void window(float *, float , int, int);

unsigned int timer = 0;

float elapsedTimeInMs = 0.0f;

////////////////////////////////////////////////////////////////////////////////

// declaration, forward

void runTest(int argc, char** argv);

// The filter size is assumed to be a number smaller than the signal size

#define SIGNAL_SIZE 1024*1000

#define FILTER_SIZE 1024

////////////////////////////////////////////////////////////////////////////////

// Program main

////////////////////////////////////////////////////////////////////////////////

int main(int argc, char** argv)

{

runTest(argc, argv);

cutilExit(argc, argv);

}

////////////////////////////////////////////////////////////////////////////////

//! Run a simple test for CUDA

////////////////////////////////////////////////////////////////////////////////

void runTest(int argc, char** argv)

{

if( cutCheckCmdLineFlag(argc, (const char**)argv, “device”) )

    cutilDeviceInit(argc, argv);

else

    cudaSetDevice( cutGetMaxGflopsDeviceId() );

// Allocate host memory for the waveform

float* h_waveform = (float*)malloc(sizeof(float) * SIGNAL_SIZE);

float* h_coeff = (float*)malloc(sizeof(float) * FILTER_SIZE);

// Initalize the memory for the signal

for(unsigned int i = 0; i < SIGNAL_SIZE; ++i) 

{        

    h_waveform[i] = 2.4374328223;        

}

// Initalize the memory for the signal

for(unsigned int i = 0; i < FILTER_SIZE; ++i) 

{            

    h_coeff[i] = 4.312312312;                              

}

for(unsigned int i = 0; i < 10; ++i)

{                        

    printf("Waveform: %f \n", h_waveform[i]);        

}

for(unsigned int i = 0; i < 10; ++i)

{                        

    printf("Coefficiens: %f \n", h_coeff[i]);        

}

cutilCheckError( cutCreateTimer( &timer ) );

int mem_size_waveform = sizeof(float) * SIGNAL_SIZE;

int mem_size_coeff = sizeof(float) * FILTER_SIZE;

// Allocate device memory for signal

float* d_waveform;

cutilSafeCall(cudaMalloc((void**)&d_waveform, mem_size_waveform));

float* d_coeff;

cutilSafeCall(cudaMalloc((void**)&d_coeff, mem_size_coeff));

// Copy waveform from CPU memory to GPU memory

cutilSafeCall(cudaMemcpy(d_waveform, h_waveform, mem_size_waveform, cudaMemcpyHostToDevice));

// Copy coeff from CPU memory to GPU memory

cutilSafeCall(cudaMemcpy(d_coeff, h_coeff, mem_size_coeff, cudaMemcpyHostToDevice));

//den Timer starten

cutilCheckError( cutStartTimer( timer));

window<<<SIGNAL_SIZE/512, 512>>>(d_waveform,d_coeff,1000,1024); --> argument of type “float *” is incompatible with parameter of type “float **”

//den Timer anhalten

cutilCheckError( cutStopTimer( timer));

elapsedTimeInMs = cutGetTimerValue( timer);

// Check if kernel execution generated and error

cutilCheckMsg("Kernel execution failed [ ComplexPointwiseMulAndScale ]");

// Allocate host memory for the result

float* h_result = (float*)malloc(sizeof(float) * SIGNAL_SIZE);

// Copy device memory to host

cutilSafeCall(cudaMemcpy(h_result, d_waveform, mem_size_waveform, cudaMemcpyDeviceToHost));

// Das Ergebnis anzeigen

for (unsigned int i = 0; i < 10; ++i) 

{

    printf(" Ergebnis: %2.20f \n", h_result[i]);        

}

printf("\nZeit: %f \n", elapsedTimeInMs);

// cleanup memory

free(h_waveform);

free(h_coeff);      

cutilSafeCall(cudaFree(d_waveform));

cutilSafeCall(cudaFree(d_coeff));

cudaThreadExit();

}

// Betrag berechnen

global void window(float *waveform, float coeff, int nwav, int len)

{

int tidx = threadIdx.x + blockIdx.x*blockDim.x;

int tidy = threadIdx.y + blockIdx.y*blockDim.y;

if( tidx >= len )

{

    return;

}

if( tidy >= nwav )

{

    return;

}

waveform[ylen + x] = waveform[ylen + x] * coeff; -->identifier “y” is undefined --> identifier “x” is undefined

}

I have fixed the error 2 and 3 but now I have a new error.

Here my code:

waveform[tidy * number_of_coefficients + tidx] = waveform[tidy * number_of_coefficients + tidx] * coeff[tidx]; --> here ist my error message :expression must have arithmetic or enum type

This was my old code:

global void window(float **waveform, float* coeff, int n_waveform, int number_of_coefficients) --> Here was the error

My new code is :

global void window(float waveform, float coeff, int n_waveform, int number_of_coefficients)

{

nt tidx = threadIdx.x + blockIdx.x*blockDim.x;

int tidy = threadIdx.y + blockIdx.y*blockDim.y;

if( tidx >= number_of_coefficients )

{

    return;

}

if( tidy >= n_waveform )

{

    return;

}

waveform[tidy * number_of_coefficients + tidx] = waveform[tidy * number_of_coefficients + tidx] * coeff[tidx];

}

Now I can compile the projekt without errros. But the result is right only for the first waveform.

I create a coeff array. The coefficient[0] = 0, the coefficient[2] = 1 … the coefficient[1024] = 1023

I create a waveform array. The waveform[0] = 0, waveform[1] = 1 … waveform[1023999] = 1023999

But my Result is: --> Result[0] = 0, Result[1] = 1, Result[2] = 4, Result[3] = 9 … Result[1023] = 1046529 --> This results are right

But Result[1024] = 1024 --> This Result is frong. The right result must be 0

But Result[1024] = 1025 --> This Result is frong. The right result must be 1025

But Result[1024] = 1026 --> This Result is frong. The right result must be 2052

But Result[1024] = 1027 --> This Result is frong. The right result must be 4108

Result [1023999]

Here is my new complete code:

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

// includes, project

#include <cufft.h>

#include <cutil_inline.h>

// Complex data type

typedef float2 Complex;

static global void window(float , float , int, int);

unsigned int timer = 0;

float elapsedTimeInMs = 0.0f;

////////////////////////////////////////////////////////////////////////////////

// declaration, forward

void runTest(int argc, char** argv);

// The filter size is assumed to be a number smaller than the signal size

#define SIGNAL_SIZE 1024*1000

#define FILTER_SIZE 1024

////////////////////////////////////////////////////////////////////////////////

// Program main

////////////////////////////////////////////////////////////////////////////////

int main(int argc, char** argv)

{

runTest(argc, argv);

cutilExit(argc, argv);

}

////////////////////////////////////////////////////////////////////////////////

//! Run a simple test for CUDA

////////////////////////////////////////////////////////////////////////////////

void runTest(int argc, char** argv)

{

if( cutCheckCmdLineFlag(argc, (const char**)argv, “device”) )

    cutilDeviceInit(argc, argv);

else

    cudaSetDevice( cutGetMaxGflopsDeviceId() );

// Allocate host memory for the waveform

float* h_waveform = (float*)malloc(sizeof(float) * SIGNAL_SIZE);

float* h_coeff = (float*)malloc(sizeof(float) * FILTER_SIZE);

// Initalize the memory for the signal

for(unsigned int i = 0; i < SIGNAL_SIZE; ++i) 

{        

    //h_waveform[i] = 2.4374328223;

    h_waveform[i] = i;

}

// Initalize the memory for the signal

for(unsigned int i = 0; i < FILTER_SIZE; ++i) 

{            

    //h_coeff[i] = 4.312312312;

    h_coeff[i] = i;

}

for(unsigned int i = 1019; i < 1019 + 10; ++i)

{                        

    printf("Waveform: %f \n", h_waveform[i]);        

}

for(unsigned int i = 0; i < 10; ++i)

{                        

    printf("Coefficiens: %f \n", h_coeff[i]);        

}

cutilCheckError( cutCreateTimer( &timer ) );

int mem_size_waveform = sizeof(float) * SIGNAL_SIZE;

int mem_size_coeff = sizeof(float) * FILTER_SIZE;

// Allocate device memory for signal

float* d_waveform;

cutilSafeCall(cudaMalloc((void**)&d_waveform, mem_size_waveform));

float* d_coeff;

cutilSafeCall(cudaMalloc((void**)&d_coeff, mem_size_coeff));

// Copy waveform from CPU memory to GPU memory

cutilSafeCall(cudaMemcpy(d_waveform, h_waveform, mem_size_waveform, cudaMemcpyHostToDevice));

// Copy coeff from CPU memory to GPU memory

cutilSafeCall(cudaMemcpy(d_coeff, h_coeff, mem_size_coeff, cudaMemcpyHostToDevice));

//den Timer starten

cutilCheckError( cutStartTimer( timer));

window<<<SIGNAL_SIZE/512, 512>>>(d_waveform, d_coeff, 1000, 1024);

//den Timer anhalten

cutilCheckError( cutStopTimer( timer));

elapsedTimeInMs = cutGetTimerValue( timer);

// Check if kernel execution generated and error

cutilCheckMsg("Kernel execution failed [ ComplexPointwiseMulAndScale ]");

// Allocate host memory for the result

float* h_result = (float*)malloc(sizeof(float) * SIGNAL_SIZE);

// Copy device memory to host

cutilSafeCall(cudaMemcpy(h_result, d_waveform, mem_size_waveform, cudaMemcpyDeviceToHost));

// Das Ergebnis anzeigen

for (unsigned int i = 1019; i < 1019+10; ++i) 

{

    printf(" Ergebnis: %2.20f \n", h_result[i]);        

}

printf("\nZeit: %f \n", elapsedTimeInMs);

// cleanup memory

free(h_waveform);

free(h_coeff);      

cutilSafeCall(cudaFree(d_waveform));

cutilSafeCall(cudaFree(d_coeff));

cudaThreadExit();

}

// Betrag berechnen

global void window(float waveform, float coeff, int n_waveform, int number_of_coefficients)

{

int tidx = threadIdx.x + blockIdx.x*blockDim.x;

int tidy = threadIdx.y + blockIdx.y*blockDim.y;

if( tidx >= number_of_coefficients )

{

    return;

}

if( tidy >= n_waveform )

{

    return;

}

waveform[tidy * number_of_coefficients + tidx] = waveform[tidy * number_of_coefficients + tidx] * coeff[tidx];

}

Hi,

you have to launch the kernel in a 2-dimensional grid. I already mentioned :ph34r: . It is something like:

dim3 dimGrid;

dim3 dimBlock;

dimBlock.x = 512;

dimBlock.y = 500;

dimGrid.x = 1024/dimBlock.x;

dimGrid.y = 1000/dimBlock.y;

window<<<dimGrid, dimBlock>>>(...)

Refer to the Programming Guide (Section 2.2).

dim3 dimGrid;
dim3 dimBlock;

dimBlock.x = 512;
dimBlock.y = 500;

dimGrid.x = 1024/dimBlock.x;
dimGrid.y = 1000/dimBlock.y;

window<<<dimGrid, dimBlock>>>(d_waveform, d_coeff, 1000, 1024);

Beim compilieren wurde kein Fehler angezeigt, jedoch stürtzt das Programm ab. Ich denke mal irgendwo ist ein Zahlendreher, das in den falschen Bereich geschrieben wird.

Hier ist die Fehlermeldung:

Eine Ausnahme (erste Chance) bei 0x7c812afb in simpleCUFFT.exe: Microsoft C+±Ausnahme: cudaError an Speicherposition 0x0012fe84…
Das Programm “[1680] simpleCUFFT.exe: Systemeigen” wurde mit Code -1 (0xffffffff) beendet.