Implement this function (x^2+y^2)/4096

Hi @ all

After a CUFFT (real to complex) I have 2 millions points from datatype float 2 in the global memory. Now I want to calculate the absolute value with the following function:

(x^2+y^2)/4096 → the number 4096 is a constant

Can anybody help me to solve this problem? The results shall be stored in a new memory.

My idee is:

  1. square all 2 million values.
  2. add x^2 and y^2 and save in a new global memory → now I have 1 million values
  3. divide all 1 million values by 4096.

It is better to implement this 3 steps in one or three kernelfunctions?

__global__ convert(float2 *indata, float*out_data, unsigned int num_values) {

index = threadIdx.x + blockIdx.x*blockDim.x;

float2 val = indata[index];

out_data[index] = (val.x * val.x + val.y * val.y) / 4096.0f;

}

This should do what you want.

Thanks for your answer. I will test in some hours.

Avoid the division by passing 1/4096 to the kernel and multiply by that value instead? Dont know if the compiler is smart enough to do it on its own.

I don’t think it is smart enough to do division by itself, but the divide by multiplying works by overflowing integers, not floats.

Here is the page anyways: [url=“Hackersdelight.org”]Hackersdelight.org

I didnt mean any integer magic, i just meant call the kernel with k<<<>>>(stuff,stuff,1.0f/4096f)

k(stuff,stuff,invconstant)
{


something*invconstant;
}

Does it really matter? That kernel is going to memory bandwidth bound anyways.

You might be better leaving it as an integer division by 4096, which could be converted to a subtraction on the exponent of the floating point number. Dependent on the exact hardware details of the GPU’s execution units, that might be faster. You’d have to try it and see.

Of course, as MisterAnderson points out, this particular kernel is likely to be memory bound anyway…

This implementation does not work. I get an error during the compilation.

Here is my errorinformation: operator “=” matches these operands

Here is my complete code:

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

// includes, project

#include <cufft.h>

#include <cutil_inline.h>

// Complex data type

typedef float2 Complex;

static global void absolute_value(Complex*);

////////////////////////////////////////////////////////////////////////////////

// declaration, forward

void runTest(int argc, char** argv);

// The filter size is assumed to be a number smaller than the signal size

#define SIGNAL_SIZE 512*256

////////////////////////////////////////////////////////////////////////////////

// Program main

////////////////////////////////////////////////////////////////////////////////

int main(int argc, char** argv)

{

runTest(argc, argv);

cutilExit(argc, argv);

}

////////////////////////////////////////////////////////////////////////////////

//! Run a simple test for CUDA

////////////////////////////////////////////////////////////////////////////////

void runTest(int argc, char** argv)

{

if( cutCheckCmdLineFlag(argc, (const char**)argv, “device”) )

    cutilDeviceInit(argc, argv);

else

    cudaSetDevice( cutGetMaxGflopsDeviceId() );

// Allocate host memory for the signal

Complex* h_signal_a = (Complex*)malloc(sizeof(Complex) * SIGNAL_SIZE);

// Initalize the memory for the signal

for(unsigned int i = 0; i < SIGNAL_SIZE; ++i) 

{

    h_signal_a[i].x = i;

    h_signal_a[i].y = 0;

printf(“Wert: %f \n”, h_signal_a[i].x);

}

int mem_size = sizeof(Complex) * SIGNAL_SIZE;

// Allocate device memory for signal

Complex* d_signal_a;

cutilSafeCall(cudaMalloc((void**)&d_signal_a, mem_size));

// Copy host memory to device

cutilSafeCall(cudaMemcpy(d_signal_a, h_signal_a, mem_size, cudaMemcpyHostToDevice));

absolute_value<<<SIGNAL_SIZE/512, 512>>>(d_signal_a);

// Check if kernel execution generated and error

cutilCheckMsg("Kernel execution failed [ ComplexPointwiseMulAndScale ]");

// Allocate host memory for the result

Complex* h_result = (Complex*)malloc(sizeof(Complex) * SIGNAL_SIZE);

// Copy device memory to host

cutilSafeCall(cudaMemcpy(h_signal_a, d_signal_a, mem_size, cudaMemcpyDeviceToHost));

// Das Ergebnis anzeigen

for (unsigned int i = 0; i < SIGNAL_SIZE; ++i) 

{

    printf(" Ergebnis: %f \n", h_signal_a[i].x);        

}

// cleanup memory

free(h_signal_a);   

cutilSafeCall(cudaFree(d_signal_a));

cudaThreadExit();

}

// Betrag berechnen

global void absolute_value(Complex *indata)

{

int index = threadIdx.x + blockIdx.x * blockDim.x;

float2 val = indata[index];

indata[index] <b>=</b> ((val.x * val.x + val.y * val.y)/4096.0f);    // <b>here is the error</b> the compiler dont know the "=" operand for float2

}

indata[index] = ((val.x * val.x + val.y * val.y)/4096.0f); // here is the error the compiler dont know the "=" operand for float2

It looks like you’re trying to assign a single value ((val.x * val.x + val.y * val.y)/4096.0f) to a float2.

I haven’t looked over the rest of the code but I’d guess you should either use just a standard float or do something like assign to indata[index].x

This is not relevant to your question, but still. Having smth like this:

#define SIGNAL_SIZE 512*256

in the code is asking for trouble.

Imagine what a statement

float mean = sum(values) / SIGNAL_SIZE;

would do.

I ALWAYS place a #define’d arithmetic expression into parenthesis.

float mean = sum(values)/SIGNAL_SIZE is not the same as (a^2+b^2)/4096

E.D. Riedijk Version works fine. If you have a better idea, then you can post it here,

@Shifter,

Being a shifter, you could have considered right-shifting bits 12 times to effectuate a divide by 4096.

What cudesnick tried to allude to is the associativity in the code sample.

E.g.

#define VALUE x+y

.

.

.

int a = b * VALUE;

This will result in (b*x)+y !

Further be aware of making mistakes using preprocessor macros like this and similar ones.

E.g. 10*(3/10) equals zero while 10*3/10 equals 3 .