# Implement this function x^y

Hi all,

Is it possible to implement the function x^y in CUDA?

x is a float array and y is a float constant.

Here is an example:

float y = 0.234f;

float x[5] = {1, 2, 3, 4, 5};

x1 = 1^0.234

x2 = 2^0.234

x3 = 3^0.234

x4 = 4^0.234

x5 = 5^0.234

```
__global__ void powerVector(float *in, float *out, float y, unsigned int length)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if( tid >= length )
        return;

    out[tid] = powf(in[tid], y);
}
```

Have a look at this forum topic.

Oh, this is your post!!! :wacko:

I don't understand this code segment:

```
if( tid >= length )
{
    return;
}
```

Does this make the function abort if tid is too large?

For example, if the x vector has 512*256 elements, I call the function "powerVector" with this code:

powerVector<<<512*256/512, 512>>> → How large does the variable tid get? (I have an NVIDIA 9800 GTX+ GPU.)

The tid equals:

0 for the first thread of block 0,

1 for the second thread of block 0,

…

511 for the last thread of block 0,

512 for the first thread of block 1,

…

(512*256/512 - 1)*512 + 511 for the last thread of block 512*256/512 - 1 (the last block).

Indeed, tid is threadIdx.x + blockIdx.x*blockDim.x;

blockIdx.x is the number of the current block (0 … 512*256/512-1)

blockDim.x is the number of threads per block (512)

threadIdx.x is the number of the current thread within the current block (0 … 512-1).

When length is 130 and the block size is 128, you may launch your kernel with <<<2, 128>>>, so there are threads whose tid is greater than or equal to length.

Thanks for the information. I think the Programming Guide is not very helpful for newbies.

Is it possible to use the double-precision pow in CUDA 2.1?

Maximum ulp error pow = 7 --> single precision
Maximum ulp error pow = 2 --> double precision

My idea:

1. Copy the input data (datatype float) into a second buffer (datatype double).
2. Apply the double-precision pow function.
3. Copy the results back into the float output buffer → necessary to save bandwidth when copying the results from GPU to CPU.

Don't forget to convert: a memcpy from a `float *a` to a `double *b` won't work.

You may try this (YOUR_DOUBLE_PREC_POW stands for whichever double-precision pow you end up using):

```
__global__ void powerVector(float *in, float *out, float y, unsigned int length)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if( tid >= length )
        return;

    out[tid] = (float) YOUR_DOUBLE_PREC_POW( (double)in[tid], (double)y );
}
```

Caveat: storing the results in single precision may introduce rounding errors, but that need not be a disadvantage.

–Edit: Meh, ignore me. If people feel like helping you, good for you.

If I use the float datatype, the result is 3^0.0135 = 1.0149417, but I need better precision.

**If I use the double-precision datatype, 3^0.0135 = 0.0 → Does CUDA have a problem with the double-precision datatype?**

**Here is my code:**

```
/* Example showing the use of CUFFT for fast 1D-convolution using FFT. */

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cufft.h>
#include <cutil_inline.h>

static __global__ void absolute_value(double*, double*, double, unsigned int);

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest(int argc, char** argv);

// The filter size is assumed to be a number smaller than the signal size
#define SIGNAL_SIZE 512*256
#define TEST 0.0135

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
    runTest(argc, argv);

    cutilExit(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void runTest(int argc, char** argv)
{
    if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
        cutilDeviceInit(argc, argv);
    else
        cudaSetDevice( cutGetMaxGflopsDeviceId() );

    // Allocate host memory for the signal
    double* h_signal_a = (double*)malloc(sizeof(double) * SIGNAL_SIZE);
    double* h_results  = (double*)malloc(sizeof(double) * SIGNAL_SIZE);

    // Initialize the memory for the signal
    for(unsigned int i = 0; i < SIGNAL_SIZE; ++i)
    {
        h_signal_a[i] = 3.0;
    }

    for(unsigned int i = 0; i < 10; ++i)
    {
        printf("Value: %f \n", h_signal_a[i]);
    }

    int mem_size = sizeof(double) * SIGNAL_SIZE;

    // Allocate device memory for the signal
    double* d_signal_a;
    cutilSafeCall(cudaMalloc((void**)&d_signal_a, mem_size));

    double* d_signal_b;
    cutilSafeCall(cudaMalloc((void**)&d_signal_b, mem_size));

    // Copy host memory to device
    cutilSafeCall(cudaMemcpy(d_signal_a, h_signal_a, mem_size, cudaMemcpyHostToDevice));

    // Raise each element of the signal to the power of TEST
    absolute_value<<<SIGNAL_SIZE/512, 512>>>(d_signal_a, d_signal_b, TEST, SIGNAL_SIZE);

    // Check if kernel execution generated an error
    cutilCheckMsg("Kernel execution failed [ absolute_value ]");

    // Copy device memory to host
    cutilSafeCall(cudaMemcpy(h_results, d_signal_b, mem_size, cudaMemcpyDeviceToHost));

    // Print the results
    for (unsigned int i = 0; i < 10; ++i)
    {
        printf(" Result: %2.20f \n", h_results[i]);
    }

    // cleanup memory
    free(h_signal_a);
    free(h_results);
    cutilSafeCall(cudaFree(d_signal_a));
    cutilSafeCall(cudaFree(d_signal_b));
}

// Compute x^y for each element
static __global__ void absolute_value(double* in, double* out, double y, unsigned int length)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;

    if( tid >= length )
    {
        return;
    }

    //out[tid] = pow(in[tid], y);
    out[tid] = powf(in[tid], y);
}
```

Double precision is only available on compute capability 1.3 devices (GTX 200-based products), and you must compile with nvcc -arch sm_13; otherwise doubles are silently demoted to single precision. Your 9800 GTX+ is a compute capability 1.1 device, so it cannot run double-precision code at all.