Implement this function x^y

Hi all,

Is it possible to implement the function x^y in CUDA?

x is a float vector and y is a float constant.

Here is an example:

float y = 0.234f;

float x[5] = {1, 2, 3, 4, 5};

x1 = 1^0.234

x2 = 2^0.234

x3 = 3^0.234

x4 = 4^0.234

x5 = 5^0.234

__global__ void powerVector(float *in, float *out, float y, unsigned int length)
{
    unsigned int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid >= length )
        return;

    out[tid] = powf(in[tid], y);
}


Have a look at this forum topic.

Oh, this is your post!!! :wacko:

Please, read the CUDA Programming Guide. :thumbup:

I don't understand this code segment.

if( tid >= length )

Does this make the thread abort if tid is too large?

For example, the x vector has 512*256 elements, and I call the function "powerVector" with this code:

powerVector<<<512*256/512, 512>>> --> How large does the variable tid get? (I have an NVIDIA 9800GTX+ GPU.)

The tid equals:

0 for the first thread of block 0,

1 for the second thread of block 0,

…

511 for the last thread of block 0,

512 for the first thread of block 1,

…

(512*256/512 - 1)*512 + 511 for the last thread of block 512*256/512 - 1 (the last block).

INDEED the tid IS threadIdx.x + blockIdx.x*blockDim.x;

blockIdx.x is the number of the current block (0 … 512*256/512-1)

blockDim.x is the number of threads per block (512)

threadIdx.x is the number of the current thread within the current block (0 … 512-1).

When length is 130 and the block size is 128, you launch your kernel with <<<2, 128>>>, so there are threads whose tid is greater than or equal to length; the guard makes those surplus threads return without writing.


Thanks for your information. I think the Programming Guide is very hard for a newbie.

Is it possible to use the double-precision pow in CUDA 2.1?

Maximum ulp error pow = 7 --> single precision
Maximum ulp error pow = 2 --> double precision

my idea:

  1. copy the input data (datatype float) into a second buffer (datatype double)
  2. apply the double-precision pow function
  3. copy the results back into the float output buffer --> necessary to save bandwidth (copying results from GPU to CPU)

It is good enough to answer most of your questions. ;-)

Don't forget to convert. A memcpy from float *a to double *b won't work.

AGAIN: READ A BOOK (about the C Programming language).

You may try this:

__global__ void powerVector(float *in, float *out, float y, unsigned int length)
{
    unsigned int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid >= length )
        return;

    out[tid] = (float) YOUR_DOUBLE_PREC_POW( (double)in[tid], (double)y );
}


Caveat: storing the results in single precision may introduce rounding errors. But this need not be a disadvantage.

–Edit: Meh, ignore me. If people feel like helping you, good for you.

Thanks, I will try your program.

If I use the float datatype the result is 3^0.0135 = 1.0149417, but I need better precision.


If I use the double-precision datatype, 3^0.0135 = 0.0 --> does CUDA have a problem with the double-precision datatype?

Here is my code:

/* Example showing the use of CUFFT for fast 1D-convolution using FFT. */

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

// includes, project

#include <cufft.h>

#include <cutil_inline.h>

static __global__ void absolute_value(double*, double*, double, unsigned int);


// declaration, forward

void runTest(int argc, char** argv);

// The filter size is assumed to be a number smaller than the signal size

#define SIGNAL_SIZE (512*256)

#define TEST 0.0135


// Program main


int main(int argc, char** argv)
{
    runTest(argc, argv);

    cutilExit(argc, argv);
}



//! Run a simple test for CUDA


void runTest(int argc, char** argv)
{
    if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
        cutilDeviceInit(argc, argv);
    else
        cudaSetDevice( cutGetMaxGflopsDeviceId() );

// Allocate host memory for the signal

double* h_signal_a = (double*)malloc(sizeof(double) * SIGNAL_SIZE);

double* h_results = (double*)malloc(sizeof(double) * SIGNAL_SIZE);

// Initialize the memory for the signal
for(unsigned int i = 0; i < SIGNAL_SIZE; ++i)
    h_signal_a[i] = 3.0;

for(unsigned int i = 0; i < 10; ++i)
    printf("Value: %f \n", h_signal_a[i]);


int mem_size = sizeof(double) * SIGNAL_SIZE;

// Allocate device memory for signal

double* d_signal_a;

cutilSafeCall(cudaMalloc((void**)&d_signal_a, mem_size));

double* d_signal_b;

cutilSafeCall(cudaMalloc((void**)&d_signal_b, mem_size));

// Copy host memory to device

cutilSafeCall(cudaMemcpy(d_signal_a, h_signal_a, mem_size, cudaMemcpyHostToDevice));

// Multiply the coefficients together and normalize the result

absolute_value<<<SIGNAL_SIZE/512, 512>>>(d_signal_a,d_signal_b,TEST,SIGNAL_SIZE);

// Check if kernel execution generated an error

cutilCheckMsg("Kernel execution failed [ absolute_value ]");

// Copy device memory to host

cutilSafeCall(cudaMemcpy(h_results, d_signal_b, mem_size, cudaMemcpyDeviceToHost));

// Display the result
for (unsigned int i = 0; i < 10; ++i)
    printf(" Result: %2.20f \n", h_results[i]);


// Cleanup memory
free(h_signal_a);
free(h_results);
cutilSafeCall(cudaFree(d_signal_a));
cutilSafeCall(cudaFree(d_signal_b));
}

// Calculate the absolute value
__global__ void absolute_value(double* in, double* out, double y, unsigned int length)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid >= length )
        return;

    //out[tid] = pow(in[tid], y);
    out[tid] = powf(in[tid], y);
}


Double precision is only available on compute capability 1.3 devices (GT200-based products).

As others have advised you on many occasions, please read the documentation…

At the moment I have a 9800GTX+, but in 2 weeks I will have a GTX 285. Then I will test this again.