Hello, my name is Alan and this is my first post. I'm currently working at my university, assisting a professor with his research. He gave me the task of programming a simple function for his program, but it has to be optimized for parallel computing. This is my first week of coding under this architecture. I have already read the programming guide and the best practices guide and taken a look at GPU Gems 3, but I'm still unable to make this function work.

It is a simple one; it has the form:

e^(-gamma*(x-y)(x-y))

where gamma is a constant that multiplies the dot product of (x-y) with itself.
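To make the dot-product part concrete, here is a minimal host-side sketch in plain C. It assumes x and y are ordinary float arrays of length n; the name `sq_dist` is hypothetical and not part of the code further down. It only illustrates that the difference has to be formed element by element before the products are summed.

```c
#include <stddef.h>

/* Squared Euclidean distance: the dot product of (x - y) with itself.
   Assumption: x and y are host arrays holding n floats each. */
float sq_dist(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = x[i] - y[i];  /* element-wise difference */
        sum += d * d;           /* accumulate (x_i - y_i)^2 */
    }
    return sum;
}
```

A CPU version like this can also serve as a reference value to compare a GPU result against.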

My professor told me that I'm not supposed to use any of the math functions; in other words, I need to recode everything. So I used a Taylor series to build the e function, and I'm also accounting for the error from truncating the series. I also programmed a factorial function, but I think that wasn't really necessary and I can probably get rid of it. I spent almost a day programming a dot product function before I found cublasSdot, which does exactly what I wanted. But now that I'm trying to put everything together, it doesn't even compile. I've been making changes to the code all day long and I can't get it to work. I know that my current code is more than wrong, but I'm just stuck, and I hope some of you can point me in the right direction. I'm not asking anyone to do the job for me, just to show me what I'm doing wrong.
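One way the series could be evaluated without a separate factorial function is to build each term from the previous one, since x^k/k! equals x^(k-1)/(k-1)! multiplied by x/k. The following is a minimal sketch in plain C under that assumption; the name `exp_taylor` and the `terms` parameter are hypothetical, and the number of terms needed for a given accuracy is left open.

```c
/* Truncated Taylor series for e^x:
   1 + x + x^2/2! + ... + x^(terms-1)/(terms-1)!
   Each term is the previous term times x/k, so no factorial is needed. */
float exp_taylor(float x, int terms)
{
    float term = 1.0f;  /* the k = 0 term */
    float sum = 1.0f;
    for (int k = 1; k < terms; ++k) {
        term *= x / (float)k;  /* x^k / k! from x^(k-1) / (k-1)! */
        sum += term;
    }
    return sum;
}
```

The truncation error grows with |x|, and for large negative x the alternating terms cancel badly in floating point. Since the exponent here is never positive when gamma is positive, one option is to evaluate the series at |x| and take the reciprocal.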

I've read most of the whitepapers in the SDK section and I've run all the examples, but when I try to implement a concept shown there, I simply can't.

This is my code:

[codebox]
#include <stdio.h>
#include "cublas.h"

long fact(int n)
{
    if(n==1)
        return 1;
    else
        return n*fact(n-1);
}

float exp(long* x)
{
    return 1+x+(x*x)/fact(2)+(x*x*x)/fact(3)+(x*x*x*x)/fact(4);
}

float gaussian(float* gamma, float* X, float* Y, float* N)
{
    return exp(-gamma*cublasSdot(N,X-Y,1,X-Y,1));
}

int main()
{
    cublasStatus status;

    // Kernel invocation with N threads
    int N=10;
    float dp;
    size_t size = N*sizeof(float);

    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);

    float *d_A;
    cudaMalloc((void**)&d_A,size);
    float *d_B;
    cudaMalloc((void**)&d_B,size);

    status = cublasInit();
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }

    //Invoke kernel
    dp = cublasSdot(N,d_A,1,d_B,1);
    printf("%f",dp);

    cudaMemcpy(d_A,h_A,size,cudaMemcpyHostToDevice);
    cudaMemcpy(d_B,h_B,size,cudaMemcpyHostToDevice);

    //Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
}
[/codebox]