CUDA programming question: I need to figure out where my mistakes are

Hello, my name is Alan and this is my first post. I'm currently working at my university, assisting a professor with his research. He gave me the task of programming a simple function for his program, but it has to be optimized for parallel computing. This is my first week of coding for this architecture. I have already read the programming guide and the best practices guide and taken a look at GPU Gems 3, but I'm still unable to make this function work.

It is a simple one, of the form:

gaussian(x, y) = exp(-gamma * dot(x - y, x - y))

where gamma is a constant that multiplies the dot product of (x - y) with itself.

My professor told me that I'm not supposed to use any of the math functions; in other words, I need to recode everything. So I used a Taylor series to build the exponential function, and I'm also taking the truncation error of the series into account. I also programmed a factorial function, but I think that wasn't really necessary and I can probably get rid of it. I spent almost a day programming a dot product function before I found cublasSdot, which does exactly what I wanted. But now that I'm trying to put everything together, it doesn't even compile. I've been changing the code all day long and I can't get it to work. I know that my current code is more than wrong, but I'm just stuck, and I hope some of you can point me in the right direction. I'm not asking anyone to do the job for me, just to show me what I'm doing wrong.
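As a point of comparison for the Taylor series idea, here is a minimal sketch of an exponential that needs neither a factorial nor a power function, because each term is obtained from the previous one. The name taylor_exp and the terms parameter are made up for this illustration and are not part of the code posted below. Negative arguments are handled by evaluating the series at |x| and taking the reciprocal, which is one common way to avoid cancellation between alternating terms.

```c
/* Approximates e^x with the first `terms` terms of the Taylor series
 * 1 + x + x^2/2! + x^3/3! + ...
 * Hypothetical helper for illustration only.
 * Each term is derived from the previous one: term_k = term_(k-1) * |x| / k,
 * so no factorial or power function is required. */
float taylor_exp(float x, int terms)
{
    float ax = (x < 0.0f) ? -x : x;  /* evaluate the series at |x| */
    float term = 1.0f;               /* k = 0 term */
    float sum = 1.0f;
    int k;

    for (k = 1; k < terms; ++k) {
        term *= ax / (float)k;
        sum += term;
    }

    /* e^(-a) = 1 / e^a, which avoids summing terms of alternating sign */
    return (x < 0.0f) ? 1.0f / sum : sum;
}
```

The number of terms needed grows with |x|, so a fixed small count is only reasonable when the argument is known to stay small.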

I've read most of the whitepapers in the SDK section and I've run all the examples, but when I try to implement a concept shown there, I simply can't.

This is my code:


#include <stdio.h>
#include "cublas.h"

long fact(int n)
{
    if (n <= 1)
        return 1;
    else
        return n*fact(n-1);
}

float exp(long* x)
{
    return 1+x+(x*x)/fact(2)+(x*x*x)/fact(3)+(x*x*x*x)/fact(4);
}

float gaussian(float* gamma, float* X, float* Y, float* N)
{
    return exp(-gamma*cublasSdot(N,X-Y,1,X-Y,1));
}

int main()
{
    cublasStatus status;

    // Kernel invocation with N threads
    int N=10;
    float dp;
    size_t size = N*sizeof(float);

    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);

    float *d_A;
    float *d_B;

    status = cublasInit();
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }

    //Invoke kernel
    dp = cublasSdot(N,d_A,1,d_B,1);

    //Free device memory
}

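You probably mixed device and host memory pointers in the gaussian function. You also mixed pointer arithmetic with vector operations: X-Y subtracts two pointers, it does not subtract the vectors element by element. And why is the type long used? In the main function you only compute the dot product, so why do you need the exponential?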
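To illustrate the pointer arithmetic point: in C, X-Y on two float pointers yields a pointer difference (an integer), not a vector, so it cannot be passed where cublasSdot expects a float pointer. A minimal host-side sketch of the intended computation follows. The names squared_distance and gaussian_exponent are hypothetical, the arrays are assumed to live in host memory, and gamma and n are passed by value rather than as pointers.

```c
/* Dot product of (x - y) with itself, computed element by element.
 * Hypothetical helper; x and y are assumed to be host arrays of length n. */
float squared_distance(const float *x, const float *y, int n)
{
    float sum = 0.0f;
    int i;

    for (i = 0; i < n; ++i) {
        float d = x[i] - y[i];  /* element-wise difference, not pointer subtraction */
        sum += d * d;
    }
    return sum;
}

/* Argument of the exponential in the Gaussian: -gamma * (x - y).(x - y).
 * gamma and n are plain scalars, so they are passed by value. */
float gaussian_exponent(float gamma, const float *x, const float *y, int n)
{
    return -gamma * squared_distance(x, y, n);
}
```

The returned value would then be handed to whatever exponential routine is in use. If the dot product is left to cublasSdot instead, the difference vector has to be built in device memory first, since that routine expects device pointers.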

It is better to separate the host and device code, e.g.:


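// Device code (compiled by nvcc)
__device__ void devF(...){
}

__global__ void kernel(...){
}

// Host-side wrapper that launches the kernel
extern "C" void kernel_call(dim3 dimGrid, dim3 dimBlock, ...){
	 kernel<<<dimGrid, dimBlock>>>(...);
}

// Host code: declare the wrapper, then call it
extern "C" void kernel_call(dim3 dimGrid, dim3 dimBlock, ...);

	   kernel_call(dimGrid, dimBlock,...);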