How to "wrap" function calls for running in CUDA ? Wrapping functions from ordinary C++...

How to "wrap" function calls for running in CUDA?

Could anyone show me how to "wrap" function calls so that they are executed on the GPU?

I have a C++ program that already runs fine, but I want to run some portions of it on the GPU. As you know, the functions that run as CUDA kernels on the GPU must be compiled by nvcc.

I want to compile only that portion with nvcc, and the rest with ordinary gcc.

I am using Linux, and I am a newbie. I would really appreciate an explanation of the steps to compile.

Thank you

PS: Could you provide some simple wrapper code and compile instructions?

I include a program here that does a kind of Fourier transformation by direct summation. The code runs both on the CPU and the GPU: the GPU (8600 GTS) is about 160x faster than the CPU (Athlon X2 5600+). Using Intel's Vector Math Library the CPU code can be made to run about 2x faster, but the CPU is still about 80x slower for sufficiently large problems. This is a real-world example from crystallography; it runs at about 46 GFlops on the GPU (counting 4 flops each for cos/exp/sin) and 2.8 GFlops on the CPU (56 flops each for cos/exp/sin). This is close to peak for 1 flop/cycle on both the CPU and the GPU. There are good reasons, btw, for not using FFTs for this.

GPU:

__global__ void gpuSUMAniso2
    (float *A, float *B,
     _F *H, _F *K, _F *L,
     _F x0, _F y0, _F z0, _F q0,
     _F b00, _F b01, _F b02, _F b03, _F b04, _F b05,
     _F x1, _F y1, _F z1, _F q1,
     _F b10, _F b11, _F b12, _F b13, _F b14, _F b15,
     _F *F0, _F *F1, _T size)
{
    float U0, U1, f0, f1, g0, g1;
    unsigned int i;
    /* Grid-stride loop: each thread handles every tsz-th element,
       so the kernel works for any size and launch configuration. */
    _T tid = blockDim.x * blockIdx.x + threadIdx.x;
    _T tsz = blockDim.x * gridDim.x;
    for (i = tid; i < size; i += tsz)
    {
        U0 = H[i] * x0 + K[i] * y0 + L[i] * z0;
        U1 = H[i] * x1 + K[i] * y1 + L[i] * z1;
        f0 = b00 * H[i] * H[i] + b01 * K[i] * K[i] + b02 * L[i] * L[i];
        g0 = b03 * H[i] * K[i] + b04 * H[i] * L[i] + b05 * K[i] * L[i];
        f1 = b10 * H[i] * H[i] + b11 * K[i] * K[i] + b12 * L[i] * L[i];
        g1 = b13 * H[i] * K[i] + b14 * H[i] * L[i] + b15 * K[i] * L[i];
        f0 = F0[i] * q0 * expf(-(f0 + 2.f * g0));
        f1 = F1[i] * q1 * expf(-(f1 + 2.f * g1));
        A[i] += (f0 * cosf(U0) + f1 * cosf(U1));
        B[i] += (f0 * sinf(U0) + f1 * sinf(U1));
    }
}
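
To get at the original question: the kernel above still has to be launched from host code, and that host code is the natural place to put the "wrapper". Below is a minimal sketch of such a wrapper. It is my own addition, not part of the measured program: the wrapper name is made up, the 128-block/256-thread launch configuration is an arbitrary example, and error checking is omitted for brevity. It goes in the same .cu file as the kernel (so nvcc sees the <<<>>> launch syntax and the _F/_T typedefs given at the end of this post) and is declared extern "C" so that gcc-compiled code can call it like any plain function:

extern "C"
void gpuSUMAniso2_wrapper
    (float *A, float *B,
     _F *H, _F *K, _F *L,
     _F x0, _F y0, _F z0, _F q0,
     _F b00, _F b01, _F b02, _F b03, _F b04, _F b05,
     _F x1, _F y1, _F z1, _F q1,
     _F b10, _F b11, _F b12, _F b13, _F b14, _F b15,
     _F *F0, _F *F1, _T size)
{
    const size_t bytes = size * sizeof(float);
    float *dA, *dB, *dH, *dK, *dL, *dF0, *dF1;

    /* Device copies of the seven arrays (error checking omitted). */
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dH, bytes);
    cudaMalloc((void **)&dK, bytes);
    cudaMalloc((void **)&dL, bytes);
    cudaMalloc((void **)&dF0, bytes);
    cudaMalloc((void **)&dF1, bytes);

    /* A and B are accumulated into, so they must be copied up as well. */
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dH, H, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dK, K, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dL, L, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dF0, F0, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dF1, F1, bytes, cudaMemcpyHostToDevice);

    /* 128 blocks of 256 threads is just an example; the grid-stride
       loop in the kernel copes with any problem size. */
    gpuSUMAniso2<<<128, 256>>>(dA, dB, dH, dK, dL,
                               x0, y0, z0, q0,
                               b00, b01, b02, b03, b04, b05,
                               x1, y1, z1, q1,
                               b10, b11, b12, b13, b14, b15,
                               dF0, dF1, size);

    /* Copy the results back; this cudaMemcpy also waits for the kernel. */
    cudaMemcpy(A, dA, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(B, dB, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dH); cudaFree(dK);
    cudaFree(dL); cudaFree(dF0); cudaFree(dF1);
}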

CPU:

void cpuSUMAniso2
    (float *A, float *B,
     _F *H, _F *K, _F *L,
     _F x0, _F y0, _F z0, _F q0,
     _F b00, _F b01, _F b02, _F b03, _F b04, _F b05,
     _F x1, _F y1, _F z1, _F q1,
     _F b10, _F b11, _F b12, _F b13, _F b14, _F b15,
     _F *F0, _F *F1, _T N)
{
    float U0, U1, f0, f1, g0, g1;
    unsigned int i;
    /* Same arithmetic as the kernel; a plain serial loop replaces
       the grid-stride loop. */
    for (i = 0; i < N; i++)
    {
        U0 = H[i] * x0 + K[i] * y0 + L[i] * z0;
        U1 = H[i] * x1 + K[i] * y1 + L[i] * z1;
        f0 = b00 * H[i] * H[i] + b01 * K[i] * K[i] + b02 * L[i] * L[i];
        g0 = b03 * H[i] * K[i] + b04 * H[i] * L[i] + b05 * K[i] * L[i];
        f1 = b10 * H[i] * H[i] + b11 * K[i] * K[i] + b12 * L[i] * L[i];
        g1 = b13 * H[i] * K[i] + b14 * H[i] * L[i] + b15 * K[i] * L[i];
        f0 = F0[i] * q0 * expf(-(f0 + 2.f * g0));
        f1 = F1[i] * q1 * expf(-(f1 + 2.f * g1));
        A[i] += (f0 * cosf(U0) + f1 * cosf(U1));
        B[i] += (f0 * sinf(U0) + f1 * sinf(U1));
    }
}

As you can see, the difference is minimal. _F, btw, is a typedef for const float, and _T for const unsigned int:

typedef const float _F;
typedef const unsigned int _T;
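
As for the compile question at the top of the thread: the usual pattern is to put the kernel and the wrapper in a .cu file and everything else in ordinary .cpp files. The C++ side only needs a declaration of the wrapper with the same signature (it can spell out const float / const unsigned int instead of repeating the _F/_T typedefs) and calls it like any other function. A sketch of the build, using made-up file names sumaniso.cu and main.cpp, and g++ rather than bare gcc since the host code is C++ (the -L path is the default 64-bit Linux install location; adjust it to wherever your CUDA toolkit lives):

nvcc -O2 -c sumaniso.cu
g++ -O2 -c main.cpp
g++ main.o sumaniso.o -o myprog -L/usr/local/cuda/lib64 -lcudart

nvcc compiles the device code itself and hands the host parts of the .cu file to the system compiler, so only the final link step needs to know about the CUDA runtime library.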