Interfacing CUDA with a C++ code

Ok, completely new to CUDA here, toyed with the examples a little, but now I need to put it in some code.

Basically, I have a C++ code that does the following…

for (j = 0; j < 1000000; j++) {
    x2[j]  = R11*x1[j] + R12*px1[j] + R16*dp11[j];
    px2[j] = R21*x1[j] + R22*px1[j] + R26*dp[j];
}

This is used to track the positions of particles in a particle accelerator. What I want to do is use CUDA to do this part for me. Am I better off writing the CUDA code within the project and using it from there, or is it possible to build a CUDA project into the solution as a DLL and call the function from there? For reference, I am using an nVidia 8600M GT, Windows XP 32-bit, and Visual C++ 2008 Express.

Walkthroughs and example code would be greatly appreciated. I almost have a *.dll working, but my experience with resolving library conflicts is very limited.



__global__ void big_calc(float *x2, float *px2, float *x1, float *px1,
                         float *dp11, float *dp,
                         float R11, float R21, float R12, float R22,
                         float R16, float R26, unsigned int num_iter)
{
    int j = threadIdx.x + __mul24(blockIdx.x, blockDim.x);
    if (j >= num_iter)
        return;

    float x1_val = x1[j];
    float px1_val = px1[j];

    x2[j]  = R11*x1_val + R12*px1_val + R16*dp11[j];
    px2[j] = R21*x1_val + R22*px1_val + R26*dp[j];
}

And you call it from your C(++) code as

big_calc<<<dim3((num_iter + 255) / 256, 1, 1), dim3(256, 1, 1)>>>(x2, px2, x1, px1, dp11, dp, R11, R21, R12, R22, R16, R26, num_iter);

Note the grid size is rounded up with integer arithmetic so the last partial block is covered; ceil(num_iter/256) with integer operands would truncate instead.
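Regarding the DLL question: a common pattern is to put the kernel and a plain extern "C" wrapper in a .cu file compiled by nvcc, so the rest of the C++ project (or a DLL client) only ever sees an ordinary function. A minimal sketch under that assumption, with the wrapper name run_big_calc invented for illustration and all error checking omitted:

```cuda
// big_calc.cu -- compiled by nvcc; run_big_calc is the only symbol
// the host C++ code needs to link against.
#include <cuda_runtime.h>

extern "C" void run_big_calc(float *h_x2, float *h_px2,
                             const float *h_x1, const float *h_px1,
                             const float *h_dp11, const float *h_dp,
                             float R11, float R21, float R12, float R22,
                             float R16, float R26, unsigned int n)
{
    size_t bytes = n * sizeof(float);
    float *d_x2, *d_px2, *d_x1, *d_px1, *d_dp11, *d_dp;

    // Allocate device buffers and copy the inputs over.
    cudaMalloc((void**)&d_x2,   bytes);
    cudaMalloc((void**)&d_px2,  bytes);
    cudaMalloc((void**)&d_x1,   bytes);
    cudaMalloc((void**)&d_px1,  bytes);
    cudaMalloc((void**)&d_dp11, bytes);
    cudaMalloc((void**)&d_dp,   bytes);

    cudaMemcpy(d_x1,   h_x1,   bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_px1,  h_px1,  bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_dp11, h_dp11, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_dp,   h_dp,   bytes, cudaMemcpyHostToDevice);

    // Round the grid size up so every element gets a thread.
    dim3 grid((n + 255) / 256), block(256);
    big_calc<<<grid, block>>>(d_x2, d_px2, d_x1, d_px1, d_dp11, d_dp,
                              R11, R21, R12, R22, R16, R26, n);

    // Copy the results back and release device memory.
    cudaMemcpy(h_x2,  d_x2,  bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_px2, d_px2, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x2); cudaFree(d_px2); cudaFree(d_x1);
    cudaFree(d_px1); cudaFree(d_dp11); cudaFree(d_dp);
}
```

If you export run_big_calc from a DLL built by nvcc, the Visual C++ project only needs the function declaration and the import library; no CUDA headers are required on the C++ side.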

It will be memory-latency limited, but still much faster than the C(++) version. I had a similar, heavily memory-bound problem that ran about 50 times faster on the GPU than the equivalent C code (without MMX/SSE/MKL).