Hi everyone,

I’m pretty new to CUDA, so please be gentle.

I’m benchmarking my OpenMP routine against my CUDA routine, and I’ve run into a problem: I’ve successfully multithreaded it with CUDA, but the performance stinks.

Here is the kernel:

```
__global__ void
EDCalc( float* genome, float* edspacing, float* rhoarray, float2* dnk,
        float* distarray, int totalpts, int refllayers, float roughness,
        float rho, int ptsperthread )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < ptsperthread)
    {
        float temp = 0.0f;
        float dist = 0.0f;
        // Thread 0 fills rhoarray before the other threads read it.
        if (index == 0)
        {
            for (int i = 0; i < refllayers; i++)
            {
                rhoarray[i] = (genome[i+1] - genome[i]) * rho / 2.0f;
            }
        }
        __syncthreads();
        for (int k = 0; k < refllayers; k++)
        {
            dist = (edspacing[index] - distarray[k]) * roughness;
            if (dist > 6.0f)
                temp += rhoarray[k] * 2.0f;
            else if (dist > -6.0f)
                temp += rhoarray[k] * (1.0f + erff(dist));
            dnk[index].x = temp;  // global store on every iteration
        }
    }
}
```

Compared against my CPU version, which takes ~160 microseconds, the CUDA version takes ~940 microseconds. Unless I comment out `dnk[index].x = temp;` — then all of the math takes 48 microseconds (a nice improvement), but I’m not storing anything, and I need the result stored for the next step in the calculation. Any ideas how to speed this up? I’m allocating that memory with
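To clarify what I mean about the store: since only the final accumulated value of `temp` is needed per point, I believe the write could be hoisted out of the `k` loop, something like this (same identifiers as the kernel above; untested):

```
// Sketch: accumulate in a register, store once per thread.
float temp = 0.0f;
for (int k = 0; k < refllayers; k++)
{
    float dist = (edspacing[index] - distarray[k]) * roughness;
    if (dist > 6.0f)
        temp += rhoarray[k] * 2.0f;
    else if (dist > -6.0f)
        temp += rhoarray[k] * (1.0f + erff(dist));
}
dnk[index].x = temp;  // one global store instead of refllayers stores
```

That would turn `refllayers` global writes per thread into one, but I don’t know if that alone explains the slowdown.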

```
CUDA_SAFE_CALL(cudaMalloc((void**) &cudank, npts*sizeof(float2)));
```

and it does not get copied back to the CPU side. I’ve attached the project in case that helps.
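In case the measurement method matters, here is a sketch of the kind of event-based timing I mean around the launch (names like `npts` and `cudank` match the allocation above; the block size of 256 is illustrative, not necessarily what’s in the attached project):

```
// Sketch of event-based kernel timing.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

int threads = 256;
int blocks  = (npts + threads - 1) / threads;

cudaEventRecord(start, 0);
EDCalc<<<blocks, threads>>>(genome, edspacing, rhoarray, cudank,
                            distarray, npts, refllayers, roughness,
                            rho, npts);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // launches are async; wait before reading the time

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
```

Without the synchronize, the launch returns immediately and the measured time can be meaningless.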

I know this isn’t optimized as well as it could be, but I just want to get the proof of concept working. Thanks for any help.

CUDAtest.rar (1.63 MB)