Hi,

i’m actually working on parallelization of a small CPU+host code which implies 2 big arrays:

```
#define NPTS1 10000
#define NPTS2 10000
typedef struct pt3D {
float x;
float y;
float z;
} pt3D;
pt3D cloud1[NPTS1];
pt3D cloud2[NPTS2];
```

For the moment these arrays are initialized with random values

and we want to get the sum of the euclidean distances (without the sqrt) of each single cloud1 point regarding all cloud2

```
for(i=0; i<NPTS1; i++) {
for(j=0; j<NPTS2; j++) {
sumCPU += (cloud1[i].x - cloud2[j].x) * (cloud1[i].x - cloud2[j].x) +
(cloud1[i].y - cloud2[j].y) * (cloud1[i].y - cloud2[j].y) +
(cloud1[i].z - cloud2[j].z) * (cloud1[i].z - cloud2[j].z);
}
}
```

sumCPU contains our final wanted result. We work in single precision float.

I’m facing troubles coding cuda, i decomposed the problem in 2 steps:

1/ One kernel will do the calculus of all the distances and store the result in a big array sum_d of size NPTS1*NPTS2*sizeof(float) in global memory (for the moment i don’t look for anykind of optimization i just want to have something accurate)

```
__global__ void calc_dist(pt3D * c1, pt3D * c2, float* sum_d)
{
unsigned int i=blockDim.x*blockIdx.x + threadIdx.x;
unsigned int j=blockDim.y*blockIdx.y + threadIdx.y;
if(i < NPTS1 and j < NPTS2)
{
sum_d[i*NPTS2 + j] = (c1[i].x-c2[j].x)*(c1[i].x-c2[j].x) +
(c1[i].y-c2[j].y)*(c1[i].y-c2[j].y) +
(c1[i].z-c2[j].z)*(c1[i].z-c2[j].z);
}
}
```

on a grid with blocksPerGrid(NPTS1/16+1,NPTS2/16+1) and threadsPerBlock(16,16) parameters

2/ i use cublas library on the sum_d array to get the actual sum result in an pointer float sumGPUp

```
status = cublasSasum(handle,NPTS1*NPTS2,sum_d,1,sumGPUp);
```

first problem:

regarding values of NPTS1 and NPTS2 i get problems with final results, at the moment i tested the code for NPTS1 and NPTS2 multiple of 10,

if NPTS1<1000 and NPTS2 <10000 there is no big troubles

```
GPU TIME : 0.010257 seconds
CPU TIME : 0.020978 seconds
GAIN : 2.045197
relative error CPU/GPU: 1.247682 %
STOP GPU: 4.548740e+15
STOP CPU: 4.492685e+15
```

but if i take NPTS1=NPTS2=10000:

```
GPU TIME : 0.061120 seconds
CPU TIME : 0.205745 seconds
GAIN : 3.366221
relative error CPU/GPU: <b>96.780182 %</b>
STOP GPU: 4.501012e+16
STOP CPU: 2.287330e+16
```

The final goal is to use NPTS1=NPTS2=1000 000 !

second problem:

facing troubles to use cublasSasum function on a NPTS1*NPTS2= 1 000 000 000 000 floats size, i don’t know what is the limit to use this cublas function.

Thanks a lot for you precious help and feedback,

Pascal