I have a question about using shared memory arrays in CUDA.

I have 3 int arrays in global memory: A B and C and their length is 10000.

I have 3 more int arrays in global memory: A_centroids, B_centroids, C_centroids and their lenght is 10.

You can think A[i], B[i], C[i] as a Point in a R3 space . Same thing for A_Centroids[i], B_Centroids[i], C_Centroids[i].

The aim is to compute first the euclidean distance between each element of A, B, C and each element of A_centroids, B_centroids, C_centroids.

And then find the minumum distance between each point A[i] B[i] C[i] and all the elements of A_Centroids, B_Centroids, C_Centroids and save in labelArray (in global memory) the index of the centroid nearest to A[i] B[i] C[i].

For example if A_Centroids[2], B_Centroids[2], C_Centroids[2] is the nearest point to

A[1] B[1] C[1], I will save : labelArray[1] = 2 in global memory.

GRID DIM:

dim3 dimGRID(128,128);

dim3 dimBLOCK(16,16);

I use a 2D grid: 128x128 blocks

and a 2D block: 16x16 threads (so 256 threads in a block)

I have already implemented this algorithm using global memory and it works, but it’s too slow. I’d like to use shared memory.

My question is: how can I properly use different thread index’s linearizations to perform the algorithm ?

Since there are only 256 threads in a block I can load only 256 elements of A, B, C in shared variables that I call:

**shared** int A_shared[256*sizeof(int)];
shared int B_shared[256*sizeof(int)];

**shared**int C_shared[256*sizeof(int)];

A global thread index threadID could be:

int blockID = blockIdx.x + blockIdx.y * gridDim.x;

int threadID = blockID * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;

And a local one?

int localThreadIndex = threadIdx.x + threadIdx.y*blockDim.x

First block should take the first 256 elements of A, B, C from 0 to 255

the second block should take the elements from 256 to 511 and so on.

here there’s my example but it doesn’t work:

```
__global__ void minDistance(int *A,int *B,int *C,int *A_Centroids,int *B_Centroids,int *C_Centroids,int *labelArray) {
// global thread Index for a thread in a 2D Block in a 2D Grid
// from 0 to 9999
int threadID = (threadIdx.x + blockIdx.x * blockDim.x) + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;
// local thread Index in a 2D Block
// from 0 to 255
int localThreadIndex = threadIdx.x + threadIdx.y*blockDim.x;
// initial min value
double min = 500.0;
// current distance value
double currentValue = 0.0;
// index of the nearest centroid
int index = 0;
//init Shared variable arrays each 256 elements long
__shared__ int A_shared[256 * sizeof(int)];
__shared__ int B_shared[256 * sizeof(int)];
__shared__ int C_shared[256 * sizeof(int)];
//init Shared Centroid arrays each 10 elements long
__shared__ int A_Centroids_shared[10 * sizeof(int)];
__shared__ int B_Centroids_shared[10 * sizeof(int)];
__shared__ int C_Centroids_shared[10 * sizeof(int)];
// here I'm copying 256 elements from global memory to shared memory
A_shared[localThreadIndex] = A[threadID];
B_shared[localThreadIndex] = B[threadID];
C_shared[localThreadIndex] = C[threadID];
// Centroid shared arrays are only 10 elements long
// so I can use only 10 threads
if(localThreadIndex < 10){
A_Centroids_shared[localThreadIndex] = A_Centroids[threadID];
B_Centroids_shared[localThreadIndex] = B_Centroids[threadID];
C_Centroids_shared[localThreadIndex] = C_Centroids[threadID];
}
__syncthreads();
// the triple A_shared[i] B_shared[i], C_shared[i] must compute the distance from
// each triple of A_Centroids_shared, B_Centroids_shared, C_Centroids_shared
for(int i = 0; i < nCentroids; i++) {
// current distance
currentValue = sqrt(pow((A_shared[threadInBlock]-A_Centroids_shared[i]),2.0) + pow((B_shared[threadInBlock]-B_Centroids_shared[i]),2.0) + pow((C_shared[threadInBlock]-C_Centroids_shared[i]),2.0));
//check if this value is < min
//if yes I update min
if(currentValue < min){
min = value;
index = i;
}
}// end for
// Saving in global memory the index of the nearest centroid to my current triple A_shared, B_shared, C_shared
// Note that I'm using a global thread index
labelArray[threadID] = index;
__syncthreads();
}
```

Thank you so much