Local block thread index and global thread index using __shared__ variables?

I have a question about using shared memory arrays in CUDA.

I have 3 int arrays in global memory: A, B and C, each of length 10000.
I have 3 more int arrays in global memory: A_Centroids, B_Centroids and C_Centroids, each of length 10.

You can think of (A[i], B[i], C[i]) as a point in R3 space. The same goes for (A_Centroids[i], B_Centroids[i], C_Centroids[i]).

The aim is to first compute the Euclidean distance between each point of A, B, C and each point of A_Centroids, B_Centroids, C_Centroids.

Then find the minimum distance between each point (A[i], B[i], C[i]) and all the centroids (A_Centroids, B_Centroids, C_Centroids), and save in labelArray (in global memory) the index of the centroid nearest to (A[i], B[i], C[i]).

For example, if (A_Centroids[2], B_Centroids[2], C_Centroids[2]) is the nearest point to (A[1], B[1], C[1]), I will save labelArray[1] = 2 in global memory.
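For reference, here is a minimal CPU sketch of the same labeling step (plain C++; the function name minDistanceCPU and the use of std::vector are my own choices, not from the original code):

```cpp
#include <cmath>
#include <vector>

// For each point (A[i], B[i], C[i]), find the index of the nearest
// centroid (A_Centroids[j], B_Centroids[j], C_Centroids[j]) and store
// it in labelArray[i]. This mirrors what the CUDA kernel computes.
void minDistanceCPU(const std::vector<int>& A, const std::vector<int>& B,
                    const std::vector<int>& C,
                    const std::vector<int>& A_Centroids,
                    const std::vector<int>& B_Centroids,
                    const std::vector<int>& C_Centroids,
                    std::vector<int>& labelArray) {
    for (size_t i = 0; i < A.size(); ++i) {
        double best = 1e300;   // larger than any possible distance
        int bestIdx = 0;
        for (size_t j = 0; j < A_Centroids.size(); ++j) {
            double dx = A[i] - A_Centroids[j];
            double dy = B[i] - B_Centroids[j];
            double dz = C[i] - C_Centroids[j];
            double d = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (d < best) { best = d; bestIdx = (int)j; }
        }
        labelArray[i] = bestIdx;
    }
}
```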

Grid and block dimensions:

dim3 dimGRID(128,128);
dim3 dimBLOCK(16,16);

I use a 2D grid: 128x128 blocks
and a 2D block: 16x16 threads (so 256 threads in a block)

I have already implemented this algorithm using global memory and it works, but it’s too slow. I’d like to use shared memory.

My question is: how can I properly use the different thread-index linearizations to implement the algorithm?

Since there are only 256 threads in a block, I can load only 256 elements of A, B, C into shared arrays, which I declare as:

__shared__ int A_shared[256];
__shared__ int B_shared[256];
__shared__ int C_shared[256];

(Note: the array size is given in elements, not bytes, so [256 * sizeof(int)] would over-allocate.)

A global thread index threadID could be:
int blockID = blockIdx.x + blockIdx.y * gridDim.x;
int threadID = blockID * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;

And a local one?
int localThreadIndex = threadIdx.x + threadIdx.y * blockDim.x;

The first block should take the first 256 elements of A, B, C (indices 0 to 255), the second block the elements from 256 to 511, and so on.

Here is my example, but it doesn’t work:

__global__ void minDistance(int *A,int *B,int *C,int *A_Centroids,int *B_Centroids,int *C_Centroids,int *labelArray) {

	// global thread index for a thread in a 2D block in a 2D grid
	// NOTE: the 128x128 grid of 16x16 blocks launches far more than
	// 10000 threads, so out-of-range threads must be guarded below
	int threadID = (threadIdx.x + blockIdx.x * blockDim.x) + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;
	
	// local thread Index in a 2D Block
	// from 0 to 255
	int localThreadIndex = threadIdx.x + threadIdx.y*blockDim.x;

	// initial min value (must exceed the largest possible distance;
	// 500.0 is enough for 8-bit pixel data, where sqrt(3*255^2) < 442)
	double min = 500.0;
	// current distance value
	double currentValue = 0.0;
	// index of the nearest centroid
	int index = 0;

	// shared arrays, each 256 elements long (size is in elements, not bytes)
	__shared__ int A_shared[256];
	__shared__ int B_shared[256];
	__shared__ int C_shared[256];
	// shared centroid arrays, each 10 elements long
	__shared__ int A_Centroids_shared[10];
	__shared__ int B_Centroids_shared[10];
	__shared__ int C_Centroids_shared[10];

	// copy this block's 256 elements from global to shared memory,
	// guarding against threads that fall past the end of the data
	if (threadID < 10000) {
		A_shared[localThreadIndex] = A[threadID];
		B_shared[localThreadIndex] = B[threadID];
		C_shared[localThreadIndex] = C[threadID];
	}

	// the centroid shared arrays are only 10 elements long,
	// so only the first 10 threads of the block load them
	// (note: indexed with localThreadIndex, not threadID)
	if (localThreadIndex < 10) {
		A_Centroids_shared[localThreadIndex] = A_Centroids[localThreadIndex];
		B_Centroids_shared[localThreadIndex] = B_Centroids[localThreadIndex];
		C_Centroids_shared[localThreadIndex] = C_Centroids[localThreadIndex];
	}

    __syncthreads();

	if (threadID < 10000) {
		// each point (A_shared[i], B_shared[i], C_shared[i]) computes
		// its distance to every centroid in the shared centroid arrays
		for (int i = 0; i < 10; i++) {   // 10 centroids

			// current distance
			double dA = A_shared[localThreadIndex] - A_Centroids_shared[i];
			double dB = B_shared[localThreadIndex] - B_Centroids_shared[i];
			double dC = C_shared[localThreadIndex] - C_Centroids_shared[i];
			currentValue = sqrt(dA * dA + dB * dB + dC * dC);

			// if this value is smaller than min, update min and index
			if (currentValue < min) {
				min = currentValue;
				index = i;
			}
		} // end for

		// save in global memory the index of the centroid nearest to
		// my current triple (note that I'm using the global thread index)
		labelArray[threadID] = index;
	}

    __syncthreads();

}

Thank you so much

Let me first repeat how I understand what you want to do:
You have 10000 points in 3D and you want to compute, for each of them, the distance to the 10 centroids. As a result you get an array of size 10000 that holds the labels of the nearest centroids.

I have two suggestions to optimize your code (without using shared memory):
a) If I understood correctly, each of the 10000 points is read only once (i.e., one thread reads one 3D point). Then there is no need to put them in shared memory; read them directly from global memory!
b) In each thread you loop over all centroids, so all threads of a warp access the same element at each step. Accessing the same element from every thread is a perfect fit for constant memory, which broadcasts a single value to the whole warp. I recommend putting the centroids in constant memory.
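A minimal sketch of this suggestion (my own naming: minDistanceConst, A_Centroids_c, etc.; the centroid count is hard-coded to 10, and since sqrt is monotonic I compare squared distances, which gives the same nearest index):

```cuda
// 10 centroids in constant memory: every thread of a warp reads the
// same element in the inner loop, which constant memory broadcasts.
#define N_CENTROIDS 10

__constant__ int A_Centroids_c[N_CENTROIDS];
__constant__ int B_Centroids_c[N_CENTROIDS];
__constant__ int C_Centroids_c[N_CENTROIDS];

__global__ void minDistanceConst(const int *A, const int *B, const int *C,
                                 int *labelArray, int nPoints)
{
    int threadID = (threadIdx.x + blockIdx.x * blockDim.x)
                 + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;
    if (threadID >= nPoints) return;   // guard the oversized grid

    // each point is read exactly once, straight from global memory
    int a = A[threadID], b = B[threadID], c = C[threadID];

    double min = 1.0e300;
    int index = 0;
    for (int i = 0; i < N_CENTROIDS; i++) {
        double dA = a - A_Centroids_c[i];
        double dB = b - B_Centroids_c[i];
        double dC = c - C_Centroids_c[i];
        double d = dA * dA + dB * dB + dC * dC;  // sqrt not needed for argmin
        if (d < min) { min = d; index = i; }
    }
    labelArray[threadID] = index;
}

// host side, before the launch:
//   cudaMemcpyToSymbol(A_Centroids_c, h_A_Centroids, N_CENTROIDS * sizeof(int));
//   (and likewise for B and C)
```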

Hi,
thank you for your answer!
Yes, you understood correctly.
With the Nvidia Visual Profiler I noticed that minDistance (the first step of the k-means algorithm) is very slow, and I was wondering how to speed up this step.
a) Yes, I read the arrays only once; what a stupid thing I have done!
b) Thank you, I had the same idea, but the problem is that those centroids must be updated in another step, so constant memory isn’t suitable: it’s read-only. I could put A, B, C in constant memory, since those points are fixed data.

One last thing…
In the for loop I’ve put an if () to check whether currentValue is less than min. This solution probably leads to thread divergence.
Do you think it would be better to save all the distances in an array somewhere and then sort the array to find the smallest distance (and its index somehow)?

Thank you again

Are the centroids updated during the minDistance() kernel run, or in another step outside it? Constant memory does not mean that it is constant for the whole runtime of your program, only that it is constant while a kernel runs. Thus you can run minDistance(), then update the centroids in constant memory, and then run the kernel again.
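A sketch of that host-side pattern (all names here are hypothetical placeholders: A_Centroids_c is a __constant__ array, minDistanceConst a labeling kernel, and updateCentroids stands for whatever recomputes the centroids from the new labels):

```cuda
// k-means outer loop: constant memory can be rewritten between launches,
// because it only has to stay constant for the duration of one kernel.
for (int iter = 0; iter < maxIters; iter++) {
    // upload the current centroids into constant memory
    cudaMemcpyToSymbol(A_Centroids_c, h_A_Centroids, 10 * sizeof(int));
    cudaMemcpyToSymbol(B_Centroids_c, h_B_Centroids, 10 * sizeof(int));
    cudaMemcpyToSymbol(C_Centroids_c, h_C_Centroids, 10 * sizeof(int));

    // label every point with its nearest centroid
    minDistanceConst<<<dimGRID, dimBLOCK>>>(d_A, d_B, d_C, d_labels, 10000);
    cudaDeviceSynchronize();

    // recompute the centroids (on host or device) for the next iteration
    updateCentroids(/* ... */);
}
```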

Concerning your last question: at some point you always need an if branch to pick the minimum distance. In your given code the divergent branch is minimal (only one instruction) and won’t affect performance.

Cool! I had misinterpreted the meaning of read-only.

As you realized, in minDistance I don’t write to the centroid arrays. But in the next step I count the pixels belonging to each centroid and sum the A, B, C values for each centroid, then divide (sumOfAinCentroid[i] / count[i]) to get the new A value for centroid i (the same for B and C), and write the new values.

So I can seriously consider keeping the centroids in constant memory… and maybe putting A, B, C in texture memory?
Texture memory limits are 1D = 8192, 2D = (65536, 32768), 3D = (2048, 2048, 2048).
I would probably have to reshape my A, B, C arrays to fit into a 3D texture.
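One note on those limits: the small 1D bound applies to textures backed by a cudaArray; a texture object bound to plain linear memory has a much larger 1D limit (device property maxTexture1DLinear, typically 2^27 elements), so 10000 ints fit in a 1D texture without any reshaping. A sketch using the texture object API (makeIntTexture is my own helper name; error checking omitted):

```cuda
// Bind an existing device buffer (e.g. d_A with 10000 ints) to a 1D
// texture object backed by linear memory; no 2D/3D reshaping needed.
cudaTextureObject_t makeIntTexture(int *devPtr, size_t nElems)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = devPtr;
    resDesc.res.linear.desc = cudaCreateChannelDesc<int>();
    resDesc.res.linear.sizeInBytes = nElems * sizeof(int);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

// in the kernel, read through the texture cache:
//   int a = tex1Dfetch<int>(texA, threadID);
```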

Thank you again

PS
Is there a way to declare

__constant__ int A_Centroids[value]

where value is only known at runtime?
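(For reference: __constant__ storage is allocated at compile time, so a runtime size is not possible. The usual workaround is to declare an upper bound and copy only the portion actually used; a sketch, with an assumed bound of 64:)

```cuda
// __constant__ arrays need a compile-time size: declare a maximum
// and pass the actual count as a kernel argument.
#define MAX_CENTROIDS 64          // assumed upper bound

__constant__ int A_Centroids_c[MAX_CENTROIDS];

void uploadCentroids(const int *h_A_Centroids, int nCentroids)
{
    // copy only the first nCentroids elements; the rest stay unused
    cudaMemcpyToSymbol(A_Centroids_c, h_A_Centroids,
                       nCentroids * sizeof(int));
}
```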