I’m writing a program with CUDA and the problem is the following:
-two matrices A (m * 128) and B (n * 128)
-I take the first row of A and compute the distance between that vector and every row of B, one by one.
-I write the result of each distance into a row of a matrix C, so the element C(i,j) contains the distance between row i of A and row j of B.
-Then I proceed with the next row of A.
With other functions I then find the minimum distance, but the problem is that this kernel takes too long.
I've implemented it this way:
I've got a grid of m * (n / 1024) blocks
and 1024 threads per block.
(I've put in all the necessary bounds guards, don't worry)
(the matrices are already in global memory)
and I launch the following kernel:
__global__ void distance_kernel(float* d_C, float* d_A, float* d_B, int m, int n){
    __shared__ float common_row[128];
    // all the threads in a block work on the same row of A
    int bx = blockIdx.x;                // row of A handled by this block
    int by = blockIdx.y;                // which chunk of 1024 rows of B this block covers
    int tx = threadIdx.x;
    const int NUM_COL = 128;            // A and B have the same number of columns
    // load the row of A into shared memory
    if (tx < NUM_COL) {
        common_row[tx] = d_A[bx * NUM_COL + tx];
    }
    __syncthreads();
    int col = by * blockDim.x + tx;     // row of B handled by this thread
    if (col < n) {
        int out = bx * n + col;         // element C(bx, col)
        d_C[out] = 0.0f;
        // accumulate the squared differences, writing into global memory each time
        for (int i = 0; i < NUM_COL; i++) {
            float diff = common_row[i] - d_B[col * NUM_COL + i];
            d_C[out] += diff * diff;
        }
        d_C[out] = sqrtf(d_C[out]);
    }
}
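For reference, I launch it like this (rounding the number of blocks up so the last rows of B are covered too; the guard in the kernel handles the extra threads):

    dim3 block(1024);
    dim3 grid(m, (n + 1023) / 1024);   // one x-slot per row of A, enough y-slots for all rows of B
    distance_kernel<<<grid, block>>>(d_C, d_A, d_B, m, n);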
This kernel takes 4 seconds with matrices of size 7000 * 128 and 4000 * 128.
Yeah, this is one issue; a “C_shared” would do the trick.
Another important point: if you aren't really looking for the Euclidean distance itself, but just for the minimum of it, why compute it fully? Take the square root out. It takes about 8 times longer to compute than an add/multiply, and since the square root doesn't change the ordering, it isn't necessary.
Use a register variable instead. There is no need to use shared memory for the accumulation; it is just a waste of resources and will be slower.
Yes. Load d_B into shared memory in a similar way to what is already being done for d_A, then have the distance calculation load only from shared memory.
Yes, but it is inevitable. By doing as I suggest, you reduce the number of writes per distance from 130 to 1, which will yield very large performance improvements. You can coalesce the writes as well, but that is probably not going to give as big a benefit as fixing some of the wasteful global memory accesses in the original code.
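To make that concrete, here is a rough sketch of what I mean with those changes combined: the row of A and a tile of rows of B staged in shared memory, the accumulator kept in a register, the square root dropped, and a single coalesced write to C per distance. The tile size, block shape, and the padding of the shared array are my own illustrative choices, not something from the original code:

    #define NUM_COL 128
    #define TILE     64   // rows of B per block (chosen so the tile fits in shared memory)

    // Sketch only: one block per (row of A, tile of rows of B); blockDim.x == TILE.
    __global__ void distance_kernel_v2(const float* d_A, const float* d_B,
                                       float* d_C, int m, int n)
    {
        __shared__ float row_A[NUM_COL];
        __shared__ float tile_B[TILE][NUM_COL + 1];   // +1 padding avoids bank conflicts

        int row   = blockIdx.x;           // row of A for this block
        int first = blockIdx.y * TILE;    // first row of B in this tile
        int tx    = threadIdx.x;          // 0 .. TILE-1

        // load the row of A (two coalesced loads, since NUM_COL == 2 * TILE)
        row_A[tx]        = d_A[row * NUM_COL + tx];
        row_A[tx + TILE] = d_A[row * NUM_COL + tx + TILE];

        // cooperatively load TILE rows of B, coalesced along the columns
        for (int r = 0; r < TILE; ++r) {
            int b_row = first + r;
            if (b_row < n) {
                tile_B[r][tx]        = d_B[b_row * NUM_COL + tx];
                tile_B[r][tx + TILE] = d_B[b_row * NUM_COL + tx + TILE];
            }
        }
        __syncthreads();

        // each thread computes one squared distance, accumulating in a register
        int col = first + tx;
        if (col < n) {
            float acc = 0.0f;
            for (int i = 0; i < NUM_COL; ++i) {
                float diff = row_A[i] - tile_B[tx][i];
                acc += diff * diff;
            }
            // one write per distance; the square root is omitted because it does
            // not change which distance is the minimum
            d_C[row * n + col] = acc;
        }
    }

    // launch: distance_kernel_v2<<<dim3(m, (n + TILE - 1) / TILE), TILE>>>(d_A, d_B, d_C, m, n);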
I can tell you that you should be able to do it in less than 0.4 s. See the technique in this paper about speeding up Vector Quantization (to be published February 2011); internally, it uses the Euclidean distance. The distance can be rewritten to use matrix multiplication, and then you can use cuBLAS and get a big gain. I hope to post a stand-alone example of this on my website soon.
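To sketch the idea (this is my own minimal version, not the code from the paper): since ||a_i - b_j||^2 = ||a_i||^2 + ||b_j||^2 - 2 * a_i . b_j, the m x n matrix of dot products comes from a single cublasSgemm call, and the squared row norms (assumed to be precomputed by a small kernel, not shown) are added afterwards. The row-major/column-major bookkeeping is the only fiddly part:

    #include <cublas_v2.h>

    // Adds the squared row norms to the -2*A*B^T term already in d_D.
    __global__ void add_norms(float* d_D, const float* d_normA,
                              const float* d_normB, int m, int n)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column (row of B)
        int i = blockIdx.y;                              // row of A
        if (i < m && j < n)
            d_D[i * n + j] += d_normA[i] + d_normB[j];
    }

    // d_A is m x k row-major, d_B is n x k row-major (k == 128 here),
    // d_normA/d_normB hold the squared row norms, d_D receives the m x n
    // matrix of squared distances in row-major order.
    void squared_distances(cublasHandle_t handle,
                           const float* d_A, const float* d_B,
                           const float* d_normA, const float* d_normB,
                           float* d_D, int m, int n, int k)
    {
        const float alpha = -2.0f, beta = 0.0f;
        // cuBLAS is column-major, so the row-major m x n product -2*A*B^T is
        // computed as the column-major n x m product -2*B*A^T.
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    n, m, k,
                    &alpha,
                    d_B, k,     // B row-major n x k == column-major k x n, so op = T
                    d_A, k,     // A row-major m x k == column-major k x m (A^T)
                    &beta,
                    d_D, n);    // column-major n x m == row-major m x n

        dim3 block(256), grid((n + 255) / 256, m);
        add_norms<<<grid, block>>>(d_D, d_normA, d_normB, m, n);
    }

The minimum search then runs over the squared distances, so the square root never has to be taken at all.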