Hi,
I have to implement the K-Nearest Neighbor (KNN) algorithm in CUDA.
Right now I have a simple CUDA implementation where I compute all the distances and keep only the k-th one.
This code works, but I know there is a more elaborate and faster approach using a kd-tree.
Does anyone have a KNN or kd-tree implementation in CUDA?
Thanks for your help!
Vince
Building a kd-tree on the GPU would be pretty tough, since it is a recursive and mostly serial operation. The only way I can see to exploit any parallelism in building the tree would be to have each kernel launch handle one level of the tree: each block in the grid reads in its assigned leaf, splits it, and writes out the two new leaves. I’m not sure of the most efficient way to do the split, but the last time I wrote a kd-tree, I recall there is a way where you only need to sort by x, y, z once at the beginning… or maybe I remember incorrectly.
The recursion would be done by calling this kernel repeatedly with more blocks each time. The challenges are 1) coming up with a data structure for the tree that can be read/written in a coalesced way, and 2) coding the kernel to handle the widely varying number of particles per leaf: at the top, one block needs to handle thousands, while at the bottom there are only a few.
There’s a lot going on there, and a lot of wasted memory reads/writes, so I’m not confident that building a kd-tree on the GPU would be any faster than an optimized CPU version.
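For concreteness, here is a minimal CPU sketch of that level-by-level scheme (all names are mine, and the per-node loop stands in for one block per leaf; a real GPU version would replace the qsort with a parallel sort or selection, and the non-thread-safe global comparator state would obviously have to go):

```c
#include <stdlib.h>

#define DIM 2

typedef struct { float p[DIM]; } Point;

/* Axis used by the comparator; a global only because qsort takes no context.
   This is fine for a single-threaded sketch, not for real code. */
static int g_axis;

static int cmp_axis(const void *a, const void *b) {
    float fa = ((const Point *)a)->p[g_axis];
    float fb = ((const Point *)b)->p[g_axis];
    return (fa > fb) - (fa < fb);
}

/* One "kernel launch": split every node of the current level at its median.
   start[i]/end[i] delimit node i's point range; the two children of node i
   are written to slots 2*i and 2*i+1 of the output arrays. */
void split_level(Point *pts, const int *start, const int *end, int n_nodes,
                 int axis, int *child_start, int *child_end) {
    g_axis = axis;
    for (int i = 0; i < n_nodes; ++i) {          /* one "block" per node */
        int lo = start[i], hi = end[i];
        qsort(pts + lo, (size_t)(hi - lo), sizeof(Point), cmp_axis);
        int mid = lo + (hi - lo) / 2;
        child_start[2*i]   = lo;  child_end[2*i]   = mid;  /* left child  */
        child_start[2*i+1] = mid; child_end[2*i+1] = hi;   /* right child */
    }
}
```

The host would call this once per level with twice as many nodes each time, cycling `axis`, which mirrors the "repeated kernel launches with more blocks" idea above.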
I am currently working on building a (left-)balanced kd-tree on the GPU.
A combination of GPU and CPU appears to work best for the tree build: I sort the particles at every step on the GPU (via the RadixSort or BitonicSort examples from the CUDA SDK) and do the rest on the CPU. This way you offload the sorting to the GPU, where it is orders of magnitude faster than on the CPU.
The tree walk, however, can be done completely on the GPU. The reads are random but localised, so in principle shared memory cannot be used. Via textures, I achieve a bandwidth of 30 GB/s.
Still experimenting with this.
Do you need to build the kd-tree on the GPU? I mean, if it’s (semi-)static you could build it on the CPU and upload it. If you are doing some kind of particle simulation, then of course you need to rebuild it all the time.
Well, if the tree build becomes a bottleneck (and construction of a balanced tree easily might, since everything else is fast), it is definitely worth offloading parts of the tree construction to the GPU just to speed it up. Otherwise, I agree, there is not much practical sense in doing this.
Very interesting.
I like the approach of using a parallel sort. Nice idea!
I know the theory of kd-tree construction, but I’m not sure I understand how to do it in CUDA.
I mean, what kind of structure do you use? An array? I’ve seen some C implementations using structs, but I think it’s not possible to copy such a tree into GPU global memory…
A structure will do the job, and yes, you can use it in CUDA. The problem is that it apparently cannot be read via tex1Dfetch, so you lose the advantage of caching.
However, you may want to store the tree structure in an array of int4: (split_dimension, left, right, __float_as_int(split_value)). That way you can use tex1Dfetch to fetch a node. Assuming you have data locality in both space and memory, you can reach pretty high bandwidth; I get about 30 GB/s this way, but there is still room for improvement.
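To make that layout concrete, here is a small host-side C sketch of the int4 node encoding (the names are mine; on the device you would use the __float_as_int/__int_as_float intrinsics instead of memcpy, and fetch the whole node with a single tex1Dfetch):

```c
#include <string.h>

/* One kd-tree node packed into four ints, as described above:
   x = split dimension, y = left child index, z = right child index,
   w = the float split value reinterpreted as an int. */
typedef struct { int x, y, z, w; } node_int4;

/* Host-side equivalents of the __float_as_int / __int_as_float intrinsics. */
static int float_as_int(float f) { int i; memcpy(&i, &f, sizeof i); return i; }
static float int_as_float(int i) { float f; memcpy(&f, &i, sizeof f); return f; }

static node_int4 pack_node(int split_dim, int left, int right, float split_value) {
    node_int4 n = { split_dim, left, right, float_as_int(split_value) };
    return n;
}

/* During the tree walk, a thread fetches node n, compares the query
   coordinate along the split dimension against the decoded split value,
   and descends to the left or right child index. */
static int next_child(node_int4 n, const float *query) {
    return (query[n.x] <= int_as_float(n.w)) ? n.y : n.z;
}
```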
Good luck.
Can you share your simple CUDA implementation of KNN without using kd-trees?
Thanks
This is my “poor” contribution. This code is for MATLAB:
- Compute all the distances in parallel
- Sort each column in parallel using comb sort
- Select the k-th row and compute its square root
This is very basic. If anybody has an idea to improve my code, it would be very, very helpful!
Vince
#include <stdio.h>
#include "mex.h"
#include "cuda.h"
#define BLOCK_DIM 16
// Compute the distances between the matrix A and B.
// A : reference points
// B : query points
// wA : theoretical width of A
// wB : theoretical width of B
// pA : real width of A (pitch in number of columns)
// pB : real width of B (pitch in number of columns)
// dim : dimension = height of A and B
// C : output matrix
__global__ void
cuComputeDistance( float* A, int wA, int pA, float* B, int wB, int pB, int dim, float* C)
{
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
// Sub-matrix of A : begin, step, end
int begin_A = BLOCK_DIM * by;
int end_A = begin_A + (dim-1) * pA;
int step_A = BLOCK_DIM * pA;
// Sub-matrix of B : begin, step
int begin_B = BLOCK_DIM * bx;
int step_B = BLOCK_DIM * pB;
// Csub is used to store the element of the block sub-matrix that is computed by the thread
float Csub = 0;
// Conditions
int condA = (begin_A + ty < wA);
int condB = (begin_B + tx < wB);
// Loop over all the sub-matrices of A and B required to compute the block sub-matrix
for (int a = begin_A, b = begin_B; a <= end_A; a += step_A, b += step_B) {
// Declaration of the shared memory arrays As and Bs used to store the sub-matrix of A and B
__shared__ float shared_A[BLOCK_DIM][BLOCK_DIM];
__shared__ float shared_B[BLOCK_DIM][BLOCK_DIM];
// Load the matrices from device memory to shared memory; each thread loads one element of each matrix
if (a/pA + ty < dim){
shared_A[ty][tx] = (begin_A + tx < wA)? A[a + pA * ty + tx] : 0;
shared_B[ty][tx] = (begin_B + tx < wB)? B[b + pB * ty + tx] : 0;
}
else{
shared_A[ty][tx] = 0;
shared_B[ty][tx] = 0;
}
// Synchronize to make sure the matrices are loaded
__syncthreads();
// Compute the difference between the two matrixes; each thread computes one element of the block sub-matrix
if (condA && condB){
for (int k = 0; k < BLOCK_DIM; ++k)
Csub += (shared_A[k][ty] - shared_B[k][tx]) * (shared_A[k][ty] - shared_B[k][tx]);
}
// Synchronize to make sure that the preceding computation is done before loading two new sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write the block sub-matrix to device memory; each thread writes one element
if (condA && condB){
int c = (begin_A + ty)*pB + (begin_B + tx);
C[c] = Csub;
}
}
// Sort each column of matrix tab.
// tab : matrix containing distances
// width : theoretical width of tab
// height : height of tab
// pitch : real width of tab (pitch in number of columns)
__global__ void cuParallelCombSort(float *tab, int width, int height, int pitch){
int gap = height;
int swapped;
float temp;
float * tab_tmp;
int index_1;
int index_2;
int i;
unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
if (xIndex<width){
tab_tmp = &tab[xIndex];
for (;;) {
swapped = 0;
gap = gap * 10 / 13;
if (gap < 1)
gap = 1;
if (gap == 9 || gap == 10)
gap = 11;
for (i = 0; i < height - gap; i++) {
index_1 = i*pitch;
index_2 = (i+gap)*pitch;
if (tab_tmp[index_1] > tab_tmp[index_2]) {
// swap
temp = tab_tmp[index_1];
tab_tmp[index_1] = tab_tmp[index_2];
tab_tmp[index_2] = temp;
swapped = 1;
}
}
if (gap == 1 && !swapped)
break;
}
}
}
// Get the k-th value of each column and compute its square root.
// dist : matrix containing all the distances
// width : theoretical width of dist
// pitch : real width of dist (pitch in number of columns)
// k : index of the k-th nearest neighbor
// output : the output vector
__global__ void cuParallelSqrt(float *dist, int width, int pitch, int k, float* output){
unsigned int xIndex = blockIdx.x * BLOCK_DIM * BLOCK_DIM + threadIdx.x;
if (xIndex<width)
output[xIndex] = sqrtf(dist[k*pitch+xIndex]);
}
// Print error message
// e : error code
// var : name of variable
// mem_free : size of free memory in bytes
// mem_size : size of memory we tried to allocate
void printErrorMessageAndMatlabExit(cudaError_t e, char* var, int mem_free, int mem_size){
printf("==================================================\n");
printf("ERROR MEMORY ALLOCATION : %s\n",cudaGetErrorString(e));
printf("Variable : %s\n",var);
printf("Free memory : %d\n",mem_free);
printf("Wished allocated memory : %d\n",mem_size);
printf("==================================================\n");
mexErrMsgTxt("CUDA ERROR DURING MEMORY ALLOCATION");
}
// MEX function
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]){
// Reference points
float* ref_host;
float* ref_dev;
int ref_width;
int ref_height;
size_t ref_pitch;
cudaError_t ref_alloc;
// Query points
float* query_host;
float* query_dev;
int query_width;
int query_height;
size_t query_pitch;
cudaError_t query_alloc;
// Output array
float* output_host;
float* output_dev;
size_t output_pitch;
cudaError_t output_alloc;
// Distance
float* dist_dev;
size_t dist_pitch;
cudaError_t dist_alloc;
// K for K-Nearest Neighbor
int k;
// Size of float
int sof = sizeof(float);
// Memory information
unsigned int mem_free, mem_total;
// CUDA Init and get memory information
cuInit(0);
cuMemGetInfo(&mem_free,&mem_total);
//printf("Free : %d\n",mem_free);
// Read input information
ref_host = (float *) mxGetPr(prhs[0]);
ref_width = mxGetM(prhs[0]);
ref_height = mxGetN(prhs[0]);
query_host = (float *) mxGetPr(prhs[1]);
query_width = mxGetM(prhs[1]);
query_height = mxGetN(prhs[1]);
k = (int)mxGetScalar(prhs[2]);
// Allocation of output array
output_host = (float *) mxGetPr(plhs[0]=mxCreateNumericMatrix(query_width,1,mxSINGLE_CLASS,mxREAL));
// Allocation CUDA memory
cuMemGetInfo(&mem_free,&mem_total);
dist_alloc = cudaMallocPitch( (void **) &dist_dev, &dist_pitch, query_width*sof, ref_width);
if (dist_alloc)
{
// Print error message and matlab exit
printErrorMessageAndMatlabExit(dist_alloc,"dist_dev", mem_free, query_width*ref_width*sof);
}
// Allocation CUDA memory
cuMemGetInfo(&mem_free,&mem_total);
ref_alloc = cudaMallocPitch( (void **) &ref_dev, &ref_pitch, ref_width*sof, ref_height);
if (ref_alloc)
{
// Free memory of var already allocated
cudaFree(dist_dev);
// Print error message and matlab exit
printErrorMessageAndMatlabExit(ref_alloc,"ref_dev", mem_free, ref_width*ref_height*sof);
}
// Allocation CUDA memory
cuMemGetInfo(&mem_free,&mem_total);
query_alloc = cudaMallocPitch( (void **) &query_dev, &query_pitch, query_width*sof, query_height);
if (query_alloc)
{
// Free memory of var already allocated
cudaFree(dist_dev);
cudaFree(ref_dev);
// Print error message and matlab exit
printErrorMessageAndMatlabExit(query_alloc,"query_dev", mem_free, query_width*query_height*sof);
}
// Allocation CUDA memory
cuMemGetInfo(&mem_free,&mem_total);
output_alloc = cudaMallocPitch( (void **) &output_dev, &output_pitch, query_width*sof, 1);
if (output_alloc)
{
// Free memory of var already allocated
cudaFree(dist_dev);
cudaFree(ref_dev);
cudaFree(query_dev);
// Print error message and matlab exit
printErrorMessageAndMatlabExit(output_alloc,"output_dev", mem_free, query_width*sof);
}
// Copy host to device
cudaMemcpy2D(ref_dev , ref_pitch , ref_host , ref_width*sof , ref_width*sof , ref_height , cudaMemcpyHostToDevice);
cudaMemcpy2D(query_dev , query_pitch , query_host , query_width*sof , query_width*sof , query_height , cudaMemcpyHostToDevice);
// Compute square distances
dim3 grid(query_width/BLOCK_DIM, ref_width/BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM,BLOCK_DIM,1);
if (query_width%BLOCK_DIM !=0) grid.x+=1;
if (ref_width%BLOCK_DIM !=0) grid.y+=1;
cuComputeDistance<<<grid,threads>>>( ref_dev, ref_width, ref_pitch/sof, query_dev, query_width, query_pitch/sof, ref_height, dist_dev );
// Sort each column in parallel
dim3 grid2(query_width/BLOCK_DIM,1,1);
dim3 threads2(BLOCK_DIM,1,1);
if (query_width%BLOCK_DIM !=0) grid2.x+=1;
cuParallelCombSort<<<grid2,threads2>>>(dist_dev, query_width, ref_width, dist_pitch/sof);
// Compute sqrt
dim3 grid3(query_width/(BLOCK_DIM*BLOCK_DIM),1,1);
dim3 threads3(BLOCK_DIM*BLOCK_DIM,1,1);
if (query_width%(BLOCK_DIM*BLOCK_DIM) !=0) grid3.x+=1;
cuParallelSqrt<<<grid3,threads3>>>(dist_dev, query_width, dist_pitch/sof, k-1, output_dev);
// Copy memory
cudaMemcpy2D(output_host, query_width*sof, output_dev, output_pitch, query_width*sof, 1, cudaMemcpyDeviceToHost);
// Free memory
cudaFree(dist_dev);
cudaFree(ref_dev);
cudaFree(query_dev);
cudaFree(output_dev);
}
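Since only the k-th distance is kept, one possible improvement to the code above (sketched here on the CPU, with illustrative names) is to replace the full comb sort of each column with a selection of the k smallest values using a max-heap of size k; on the GPU, each thread could run this over its own column:

```c
/* Find the k-th smallest value in one column without fully sorting it.
   A max-heap holds the k best candidates seen so far; its root is always
   the current k-th smallest. */

/* Restore the max-heap property downward from index i. */
static void sift_down(float *h, int n, int i) {
    for (;;) {
        int l = 2*i + 1, r = l + 1, m = i;
        if (l < n && h[l] > h[m]) m = l;
        if (r < n && h[r] > h[m]) m = r;
        if (m == i) return;
        float t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

/* Returns the k-th smallest (1-based) of dist[0..n-1].
   heap is caller-provided scratch space for k floats; requires 1 <= k <= n. */
float kth_smallest(const float *dist, int n, int k, float *heap) {
    int size = 0;
    for (int i = 0; i < n; ++i) {
        if (size < k) {                 /* heap not yet full: insert, sift up */
            int j = size++;
            heap[j] = dist[i];
            while (j > 0 && heap[(j-1)/2] < heap[j]) {
                float t = heap[j]; heap[j] = heap[(j-1)/2]; heap[(j-1)/2] = t;
                j = (j-1)/2;
            }
        } else if (dist[i] < heap[0]) { /* better than current k-th: replace root */
            heap[0] = dist[i];
            sift_down(heap, k, 0);
        }
    }
    return heap[0];  /* max of the k smallest = the k-th smallest */
}
```

This is O(n log k) per column instead of the comb sort’s roughly O(n log n), and it touches each distance once, which should also suit the GPU’s memory access pattern better.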
OK, so now Kun Zhou’s SIGGRAPH 2008 kd-tree construction surpasses all of these? I’ll experiment some more.
Error using mex
Creating library kernel.lib and object kernel.exp
kernel.obj : error LNK2019: unresolved external symbol cuInit
referenced in function mexFunction
kernel.obj : error LNK2019: unresolved external symbol cuMemGetInfo_v2
referenced in function mexFunction
kernel.mexw64 : fatal error LNK1120: 2 unresolved externals
Error in mexcuda (line 157)
[varargout{1:nargout}] = mex(mexArguments{:});
I ran the same code but I am facing these errors. Can anyone help me?
I also want to implement GPU-based k-nearest neighbor search on 3-dimensional data using MATLAB and CUDA.
I am using MATLAB 2015b with CUDA 7.0.
Make sure you explicitly link against libcuda and libcudart in your link options, e.g.
-lcuda -lcudart
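With mexcuda, those libraries can be passed directly on the command line; a possible invocation (the source file name here is illustrative, adjust it to yours) would be:

```shell
mexcuda kernel.cu -lcuda -lcudart
```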