K-Nearest Neighbor Implémentation in CUDA

I’ve to implement the K-Nearest Neighbor algorithm in CUDA.
Now, I’ve a simple CUDA implementation where I compute all the distances and I get only the k-th distance.
This code works but I know that there is a more complex and faster implementation using kd-tree.
Do anyone have a KNN or kd-tree implementation in CUDA?
Thanks for your help!

Building a kd-tree on the GPU would be pretty tough, since it is a recursive and mostly serial operation. The only way I could see to exploit any parallelism in the building of the tree would be to have each kernel launch handle one level of the tree. Each block in the grid would read in its assigned leaf, split it and then write out the two new split leaves. I’m not sure of the most efficient way to do the split, but the last time I wrote a kd-tree, I recall there is way that you only need to sort by x,y,z once at the beginning… or maybe I remember incorrectly.

The recursion would be done by calling this kernel repeatedly with more blocks each time. The challenges are 1) coming up with a data structure for the tree that can be read /written coalesced. 2) Coding the kernel to handle the disparate number of particles in each leaf: at the top, one block will need to handle 1000’s and at the bottom, there are only a few

There’s a lot going on there, and a lot of wasted memory reads/writes so I’m not that confident that building a kd-tree on the GPU would be any faster than an optimized CPU version.

I am currently working on building (left)-balanced kd-tree by using GPU.

The most optimal appears to be a combination of GPU and CPU for tree-build. I sort particles at every step on GPU (via RadixSort or BitonicSort from CUDA_SDK example) and the rest is done on CPU. In this case, you offload sorting to GPU which is orders of magnitude faster than on CPU.

The tree-walk however can be done completely on GPU. The reads are random but localised, so cannot use shared memory, in principle. Via textures, I achieve bandwidth of 30GB/s.

Still experimenting with this.

Do you need to build the K-D tree on the GPU? I mean, if it’s (semi-)static you could build it in the CPU and upload it. If you do some particle simulation, then of course you need to rebuild it all the time.

Well, if the tree-build becomes a bottleneck–and construction of the balanced tree might easily be–since everything else is fast, surely it is definitely worth to offload some parts of the tree construction to GPU just speed it up. Otherwise, I agree, there is no much practical sense in doing this.

Very interesting.
I like the approach consisting in using the parallel sort. Nice idea!
I know the theory of the kd-tree construction but I’m not sure to understand how to do this in cude.
I mean, what kind of struture do you use? An array? I’ve seen some C implementation using struct but I think that it’s not possible to copy the tree in the GPU global memory…

A structure will do the job, and yes you can use this in CUDA. The problem is that it appears that this cannot be read via tex1Dfetch, so you loose advantage of caching.

However, you may want to store the tree structure in an array of int4 (split_dimension, left, right, __int_as_float(split_value)). In this way, you can use tex1Dfetch to fetch the node. Assuming you have data locality in both space and memory, you can reach pretty high bandwidth. In this way, I get about 30GB/s, but still there is a room for improvement.

Good luck.

Can you share your simple CUDA implementation of KNN without using KD trees ?


This is my “poor” contribution. This code is for matlab.

  1. Compute all the distance in parallel

  2. Sort each column in parallel using Comb sort

  3. Select the k-th row and compute its square root

This is very basic. If anybody have an idea to improve my code, it will be very very helpful!!!


#include <stdio.h>

#include "mex.h"

#include "cuda.h"

#define BLOCK_DIM 16

// Compute the distances between the matrix A and B.

// A   : reference points

// B   : query points

// wA  : theorical width of A

// wB  : theorical width of B

// pA  : real width of A (pitch in number of column)

// pB  : real width of B (pitch in number of column)

// dim : dimension = height of A and B

// C   : output matrix

__global__ void

cuComputeDistance( float* A, int wA, int pA, float* B, int wB, int pB, int dim,  float* C)


    // Block index

    int bx = blockIdx.x;

    int by = blockIdx.y;


    // Thread index

    int tx = threadIdx.x;

    int ty = threadIdx.y;


    // Sub-matrix of A : begin, step, end

    int begin_A = BLOCK_DIM * by;

    int end_A   = begin_A + (dim-1) * pA;

    int step_A  = BLOCK_DIM * pA;


    // Sub-matrix of B : begin, step

    int begin_B = BLOCK_DIM * bx;

    int step_B  = BLOCK_DIM * pB;


    // Csub is used to store the element of the block sub-matrix that is computed by the thread

    float Csub = 0;


    // Conditions

    int condA = (begin_A + ty < wA);

    int condB = (begin_B + tx < wB);


    // Loop over all the sub-matrices of A and B required to compute the block sub-matrix

    for (int a = begin_A, b = begin_B; a <= end_A; a += step_A, b += step_B) {


        // Declaration of the shared memory arrays As and Bs used to store the sub-matrix of A and B

        __shared__ float shared_A[BLOCK_DIM][BLOCK_DIM];

        __shared__ float shared_B[BLOCK_DIM][BLOCK_DIM];


        // Load the matrices from device memory to shared memory; each thread loads one element of each matrix

        if (a/pA + ty < dim){

            shared_A[ty][tx] = (begin_A + tx < wA)? A[a + pA * ty + tx] : 0;

            shared_B[ty][tx] = (begin_B + tx < wB)? B[b + pB * ty + tx] : 0;



            shared_A[ty][tx] = 0;

            shared_B[ty][tx] = 0;



        // Synchronize to make sure the matrices are loaded



        // Compute the difference between the two matrixes; each thread computes one element of the block sub-matrix

        if (condA && condB){

            for (int k = 0; k < BLOCK_DIM; ++k)

                Csub += (shared_A[k][ty] -  shared_B[k][tx]) * (shared_A[k][ty] -  shared_B[k][tx]);



        // Synchronize to make sure that the preceding computation is done before loading two new sub-matrices of A and B in the next iteration




    // Write the block sub-matrix to device memory; each thread writes one element

    if (condA && condB){

        int c = (begin_A + ty)*pB + (begin_B + tx);

        C[c] = Csub;



// Sort the column of matric tab.

// tab    : matrix containing distances

// width  : theorical width of tab

// height : height of tab

// pitch  : real width of tab (pitch in number of column)

__global__ void cuParallelCombSort(float *tab, int width, int height, int pitch){


    int     gap = height;

    float   swapped;

    float   temp;

    float * tab_tmp;

    int     index_1;

    int     index_2;

    int     i;


    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;


    if (xIndex<width){


        tab_tmp = &tab[xIndex];


        for (;;) {

            swapped = 0;


            gap = gap * 10 / 13;

            if (gap < 1)

                gap = 1;

            if (gap == 9 || gap == 10)

                gap = 11;



            for (i = 0; i < height - gap; i++) {

                index_1 = i*pitch;

                index_2 = (i+gap)*pitch;

                if (tab_tmp[index_1] > tab_tmp[index_2]) {

                    // swap

                    temp = tab_tmp[index_1];

                    tab_tmp[index_1] = tab_tmp[index_2];

                    tab_tmp[index_2] = temp;

                    swapped = 1;




            if (gap == 1 && !swapped)





// Get the k-th value of each column and compute its square root.

// dist   : matrix containing all the distances

// width  : theorical width of B

// pitch  : real width of A (pitch in number of column)

// k      : k-th nearest neighbor

// output : the output vector

__global__ void cuParallelSqrt(float *dist, int width, int pitch, int k, float* output){

    unsigned int xIndex = blockIdx.x * BLOCK_DIM * BLOCK_DIM + threadIdx.x;

    if (xIndex<width)

        output[xIndex] = sqrt(dist[k*pitch+xIndex]);


// Print error message

// e        : error code

// var      : name of variable

// mem_free : size of free memory in bytes

// mem_size : size of memory tryed to be allocated

void printErrorMessageAndMatlabExit(cudaError_t e, char* var, int mem_free, int mem_size){


    printf("ERROR MEMORY ALLOCATION  : %s\n",cudaGetErrorString(e));

    printf("Variable                 : %s\n",var);

    printf("Free memory              : %d\n",mem_free);

    printf("Whished allocated memory : %d\n",mem_size);




// MEX function

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]){


    // Reference points

    float*      ref_host;

    float*      ref_dev;

    int         ref_width;

    int         ref_height;

    size_t      ref_pitch;

    cudaError_t ref_alloc;


    // Reference points

    float*      query_host;

    float*      query_dev;

    int         query_width;

    int         query_height;

    size_t     query_pitch;

    cudaError_t query_alloc;


    // Output array

    float*     output_host;

    float*     output_dev;

    size_t     output_pitch;

    cudaError_t output_alloc;


    // Distance

    float*     dist_dev;

    size_t     dist_pitch;

    cudaError_t dist_alloc;


    // K for K-Nearest Neighbor

    int k;


    // Size of float

    int sof = sizeof(float);


    // Memory information

    unsigned int mem_free, mem_total;


    // CUDA Init and get memory information



    //printf("Free : %d\n",mem_free);


    // Read input information

    ref_host        = (float *) mxGetPr(prhs[0]);

    ref_width       = mxGetM(prhs[0]);

    ref_height      = mxGetN(prhs[0]);

    query_host      = (float *) mxGetPr(prhs[1]);

    query_width     = mxGetM(prhs[1]);

    query_height    = mxGetN(prhs[1]);

    k               = (int)mxGetScalar(prhs[2]);


    // Allocation of output array

    output_host = (float *) mxGetPr(plhs[0]=mxCreateNumericMatrix(query_width,1,mxSINGLE_CLASS,mxREAL));


    // Allocation CUDA memory


    dist_alloc = cudaMallocPitch( (void **) &dist_dev, &dist_pitch, query_width*sof, ref_width);

    if (dist_alloc)


        // Print error message and matlab exit

        printErrorMessageAndMatlabExit(dist_alloc,"dist_dev", mem_free, query_width*ref_width*sof);



    // Allocation CUDA memory


    ref_alloc = cudaMallocPitch( (void **) &ref_dev, &ref_pitch, ref_width*sof, ref_height);

    if (ref_alloc)


        // Free memory of var already allocated


        // Print error message and matlab exit

        printErrorMessageAndMatlabExit(ref_alloc,"dist_dev", mem_free, ref_width*ref_height*sof);



    // Allocation CUDA memory


    query_alloc = cudaMallocPitch( (void **) &query_dev, &query_pitch, query_width*sof, query_height);

    if (query_alloc)


        // Free memory of var already allocated



        // Print error message and matlab exit

        printErrorMessageAndMatlabExit(query_alloc,"dist_dev", mem_free, query_width*query_height*sof);



    // Allocation CUDA memory


    output_alloc = cudaMallocPitch( (void **) &output_dev, &output_pitch, query_width*sof, 1);

    if (output_alloc)


        // Free memory of var already allocated




        // Print error message and matlab exit

        printErrorMessageAndMatlabExit(output_alloc,"dist_dev", mem_free, query_width*sof);



    // Copy host to device

    cudaMemcpy2D(ref_dev   , ref_pitch    , ref_host    , ref_width*sof   ,  ref_width*sof  , ref_height    , cudaMemcpyHostToDevice);

    cudaMemcpy2D(query_dev , query_pitch  , query_host  , query_width*sof , query_width*sof , query_height  , cudaMemcpyHostToDevice);


    // Compute square distances

    dim3 grid(query_width/BLOCK_DIM, ref_width/BLOCK_DIM, 1);

    dim3 threads(BLOCK_DIM,BLOCK_DIM,1);

    if (query_width%BLOCK_DIM !=0) grid.x+=1;

    if (ref_width%BLOCK_DIM   !=0) grid.y+=1;

    cuComputeDistance<<<grid,threads>>>( ref_dev, ref_width, ref_pitch/sof, query_dev, query_width, query_pitch/sof, ref_height, dist_dev );

   // Sort each row in parallel

    dim3 grid2(query_width/BLOCK_DIM,1,1);

    dim3 threads2(BLOCK_DIM,1,1);

    if (query_width%BLOCK_DIM !=0) grid2.x+=1;

    cuParallelCombSort<<<grid2,threads2>>>(dist_dev, query_width, ref_width, dist_pitch/sof);

   // Compute sqrt

    dim3 grid3(query_width/(BLOCK_DIM*BLOCK_DIM),1,1);

    dim3 threads3(BLOCK_DIM*BLOCK_DIM,1,1);

    if (query_width%(BLOCK_DIM*BLOCK_DIM) !=0) grid3.x+=1;

    cuParallelSqrt<<<grid3,threads3>>>(dist_dev, query_width, query_pitch/sof, k-1, output_dev);


    // Copy memory

    cudaMemcpy2D(output_host, query_width*sof, output_dev, output_pitch, query_width*sof, 1, cudaMemcpyDeviceToHost);


    // Free memory






ok, now Kun Zhou siggraph08’s kd-tree construction surpasses all huh? i’ll try more.

Error using mex
Creating library kernel.lib and object kernel.exp
kernel.obj : error LNK2019: unresolved external symbol cuInit
referenced in function mexFunction
kernel.obj : error LNK2019: unresolved external symbol cuMemGetInfo_v2
referenced in function mexFunction
kernel.mexw64 : fatal error LNK1120: 2 unresolved externals

Error in mexcuda (line 157)
[varargout{1:nargout}] = mex(mexArguments{:});
i run the same code nut facing these error any one help me??

I also want to implement k-nearest neighbor searching on 3-dimensional data using matlab and cuda GPU based
i am using matlab 2015b with cuda7.0

Make sure you explicitly include libcuda and libcudart in your link options… e.g.

-lcuda -lcudart