To get more familiar with CUDA, I’m working on implementing a brute force KNN search.
I’ve seen both the "Fast k Nearest Neighbor Search using GPU" paper and the "K-Nearest Neighbor, Implementation in CUDA" thread.
I’m working on a data set that has 100 training points and 100 testing points, each with 60 attributes.
Each thread is currently responsible for computing the k nearest distances for a specified testing point.
In order to do this, each thread needs to iterate over all training points. Since all threads need to access the training points, I wanted to copy them into shared memory.
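To make the setup concrete, here is a minimal sketch of the one-thread-per-testing-point approach, reading the training points straight from global memory. It assumes squared Euclidean distance, row-major storage, and k = 5, none of which are fixed above; `knnDistances` and `runKnn` are hypothetical names, and error checking is omitted for brevity.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

#define NUM_ATTR 60  // attributes per point (from the post)
#define K 5          // assumed k; the post does not give a value

// One thread per testing point: scan every training point in global memory
// and keep the K smallest squared Euclidean distances in ascending order.
__global__ void knnDistances(const float *train, int numTrain,
                             const float *test, int numTest,
                             float *best /* numTest * K floats */)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTest) return;

    float nearest[K];
    for (int i = 0; i < K; ++i) nearest[i] = FLT_MAX;

    for (int j = 0; j < numTrain; ++j) {
        float d = 0.0f;
        for (int a = 0; a < NUM_ATTR; ++a) {
            float diff = test[t * NUM_ATTR + a] - train[j * NUM_ATTR + a];
            d += diff * diff;
        }
        // Insert d into the sorted list if it beats the current worst entry.
        if (d < nearest[K - 1]) {
            int i = K - 1;
            while (i > 0 && nearest[i - 1] > d) {
                nearest[i] = nearest[i - 1];
                --i;
            }
            nearest[i] = d;
        }
    }
    for (int i = 0; i < K; ++i) best[t * K + i] = nearest[i];
}

// Host helper (hypothetical): copies the data over, launches the kernel,
// and returns K distances per testing point. No error checking.
std::vector<float> runKnn(const std::vector<float> &train,
                          const std::vector<float> &test)
{
    int numTrain = (int)(train.size() / NUM_ATTR);
    int numTest = (int)(test.size() / NUM_ATTR);
    float *dTrain = 0, *dTest = 0, *dBest = 0;
    cudaMalloc((void **)&dTrain, train.size() * sizeof(float));
    cudaMalloc((void **)&dTest, test.size() * sizeof(float));
    cudaMalloc((void **)&dBest, numTest * K * sizeof(float));
    cudaMemcpy(dTrain, train.data(), train.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dTest, test.data(), test.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    int threads = 128;  // assumed block size
    int blocks = (numTest + threads - 1) / threads;
    knnDistances<<<blocks, threads>>>(dTrain, numTrain, dTest, numTest, dBest);

    std::vector<float> best(numTest * K);
    // A blocking cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(best.data(), dBest, best.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dTrain);
    cudaFree(dTest);
    cudaFree(dBest);
    return best;
}
```

Calling `runKnn(train, test)` with two flat arrays of 100 * 60 floats would return 100 * K distances, K per testing point.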
However, I have 100 training points, each with 60 floating-point attributes, which comes to 4 bytes * 100 * 60 = 24,000 bytes. That is more than the available shared memory, correct?
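The size can be checked against whatever limit the device actually reports. This is a rough sketch using `cudaGetDeviceProperties` and the `sharedMemPerBlock` field; the helper names `trainingSetBytes` and `fitsInSharedMemory` are made up for illustration. On a device that reports 16 KB (16,384 bytes) per block, 24,000 bytes would not fit; devices that report more may behave differently.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Bytes needed to hold the whole training set as floats.
size_t trainingSetBytes(int numPoints, int numAttr)
{
    return sizeof(float) * (size_t)numPoints * (size_t)numAttr;
}

// True if the whole training set fits in one block's shared memory on the
// given device. Returns false if the device properties cannot be read.
bool fitsInSharedMemory(int numPoints, int numAttr, int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    return trainingSetBytes(numPoints, numAttr) <= prop.sharedMemPerBlock;
}
```

For the data set here, `trainingSetBytes(100, 60)` gives the 24,000 bytes computed above, and `fitsInSharedMemory(100, 60, 0)` compares that with the limit of device 0.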
Would I be better off using global or texture memory to store the training points? Or should I stick with shared memory and have the host call the kernel multiple times, each time with fewer training points, with a cudaThreadSynchronize between the calls?
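For comparison, a related variant of the shared-memory option is to stay in a single launch and have each block stage the training points through shared memory one tile at a time. This is not the multiple-launch scheme described above, just a sketch of the tiling idea. It assumes squared Euclidean distance, k = 5, a tile of 32 points (32 * 60 * 4 = 7,680 bytes), and 128 threads per block; `knnTiled` and `runKnnTiled` are hypothetical names, and error checking is omitted.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

#define NUM_ATTR 60     // attributes per point (from the post)
#define K 5             // assumed k
#define TILE_POINTS 32  // assumed tile size: 32 * 60 * 4 bytes = 7,680 bytes
#define THREADS 128     // assumed block size

__global__ void knnTiled(const float *train, int numTrain,
                         const float *test, int numTest, float *best)
{
    __shared__ float tile[TILE_POINTS * NUM_ATTR];

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float nearest[K];
    for (int i = 0; i < K; ++i) nearest[i] = FLT_MAX;

    for (int base = 0; base < numTrain; base += TILE_POINTS) {
        int count = min(TILE_POINTS, numTrain - base);

        // All threads in the block cooperate to load the current tile.
        for (int idx = threadIdx.x; idx < count * NUM_ATTR; idx += blockDim.x)
            tile[idx] = train[base * NUM_ATTR + idx];
        __syncthreads();

        // Threads past the last testing point skip the work but still
        // reach both barriers, so no thread returns early.
        if (t < numTest) {
            for (int j = 0; j < count; ++j) {
                float d = 0.0f;
                for (int a = 0; a < NUM_ATTR; ++a) {
                    float diff = test[t * NUM_ATTR + a] - tile[j * NUM_ATTR + a];
                    d += diff * diff;
                }
                // Sorted insertion, keeping the K smallest distances.
                if (d < nearest[K - 1]) {
                    int i = K - 1;
                    while (i > 0 && nearest[i - 1] > d) {
                        nearest[i] = nearest[i - 1];
                        --i;
                    }
                    nearest[i] = d;
                }
            }
        }
        __syncthreads();  // do not overwrite the tile while it is still read
    }

    if (t < numTest)
        for (int i = 0; i < K; ++i) best[t * K + i] = nearest[i];
}

// Host helper (hypothetical): one launch covers all training points.
std::vector<float> runKnnTiled(const std::vector<float> &train,
                               const std::vector<float> &test)
{
    int numTrain = (int)(train.size() / NUM_ATTR);
    int numTest = (int)(test.size() / NUM_ATTR);
    float *dTrain = 0, *dTest = 0, *dBest = 0;
    cudaMalloc((void **)&dTrain, train.size() * sizeof(float));
    cudaMalloc((void **)&dTest, test.size() * sizeof(float));
    cudaMalloc((void **)&dBest, numTest * K * sizeof(float));
    cudaMemcpy(dTrain, train.data(), train.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dTest, test.data(), test.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    int blocks = (numTest + THREADS - 1) / THREADS;
    knnTiled<<<blocks, THREADS>>>(dTrain, numTrain, dTest, numTest, dBest);

    std::vector<float> best(numTest * K);
    // A blocking cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(best.data(), dBest, best.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dTrain);
    cudaFree(dTest);
    cudaFree(dBest);
    return best;
}
```

In this sketch the running k-best list lives in each thread's local array across tiles, so nothing has to be carried between kernel launches. Splitting the work across several host calls would instead need that list written to and reread from global memory on every call.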
Thank you for the assistance,
-Jesse