Too much data for shared memory - use global memory or call the kernel multiple times?

To get more familiar with CUDA, I’m working on implementing a brute force KNN search.

I’ve seen both the “Fast k Nearest Neighbor Search using GPU” paper and the “K-Nearest Neighbor, Implementation in CUDA” thread.

I’m working on a data set with 100 training points, 100 testing points, and 60 attributes per point.

Each thread is currently responsible for computing the k nearest distances for a specified testing point.

In order to do this, each thread needs to iterate over all training points. Since all threads need to access the training points, I wanted to copy them into shared memory.
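For concreteness, the structure I currently have looks roughly like this. It's a simplified sketch: all names are placeholders, everything stays in global memory, and the k-selection step is omitted.

```
#define NUM_TRAIN 100
#define NUM_TEST  100
#define NUM_ATTR  60

// One thread per testing point; each thread scans every training point
// in global memory and records the squared Euclidean distance.
__global__ void knnDistances(const float *train,  // NUM_TRAIN x NUM_ATTR
                             const float *test,   // NUM_TEST  x NUM_ATTR
                             float *dist)         // NUM_TEST  x NUM_TRAIN
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // testing point index
    if (t >= NUM_TEST) return;

    for (int i = 0; i < NUM_TRAIN; ++i) {
        float d = 0.0f;
        for (int a = 0; a < NUM_ATTR; ++a) {
            float diff = test[t * NUM_ATTR + a] - train[i * NUM_ATTR + a];
            d += diff * diff;
        }
        dist[t * NUM_TRAIN + i] = d;
    }
}
```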

However, I have 100 training points, each with 60 floating-point attributes, which works out to 4 bytes * 100 * 60 = 24,000 bytes. That is more than the shared memory available per block, correct?
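In other words, the allocation I can't make would be something like this (hypothetical name; the 16 KB figure is the per-block shared memory limit on compute capability 1.x devices):

```
// 100 points * 60 attributes * 4 bytes = 24,000 bytes per block,
// which exceeds the 16,384 bytes of shared memory on compute 1.x GPUs.
__shared__ float sTrain[100 * 60];
```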

I’m wondering whether I would be best off using global/texture memory to store the training points, or whether I should stick with shared memory and have the host call the kernel multiple times, each launch covering a subset of the training points, with a cudaThreadSynchronize() between launches.
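To make the second option concrete, here is a sketch of what I have in mind (again with placeholder names, k-selection omitted, and CHUNK as an assumed chunk size chosen so one chunk fits in shared memory):

```
#define NUM_TRAIN 100
#define NUM_TEST  100
#define NUM_ATTR  60
#define CHUNK      32   // 32 * 60 * 4 = 7,680 bytes of shared memory

// Each launch covers `count` training points starting at index `base`;
// the chunk is staged in shared memory before the distance loop runs.
__global__ void knnChunk(const float *train, const float *test,
                         float *dist, int base, int count)
{
    __shared__ float sTrain[CHUNK * NUM_ATTR];
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // testing point index

    // All threads in the block cooperate to load this chunk.
    for (int j = threadIdx.x; j < count * NUM_ATTR; j += blockDim.x)
        sTrain[j] = train[base * NUM_ATTR + j];
    __syncthreads();

    if (t < NUM_TEST) {
        for (int i = 0; i < count; ++i) {
            float d = 0.0f;
            for (int a = 0; a < NUM_ATTR; ++a) {
                float diff = test[t * NUM_ATTR + a] - sTrain[i * NUM_ATTR + a];
                d += diff * diff;
            }
            dist[t * NUM_TRAIN + (base + i)] = d;  // squared distance
        }
    }
}

// Host side: loop over the training set in chunks, one launch per chunk.
void launchInChunks(const float *dTrain, const float *dTest, float *dDist)
{
    for (int base = 0; base < NUM_TRAIN; base += CHUNK) {
        int count = (NUM_TRAIN - base < CHUNK) ? (NUM_TRAIN - base) : CHUNK;
        knnChunk<<<1, NUM_TEST>>>(dTrain, dTest, dDist, base, count);
        cudaThreadSynchronize();  // wait for this chunk before the next
    }
}
```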

Thank you for the assistance,
-Jesse

I apologize, I guess I had too many tabs open and accidentally posted this in “General CUDA GPU Computing Discussion” when it should have gone in “CUDA Programming and Development”.