Optimising Device RAM by using Shared Cache

Trying to optimise the global memory access by using sha

For a university course, I’m trying to write a program that matches a string against a “database” and then, for the rows that match, performs some computation (basically a distance).

The “database” contains categories (a fixed 256-character string) and, for each one, coordinates (2 floats). The idea is to find the categories that match a user’s choice and then to find the closest “place of interest” to where we stand.

So my idea is to copy an array of structures from host memory to device memory, with the structure laid out like this in device memory:
struct __align__(32) database_row {
    char4 category[64]; // 64*4 = 256
    float2 coordinates;
};
(if I’m wrong, just tell me)
I’m aligning the structure on 32 bytes (the argument to __align__ is in bytes, not bits) so that global memory accesses are coalesced (at least that’s what I understood from the programming guide).
Now, I could read 128 bytes per half-warp and compare them to the user-chosen category, which I want to keep in the constant cache. If a string matches, I then use the coordinates to compute the distance from my current location.
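
To make the plan above concrete, here is a minimal sketch of it, not a tuned implementation. It assumes one thread per row, zero-padded categories, a Euclidean distance, and a sentinel of -1.0f for rows that do not match. The names c_query, match_and_distance and match_distances are hypothetical, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstring>
#include <vector>

// Row layout from the post; __align__ is the CUDA spelling of the alignment attribute.
struct __align__(32) database_row {
    char4  category[64];   // 64 * 4 = 256 bytes, assumed zero-padded
    float2 coordinates;
};

// User-chosen category, kept in constant memory (hypothetical name).
__constant__ char4 c_query[64];

// One thread per row: compare the category, then compute the distance on a match.
// Rows that do not match get -1.0f, a sentinel chosen for this sketch.
__global__ void match_and_distance(const database_row* rows, int n,
                                   float2 here, float* out)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    bool same = true;
    for (int k = 0; k < 64 && same; ++k) {
        const char4 a = rows[i].category[k];  // global memory read
        const char4 b = c_query[k];           // same address for every thread
        same = (a.x == b.x) && (a.y == b.y) && (a.z == b.z) && (a.w == b.w);
    }

    float dist = -1.0f;
    if (same) {
        const float dx = rows[i].coordinates.x - here.x;
        const float dy = rows[i].coordinates.y - here.y;
        dist = sqrtf(dx * dx + dy * dy);      // assumed Euclidean distance
    }
    out[i] = dist;
}

// Hypothetical host wrapper; error checking omitted for brevity.
std::vector<float> match_distances(const database_row* h_rows, int n,
                                   const char* category, float2 here)
{
    std::vector<float> result(n > 0 ? n : 0);
    if (n <= 0) return result;

    char query[256];
    std::strncpy(query, category, sizeof(query));  // zero-pads up to 256 bytes
    cudaMemcpyToSymbol(c_query, query, sizeof(query));

    database_row* d_rows = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_rows, n * sizeof(database_row));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_rows, h_rows, n * sizeof(database_row), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    match_and_distance<<<blocks, threads>>>(d_rows, n, here, d_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(result.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_rows);
    cudaFree(d_out);
    return result;
}
```

Note that in this layout neighbouring threads read addresses one whole row apart (sizeof(database_row) bytes), so the category reads are strided rather than consecutive, which is worth keeping in mind when reasoning about coalescing.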

However, access to global memory (even though it is hopefully coalesced) is slow (400-600 cycles, so they say in the docs). Therefore, my technique is not efficient. I feel like I could optimise things by using shared memory, but I’m at a complete loss. I feel it has something to do with blocks (and their dimensions), but I can’t work it out.
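
For illustration only, one possible shape of a shared-memory version is sketched below. Each block stages a tile of rows into shared memory, with consecutive threads reading consecutive 32-bit words, and each thread then compares one row from the tile. The tile size ROWS_PER_BLOCK, the -1.0f sentinel, the Euclidean distance and all function names are assumptions made for this sketch, and error checking is omitted. Since every word is read only once, any gain would come from the access pattern of the load, not from data reuse.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

struct __align__(32) database_row {
    char4  category[64];   // 256 bytes, assumed zero-padded
    float2 coordinates;
};

constexpr int ROWS_PER_BLOCK = 32;   // hypothetical tile size, one thread per row
constexpr int CATEGORY_WORDS = 64;   // 256 bytes as 32-bit words
constexpr int ROW_WORDS = sizeof(database_row) / sizeof(unsigned int);
static_assert(offsetof(database_row, coordinates) == 256,
              "coordinates are expected directly after the category");

__constant__ unsigned int c_query_words[CATEGORY_WORDS];

__global__ void match_tiled(const unsigned int* db_words, int n,
                            float2 here, float* out)
{
    __shared__ unsigned int tile[ROWS_PER_BLOCK * ROW_WORDS];

    const int first_row = blockIdx.x * ROWS_PER_BLOCK;
    const int rows_here = min(ROWS_PER_BLOCK, n - first_row);
    const unsigned int* src = db_words + (size_t)first_row * ROW_WORDS;

    // Cooperative load: consecutive threads read consecutive words.
    for (int w = threadIdx.x; w < rows_here * ROW_WORDS; w += blockDim.x)
        tile[w] = src[w];
    __syncthreads();

    if (threadIdx.x >= rows_here) return;
    const unsigned int* row = tile + threadIdx.x * ROW_WORDS;

    bool same = true;
    for (int k = 0; k < CATEGORY_WORDS && same; ++k)
        same = (row[k] == c_query_words[k]);

    float dist = -1.0f;                 // sentinel for "no match" in this sketch
    if (same) {
        const float dx = __uint_as_float(row[CATEGORY_WORDS])     - here.x;
        const float dy = __uint_as_float(row[CATEGORY_WORDS + 1]) - here.y;
        dist = sqrtf(dx * dx + dy * dy);  // assumed Euclidean distance
    }
    out[first_row + threadIdx.x] = dist;
}

// Hypothetical host wrapper; error checking omitted for brevity.
std::vector<float> match_distances_tiled(const database_row* h_rows, int n,
                                         const char* category, float2 here)
{
    std::vector<float> result(n > 0 ? n : 0);
    if (n <= 0) return result;

    char query[256];
    std::strncpy(query, category, sizeof(query));  // zero-pads up to 256 bytes
    cudaMemcpyToSymbol(c_query_words, query, sizeof(query));

    database_row* d_rows = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_rows, n * sizeof(database_row));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_rows, h_rows, n * sizeof(database_row), cudaMemcpyHostToDevice);

    const int blocks = (n + ROWS_PER_BLOCK - 1) / ROWS_PER_BLOCK;
    match_tiled<<<blocks, ROWS_PER_BLOCK>>>(
        reinterpret_cast<const unsigned int*>(d_rows), n, here, d_out);

    cudaMemcpy(result.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_rows);
    cudaFree(d_out);
    return result;
}
```

With the 32-byte alignment the tile also carries each row's padding words, and the row stride in shared memory may cause bank conflicts during the comparison, so whether this actually beats a plain global-memory kernel would need to be measured.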

So any help would be appreciated, from pointers to a tutorial or documentation (other than the best practices and programming guides, which I have already read) to another forum thread or even the start of a solution.