Optimising Device RAM by using Shared Cache Trying to optimise the global memory access by using sha

huygens_25 · February 24, 2010, 8:37pm

For a university course, I’m trying to write a program that does a string match against a “database” and then, for the strings that match, performs a computation (basically a distance).

The “database” contains categories (a fixed-length string of 256 characters) and, for each one, coordinates (2 floats). The idea is to find the categories that match the user’s choice and then find the closest “place of interest” to where we stand.

So my idea is to copy an array of structures from host memory to device memory, with the structure laid out like this in device memory:
struct __align__(32) database_row {
    char4 category[64];   // 64 * 4 = 256 bytes, fixed-length category string
    float2 coordinates;   // 2 floats
};
(if I’m wrong, just tell me)
I’m aligning the structure on 32 bits so that global memory accesses are coalesced (at least that is what I understood from the Programming Guide).
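To make the layout question concrete, here is a minimal sketch of what the compiler would do with that structure, assuming the alignment is spelled `__align__(32)`. Note that `__align__` counts bytes, so this means 32 bytes, not 32 bits. The helper `byte_offset` is a hypothetical name used only for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Assumed spelling of the structure from the post; __align__(n) takes n in bytes.
struct __align__(32) database_row {
    char4 category[64];   // 256 bytes
    float2 coordinates;   // 8 bytes, starts at byte 256
};

// 256 + 8 = 264 bytes of data, rounded up to the next multiple of 32.
static_assert(offsetof(database_row, coordinates) == 256, "coordinates follow the category");
static_assert(sizeof(database_row) == 288, "row is padded to a multiple of 32 bytes");

// Hypothetical helper: byte address, relative to the start of the array,
// of the 32-bit word `word` inside row `row`.
__host__ __device__ inline size_t byte_offset(int row, int word)
{
    return static_cast<size_t>(row) * sizeof(database_row)
         + static_cast<size_t>(word) * 4;
}
```

If each thread takes its own row, neighbouring threads touch addresses 288 bytes apart, which as far as I understand is not a coalesced pattern on this hardware whatever the alignment. If the threads of a half-warp instead take neighbouring words of the same row, their addresses are 4 bytes apart, which is the pattern the guide describes as coalesced.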
Now, I could read 128 bytes per half-warp and compare them to the user-chosen category, which I want to keep in the constant cache. If a string matches, I then use the coordinates to compute the distance from my current location.
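As a baseline, the plan above could be sketched roughly as follows, with one thread per row and the query held in constant memory. This is only an illustration under assumptions: the names `c_query`, `match_and_distance` and `run_match` are made up, non-matching rows are marked with `FLT_MAX`, the squared distance stands in for the distance, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

#define CATEGORY_WORDS 64

struct __align__(32) database_row {
    char4 category[CATEGORY_WORDS];
    float2 coordinates;
};

// User-chosen category, read through the constant cache (hypothetical name).
__constant__ char4 c_query[CATEGORY_WORDS];

__device__ bool same_char4(char4 a, char4 b)
{
    return a.x == b.x && a.y == b.y && a.z == b.z && a.w == b.w;
}

// One thread per row: compare the category, then compute the squared distance.
__global__ void match_and_distance(const database_row* rows, int n,
                                   float2 here, float* dist2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    bool match = true;
    for (int w = 0; w < CATEGORY_WORDS && match; ++w)
        match = same_char4(rows[i].category[w], c_query[w]);

    float dx = rows[i].coordinates.x - here.x;
    float dy = rows[i].coordinates.y - here.y;
    dist2[i] = match ? dx * dx + dy * dy : FLT_MAX;  // FLT_MAX marks "no match"
}

// Host wrapper (hypothetical); error checks omitted to keep the sketch short.
void run_match(const database_row* h_rows, int n, const char4* h_query,
               float2 here, float* h_dist2)
{
    database_row* d_rows = 0;
    float* d_dist2 = 0;
    cudaMalloc(&d_rows, n * sizeof(database_row));
    cudaMalloc(&d_dist2, n * sizeof(float));
    cudaMemcpy(d_rows, h_rows, n * sizeof(database_row), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_query, h_query, sizeof(c_query));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    match_and_distance<<<blocks, threads>>>(d_rows, n, here, d_dist2);

    cudaMemcpy(h_dist2, d_dist2, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_rows);
    cudaFree(d_dist2);
}
```

The closest place of interest is then the smallest entry of the output array, which could be found on the host or with a reduction on the device.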

However, access to global memory (even though it is hopefully coalesced) is slow (400-600 cycles, according to the docs). Therefore, my technique is not efficient. I feel I could optimise things by using shared memory, but I’m at a complete loss. I suspect it has something to do with blocks (and their dimensions), but I can’t work it out.
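One possible starting point, sketched under assumptions rather than offered as a tuned solution: let each block copy a tile of rows into shared memory with neighbouring threads reading neighbouring 32-bit words, then let one thread per row do the comparison out of shared memory. Since every row is read only once, this does not reduce the amount of global traffic; it only changes the access pattern, so whether it is faster would need measuring. The names `TILE_ROWS`, `c_query_words`, `match_tiled` and `run_match_tiled` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

#define CATEGORY_WORDS 64
#define TILE_ROWS 32   // rows staged per block (assumed value)

struct __align__(32) database_row {
    char4 category[CATEGORY_WORDS];
    float2 coordinates;
};

#define ROW_WORDS ((int)(sizeof(database_row) / sizeof(int)))  // 72 words
#define PADDED_WORDS (ROW_WORDS + 1)  // odd stride, to avoid shared-memory bank conflicts

// Query category stored as 32-bit words (4 characters per word).
__constant__ int c_query_words[CATEGORY_WORDS];

__global__ void match_tiled(const database_row* rows, int n,
                            float2 here, float* dist2)
{
    __shared__ int tile[TILE_ROWS * PADDED_WORDS];

    int first = blockIdx.x * TILE_ROWS;
    int count = min(TILE_ROWS, n - first);
    const int* src = reinterpret_cast<const int*>(rows + first);

    // Cooperative copy: consecutive threads read consecutive words.
    for (int idx = threadIdx.x; idx < count * ROW_WORDS; idx += blockDim.x) {
        int r = idx / ROW_WORDS;
        int w = idx % ROW_WORDS;
        tile[r * PADDED_WORDS + w] = src[idx];
    }
    __syncthreads();

    // One thread per staged row does the comparison from shared memory.
    if (threadIdx.x < count) {
        const int* row = tile + threadIdx.x * PADDED_WORDS;
        bool match = true;
        for (int w = 0; w < CATEGORY_WORDS && match; ++w)
            match = (row[w] == c_query_words[w]);

        // The coordinates sit right after the 64 category words.
        float dx = __int_as_float(row[CATEGORY_WORDS]) - here.x;
        float dy = __int_as_float(row[CATEGORY_WORDS + 1]) - here.y;
        dist2[first + threadIdx.x] = match ? dx * dx + dy * dy : FLT_MAX;
    }
}

// Host wrapper (hypothetical); error checks omitted to keep the sketch short.
void run_match_tiled(const database_row* h_rows, int n, const char4* h_query,
                     float2 here, float* h_dist2)
{
    database_row* d_rows = 0;
    float* d_dist2 = 0;
    cudaMalloc(&d_rows, n * sizeof(database_row));
    cudaMalloc(&d_dist2, n * sizeof(float));
    cudaMemcpy(d_rows, h_rows, n * sizeof(database_row), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_query_words, h_query, sizeof(c_query_words));

    int blocks = (n + TILE_ROWS - 1) / TILE_ROWS;
    match_tiled<<<blocks, 64>>>(d_rows, n, here, d_dist2);  // 64 threads >= TILE_ROWS

    cudaMemcpy(h_dist2, d_dist2, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_rows);
    cudaFree(d_dist2);
}
```

The block dimension matters here in two ways: it must be at least `TILE_ROWS` so that every staged row gets a thread, and the tile (32 rows of 73 words, about 9 KB) has to fit in the shared memory available to a block.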

So any help would be appreciated, from pointers to a tutorial or documentation (other than the Best Practices and Programming Guides, which I have already read) to another forum thread or even the start of a solution.
