Global memory caching

Hello,

Given that global memory accesses are cached on devices of sufficient compute capability: if I pre-fetch data from global memory by loading it into shared memory, sufficiently long before actually using the data now stored in shared memory, this would likely result in a cache hit…

Or am I mistaken?

Something like:

shared_memory_variable = global_memory_variable

[code using neither; of sufficient time span]

[code eventually using shared_memory_variable]
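
Or, more concretely in CUDA - a rough sketch only, with the kernel name, buffer size and indexing all made up:

__global__ void staging_kernel(const float *global_in, float *global_out)
{
    __shared__ float staged[256];   // assumes a block of 256 threads

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // the "pre-fetch": read global memory well before the value is needed
    staged[tid] = global_in[i];

    // [code using neither; of sufficient time span]

    // each thread reads back only its own slot, so no __syncthreads() is
    // strictly needed here; it would be if threads shared the staged data
    global_out[i] = staged[tid] * 2.0f;
}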

I’m confused by your example. You only read global memory once (so there is no cache hit), write to shared memory, and then read from shared memory. Shared memory is not cached (it doesn’t need to be, since it is basically as fast as L1). Do you have something else in mind?

Seibert: thanks for the reply

The loading-to-shared-memory part really assumes that this would aid a cache hit, given that the data loaded into shared memory is not used immediately; the inner workings of global memory caching are not discussed in much detail anywhere, hence I was hoping that this might somehow hint the compiler

Perhaps I should rephrase as such: given that global memory is cached on devices of sufficient compute capability, are there any cache instructions, or methods in general, to aid cache hits?

My global memory accesses are rather conditional, such that they are sure to result in cache misses when execution reaches the global memory read points
However, my algorithm/ kernel is sufficiently large that global memory “pre-fetching” becomes sensible
Sometimes with a little overhead, and sometimes with none, I can pre-determine - or simply know in advance - the global memory reads down the line, with their addresses known
Is there any way to “manipulate” or “aid” global memory caching under these conditions, to increase cache hits? Put differently, can global memory caching in any way imply global memory “pre-fetching”?

For some reason I do not believe that __ldg() will allow me to pre-fetch global memory; am I mistaken - will __ldg() do the job?
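
As far as I can tell from the documentation, __ldg() simply routes a load through the read-only data cache (compute capability 3.5 and up). A minimal usage sketch, with the kernel name and parameters being my own placeholders:

__global__ void ldg_kernel(const float * __restrict__ in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // __ldg() loads through the read-only data cache; it does not promise
    // the cache line will still be resident if the address is read again later
    float x = __ldg(&in[i]);

    out[i] = x;
}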

Consider this:

__global__ void kernel()
{
    function_A();

    function_B();

    function_C();

    function_D();
}

Prior to function B, you already know that function D will need global memory variable x; the ideal would be for function D not to have to endure global memory latency to obtain x
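
In code, I picture pulling the load forward into a register, something like this (a sketch only; the stub functions and indexing are placeholders, and the compiler is of course free to reschedule the load):

__device__ void function_A() { }
__device__ void function_B() { }
__device__ void function_C() { }
__device__ void function_D(float x) { }   // consumes the pre-fetched x

__global__ void kernel(const float *g_mem)
{
    function_A();

    // issue the read for x early; the load can be in flight while B and C run
    float x = g_mem[blockIdx.x * blockDim.x + threadIdx.x];

    function_B();
    function_C();

    // by the time D consumes x, much of the latency has hopefully been hidden
    function_D(x);
}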

Also, function C requires 3 (any number above 1, really) global memory variables - g, h, j - to complete execution; here the ideal would be for the total global memory latency to approach that of a single global memory variable, rather than that of 3 variables fetched in series (hopefully by issuing the global memory requests in close succession, without having to wait for any one to finish before the next can start)
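
For the function C case, the hope is that issuing the three independent reads back to back lets the loads overlap; a sketch, with all names hypothetical:

__device__ float function_C_body(const float *g_g, const float *g_h,
                                 const float *g_j, int i)
{
    // the three reads are independent, so issued back to back
    // they can all be in flight at the same time...
    float g = g_g[i];
    float h = g_h[i];
    float j = g_j[i];

    // ...and the combined stall is closer to one memory latency than three
    return g + h + j;
}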

I shall try to achieve this with __ldg(); no idea whether it will work

I think the problem you will run into is that the cache is not very large compared to the number of threads, so it is unlikely that the variable x will still be in the cache by the time you get to function D. This is a place where shared memory is useful, since you manage it explicitly.

Does the lifetime of variables loaded into the cache by __ldg() compare to that of ordinarily read variables?
When are variables loaded into the cache by __ldg() evicted?

Shared memory is not large enough to store all of the algorithm or kernel’s data, so I am compelled to use global memory as well

The global memory required by the individual functions is sufficiently small that it can be buffered in shared memory for the duration of each function
A number of these individual functions require multiple global memory variables, and according to my calculations this will result in multiple global memory requests, all in series: hit global memory request A, fetch global memory A, wait for global memory A, move on a few instructions until another global memory request B; again fetch global memory B, wait for global memory B…
If I could at least use caching to fetch all of the global memory required within a function in close sequence - cache global memory A, B, C… then execute the function - that would already be a win
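
i.e. roughly this shape - a sketch, with buffer sizes, names and indexing all hypothetical:

__global__ void function_kernel(const float *g_A, const float *g_B, const float *g_C)
{
    __shared__ float s_A[128], s_B[128], s_C[128];   // assumes 128 threads per block

    int tid = threadIdx.x;

    // issue all of the function's global reads in close succession,
    // so the requests overlap instead of serializing
    s_A[tid] = g_A[tid];
    s_B[tid] = g_B[tid];
    s_C[tid] = g_C[tid];

    __syncthreads();

    // ... execute the function against the shared-memory copies ...
}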