About LDU

I have a program about numerical calculation, it looks like this:
// Initialization of Out1, Out2, …
for ( ; ; ) {
Out1 = f1(a, b, …, Out1, Out2, …);
Out2 = f2(a, b, …, Out1, Out2, …);

if (fout(Out1, Out2, …) == 0)
Here a, b, … are threads-independence (all threads access the same addresses), Out1, Out2, … are threads-dependence.
All the variables are double, and they are too many to be put to shared memories.
So my questions is:

  1. Should all the variables be put in global memories?
  2. Is there a good way to make the access of variable a, b, etc faster (for example, putting them to some caches)?
    From the manual, texture and surface memories didn’t work for it. Maybe LDU can work, but in the original codes, there are too many equations, did I need to manually re-write all the equations one by one? Or there is some automatic solution?


I dont fully understand what you are asking.
However I wouldn’t worry about speed of access to a and b very much as the reads will be coalesced.

LDU (load uniform) is a Fermi-specific instruction. It requires that

(a) The data accessed is strictly read-only thoughout the entire kernel
(b) All threads in a warp access the same location, for any access to the data

If the compiler can establish that these requirements are fulfilled, it will be able to generate LDU, thus allowing the data to be read via the constant cache. The compiler in general has a single threaded view of he world. It therefore needs to change condition (b) to “all threads in the thread block must access the same location”. This means that any index for accessing the data may not be based, directly or indirectly, on threadIdx. As for condition (a), simply marking the data const is not sufficient, because the data may be accessed through a different path on which it is writable. To allow the compiler to establish that (a) holds, pointer arguments to the kernel must therefore be non-aliased, which the programmer asserts by making the pointer restricted pointers. Your kernel must therefore look something like this:

__global__ void foo (T scalar_arg0, ..., 
                     T scalar_argN, 
                     const T * __restrict__ read_only_ptr0, ...,
                     const T * __restrict__ read_only_ptrN,
                     T * __restrict__ read_write_ptr0, ...,
                     T * __restrict__ read_write_ptrN);

In other words, all pointer arguments must use restricted pointers which requires that there is no aliasing between pointers, meaning that each pointer must point to a separate, non-overlapping, area of memory. All pointers to read-only data must indicate that by use of const. You may need to apply the same treatment to device functions called inside the kernel. Using const and restrict is good practice in general as it supplies the maximum amount of information to the compiler.

Once you have const and restrict in place, you should see LDU generated as long as the indexing into read-only arrays does not depend in any way on thread index. You can check the generated SASS machine code by running cuobjdump --dump-sass on the resulting executable, and searching for LDU. Since LDU is a Fermi-specific instruction you would want to look at the sm_20 code only in case you are building fat binaries.

[Sorry for the poor formtting of the code snippet, it seems there is no code-formatting option yet in the new forums, I will check with the web team on that. Never mind, code markup is available, it is just that the symbol for it in the task bar (rightmost) is so small and faint I overlooked it]