Slow local memory, feigned constant memory. coalesced? global?

who knows… :ermm:

If it can’t, then one could get better performance with hard-coded coalesced accesses to global memory.

Sure, but you did not copy the first line of that code snippet! You have cudaMalloc and cudaMemcpy, but you do not have “float *d_A;”; instead you have “__constant__ float *constant_a”, which is completely different!

Programming Guide Page 20

Programming Guide Page 106, B.2.2 __constant__

It has the lifetime of the application, meaning you do not allocate it manually because it already exists!
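To make it concrete, here is a minimal sketch of the correct pattern (the names c_A, h_A, N and useConstant are mine, not from the original snippet):

[code]
#include <cuda_runtime.h>

#define N 256

// Declared at file scope: it exists for the lifetime of the application,
// so it is never allocated with cudaMalloc.
__constant__ float c_A[N];

__global__ void useConstant(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * c_A[i];   // served by the constant cache on a hit
}

int main(void)
{
    float h_A[N];
    for (int i = 0; i < N; ++i) h_A[i] = (float)i;

    // Copy to the symbol itself; no device pointer, no cudaMalloc.
    cudaMemcpyToSymbol(c_A, h_A, N * sizeof(float));

    float *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(float));
    useConstant<<<1, N>>>(d_out);
    cudaFree(d_out);
    return 0;
}
[/code]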

However, local memory resides in global memory, and although accessing it may be faster than accessing plain global memory, it will still be much slower than accessing registers, shared memory, or constant memory (upon a cache hit). I believe it will still be detected by the CUDA profiler as a global memory access (although I am not sure about this one).

Programming Guide page 88, 5.1.2.2 Local Memory
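To illustrate what typically ends up there, a minimal sketch (the placement decision is ultimately the compiler’s, so this is an assumption about typical behaviour): a per-thread array indexed with a value unknown at compile time usually cannot be kept in registers.

[code]
__global__ void localExample(float *out, int i)
{
    float arr[32];                   // per-thread array
    for (int j = 0; j < 32; ++j)
        arr[j] = (float)j;
    // dynamic index: the compiler will likely place arr in local memory
    out[threadIdx.x] = arr[i & 31];
}
[/code]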

Firstly, thank you Cygnus X1 for pointing out the wrong usage of __constant__ and providing correct code and references to the CUDA docs. That makes my understanding of the problem complete.

Yes, unfortunately registers and shared memory are scarce resources (a maximum of 128 registers per thread and 16K of shared memory per block), so we can’t just do it using them.

In cudaprof there are special counters - local load and local store (Number of local memory loads and Number of local memory stores). For global memory there are other counters, such as gst_32b, gst_64b, gst_128b (Number of 32 byte, 64 byte and 128 byte global memory store transactions) and gld_32b, gld_64b, gld_128b (Number of 32 byte, 64 byte and 128 byte global memory load transactions).
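(If I read the counter definitions correctly, which is an assumption on my part, the total bytes moved can be estimated from them, e.g. for stores: bytes stored = 32 * gst_32b + 64 * gst_64b + 128 * gst_128b, and analogously for loads with the gld_* counters.)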

[s]I actually found the answer to the question concerning the amount of global stores exceeding the amount of global loads:

cudaprof help doc:

[/s]

Wrong, wrong, wrong. Cudaprof reported LOCAL stores exceeding LOCAL loads, so it is not applicable.

Dear Sarnath,

your post tells us about how the local data is (or could be?) organized in the current CUDA release. Have you found any proof of that?

Thank you, avidday and Cygnus X1, for the comments on constant memory usage. Now I have almost no global memory reads, and the hit counter is high.

Actually, you also have a limitation of 16K registers per SM (streaming multiprocessor), yet an SM can handle up to 1024 threads. Therefore, to reach maximum occupancy you are limited to 16 registers per thread. It is not always good to sacrifice performance for maximum occupancy, as the gain may be smaller than what you pay, but usually something like 50% occupancy is good; that gives 32 registers per thread. And this is only for devices of compute capability 1.2 or higher; smaller cards have only 8K registers per SM. On the other hand, if you use 128 registers, you will be able to run only 128 threads in parallel, which will be slow: the GPU won’t be able to hide global memory latency, or even some hazards with shared memory and registers! You need at least 192 threads per SM for that, if I recall correctly.
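If it helps, the register budget can also be enforced at compile time with nvcc’s -maxrregcount option; a minimal sketch (mykernel.cu is a placeholder name):

[code]
nvcc -maxrregcount=32 -o mykernel mykernel.cu
[/code]

The compiler then spills anything beyond 32 registers per thread to local memory, which may or may not be a net win.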

I stand corrected.

AFAIK:

you have a limit of 512 threads per block (can we run several blocks on the same multiprocessor without the “context” being switched/swapped, if we can discuss the “context” term here at all?).

one can never tell which occupancy will give the best performance. One can’t tell whether running a program with 192 threads will give better performance (run faster, do more work in the same time) than running the same program with 64 threads, especially if you have everything (buffered) in registers, shared memory, or constant memory (the slowest amongst them).

We are getting a bit off-topic here. It is a fact that I can’t use registers for my type of application; instead, local memory is used. Local memory is slow (it is not cached; it has a latency of 400-600 cycles per access). But it is coalesced by default. The question is whether this default coalescing is the best one.

As long as you don’t make sacrifices (e.g. spilling excess registers to local memory), more threads will give better performance.

Also, what I said is that there is a change in the steepness of the gain at 192 threads. Below 192, if you increase the number of threads, you gain more than with the same optimisation above 192. This is because below 192 threads there may be some hazards at the register level, and the device has to wait idle to ensure correct behaviour. With over 192 threads this is well hidden by simply swapping execution to a different warp.

Best Practices Guide, page 36, 3.2.6. Registers

For comparison, standard floating-point operations (e.g. addition) take 4 cycles per warp to execute. So if you launch 64 threads (two warps), an addition would probably be executed as follows:

cycle 0-3: warp 0 performs addition

cycle 4-7: warp 1 performs addition

cycle 8-23: SM remains idle

cycle 24-27: warp 0 performs next operation…

So 16 of every 24 cycles, i.e. 66% of the time, are wasted on every instruction! Not to mention the time wasted on global memory accesses…

The number of threads per block has nothing to do with it. Several blocks can run on the same SM if there are enough resources (registers and shared memory) to do so.
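For example (numbers assumed for illustration): on a compute capability 1.2 device with 16K registers per SM, a kernel using 16 registers per thread and 256 threads per block consumes 256 * 16 = 4096 registers per block, so up to 16384 / 4096 = 4 blocks can be resident on one SM at a time, unless shared memory usage or the hardware limit of 8 blocks per SM caps it first.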

I used to implement this myself whenever I had local memory problems. Every thread accessing the same local variable can always be made coalesced if you arrange the local memory in a particular order; see the sketch below. So I raised this issue in the wish-list thread… I remember tmurray or someone from NVIDIA confirming that this is already available from some CUDA release. I will see if I can locate that thread… Thanks!
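For the record, a sketch of the arrangement I mean (the names buf, mySlot, slots are mine; this is the idea, not NVIDIA’s implementation): element j of every thread is stored interleaved across the whole grid, so a warp always touches consecutive words.

[code]
// Manual "local" storage in global memory, laid out so that accesses
// coalesce: element j of thread tid lives at buf[j * totalThreads + tid].
__device__ float* mySlot(float *buf, int j)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int totalThreads = gridDim.x * blockDim.x;
    return &buf[j * totalThreads + tid];
}

__global__ void kernel(float *buf, int slots)
{
    for (int j = 0; j < slots; ++j)
        *mySlot(buf, j) = 0.0f;   // each warp writes 32 consecutive floats
}
[/code]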

OK, thanks to some browsing and pointers from Big_Mac on a previous thread, the programming guide itself has this info: Section 5.1.2.2. Let me quote it for you…

Looks like local memory accesses are counted the same way as global accesses are. See my post here.

Update: local memory accesses are therefore coalesced in my program:

If doing a total of 4096 iterations per block, with 3249 reads and 1809 writes per iteration, on 8-byte values:

Read:

4096 * 3249 * 8 = 106’463’232 bytes read per block.

If coalescing to 128 bytes, a total of 106’463’232 / 128 = 831744 128-byte load transactions are made per block.

Cudaprof load counter (per block) is 828416.

Write:

4096 * 1809 * 8 = 59’277’312 bytes written per block.

If coalescing to 128 bytes, a total of (59’277’312 / 128) * 8 = 3704832 128-byte store transactions are made per block.

Cudaprof store counter (per block) is 3704832.

Thus, I have to admit that local memory is well coalesced, and therefore there is no need to use global memory instead, because the results would be the same.

So, Sarnath, this can serve as proof of what you once said about local memory coalescing.

Dear avidday, you seem to be wrong here:

And I wonder if there is any way to reduce partition camping here…