Slow local memory, feigned constant memory. coalesced? global?

Man, I tried what you say - just using this pointer to the constant memory, without ever passing it. It complained that I can’t do so, access violation. Don’t remember the exact phrase.

Wait, can there be a pointer to the constant memory? Probably if I change it to simple array - even though it is the same pointer - it won’t complain… Should give a try.

The copying and initialization code you have there looks wrong to me. I don’t believe constant memory can be dynamically allocated like that. You can have statically declared constant variables and copy to them at run time using cudaMemcpyToSymbol, but I am pretty sure that code you posted will just wind up with your data in global memory.

Ok, I should have checked how it is actually done :) Will do.

Hope you are right and constant memory accesses can perform better.

Some things to point out:

All of your code examples are completly serial. All threads perform exactly the same thing, which is kinda pointless. Are you aware of that?


__constant__ double* constant_a;

This means that a pointer to double resides in constant memory, but the actual data is not! As a result your line

volatile_a[i][j] = constant_a[i * 16 + j];

will be compiled to something like this:

offset1:=i * 16 + j;




If you want a whole array to be in constant memory, write this instead:

__constant__ constant_a;


error_here = cudaMemcpy(constant_a, host_constant_a, size_constant_a, cudaMemcpyHostToDevice);

constant_a (as it is declared so far) has a value which resides on GPU, thus cannot be accessed directly on the host side. I am surprised the compiler did not complain about this thing. Unless you use cudaemu, you should read some garbage instead. So you move host_constant_a data to some random position in GPU memory.

Similarly &constant_a should raise a compile error. You use cudaGetSymbolAddress (or something like that) to get address of device (or constant?) variables. Alternatively you may want to use cudaCopyToSymbol (or something like that) function to store data into constant memory. Constant variables are like global variables, they are allocated upon declaration so no need to call for cudaMalloc on them.


double volatile_a[16][16];

Declares an array for each thread. Since it is big, it cannot fit into 16 (or 32) of 32-bit registers, it is stored in local memory. Physically local memory is a global memory. As a result all you do in that loop is to copy data from one place in global memory to another place in global memory!

Dear Cygnus X1,

Thank you for your helpful comments.

I will follow your path.


  1. They are serial, they are doing almost the same work because it is Monte-Carlo algorithm implementation. But the code I posted here is not the whole program, there are some other calculations going on there.


  1. That makes sense! Same as const in C++. So in order to place the data in constant I do
__constant__ double constant_a[16][16];


  1. This was taken from the NVIDIA CUDA Programming Guide, where they do
float* d_A; 

	cudaMalloc((void**)&d_A, size); 

	cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

and it works fine for me. Sure, in the case if we have data in the constant memory, we want to get an address of it in the first place. Thank you for mentioning that.

It’s the first time I hear so.


Sure it is in local :) I think it is no doubt in the local memory, also because indexing is not static, loops are not unrolled.

Yes, all I do is copying data from constant memory space to global local memory, and then I do some operations on that local data (solve the system, to be exact, function is given in my second post here).

All answers concern constant memory so far. Any comments on local memory usage? On how it coalesced/non-coalesced? :huh:

Local memory is scoped on a per thread basis. There should be no concept of “coalesced” access for local memory – coalesced implies all threads within a half-warp accessing sequential words within a 16 word-long, 16 word aligned memory segment. That clearly can’t happen with local memory structures or arrays, or anything which is more than word length, and probably doesn’t happen even for those cases.

Local Memory is arranged by the compiler and/or run-time in memory. One can always make it coalesced and I believe this feature was introduced in some CUDA release…

I don’t doubt that, under the correct conditions, thread level reads from local memory might be serviced using 32 byte, 64 bytes, or even 128 byte transactions, but that isn’t the same thing as coalescing memory reads. Coalescing memory reads would imply that half-warp or warp sized local memory would be “striped” somehow to allow the half warp to be serviced by a single transaction, and that some magic local area access pattern at the half warp level would yield better bandwidth than others. That can’t happen, can it?

Local memory accesses are always coalesced though since they are per-thread by definition.

As for my understanding, I would prefer thinking of it as of some special way of allocating data - so that accesses are always coalesced when threads read element 0 of local data, when they read element 1 of local data, and so on. There was a thread on the forum where a guy was asking this exact question - what does coalescing of local data mean. Need to look up for this thread, probably someone from NVIDIA clarified the case.

who knows… :ermm:

If it can’t, than one could get better performance with hard-coded coalesced accesses to global memory.

Sure, but you did not copy the first line of that code snippet! You have cudaMalloc, you have cudaMemcpy, but you do not have “float *d_A;”, instead you have “constant float *constant_a” which is completly different!

Programming Guide Page 20

Programming Guide Page 106, B.2.2 constant

Has the lifetime of an application, meaning you do not allocate it manually beause it already exists!

However local memory resides in global memory and although accessing it may be faster than accessing plain global memory, it will be still much slower than accessing registers, shared or constant (upon cache hit) and I believe it will be still detected by cuda profiler as global memory access (although I am not sure about this one).

Programming Guide page 88, Local Memory

Firstly, thank you Cygnus X1 for pointing out the wrong usage of constant and providing with correct code and references to CUDA docs. That makes my understanding of the problem complete.

Yes, unfortunately registers and shared memory are scarce resources (max of 128 registers per thread and 16K of shared memory per block), so we cant just do it using them.

In cudaprof there are special counters - local load and local store (Number of local memory loads and Number of local memory stores). For global memory there are other counters, such as gst_32b, gst_64b, gst_128b (Number of 32 byte, 64 byte and 128 byte global memory store transactions) and gld_32b, gld_64b, gld_128b (Number of 32 byte, 64 byte and 128 byte global memory load transactions).

[s]I actually found the answer for the question concerning amount of global stores exceeding amount of global loads:

cudaprof help doc:


Wrong wrong wrong. Cudaprof reported LOCAL stores exceeding LOCAL loads, so it is not applicable.

Dear Sarnath,

your post tells us about how the local data is (or could be?) organized in current CUDA release. Have you found any proof for that?

Thank you, avidday and Cygnus X1, for comments on constant memory usage. Now I have almost no global memory reads, and hit counter is high.

Actually you also have a limitation of 16K registers per SM (stream multiprocessor), yet SM can handle up to 1024 threads. Therefore, to reach maximum occupancy you are limited by 16 registers per thread. While it is not always good to sacrifice performance for maximum occupancy as the gain may be smaller than what you pay, usually something like 50% occupancy is good. That gives 32 registers per thread. And this is only for machines with compute capability 1.2 or higher. Smaller cards have only 8K registers per SM. On the other hand, if you use 128 registers, you will be albe to run only 128 threads in parallel which will be slow - GPU won’t be able to hide global memory access but also some hazards with shared memory or even registers! You need at least 192 threads per SM for that if I recall correctly.

I stand corrected.


you have limit of 512 threads per block (can we run several blocks on the same multiprocessor without “context” being switched/swapped, if we can discuss the “context” term here et all).

one can never tell what occupancy will give best performance. one can’t tell if running program in 192 threads will give better performance (run faster, do more work in the same time) than running the same program in 64 threads. Especially if you have everything (buffered) in registers, shared memory, constant memory (the slowest amongst them).

We are getting a bit off-topic here. It is fact that I can’t use registers for my type of application. Instead, local memory is used. Local memory is slow (it is not cached, it is delayed by 400-600 cycles per access). But it is coalesced by default. The question is whether this default coalescing is the best one.

As long as you don’t make sacrifices (e.g. spilling excessive registers to local memory) more threads will make better performance.

Also, what I said, is that there is a change of gain steep at 192 threads. Below 192, if you increase number of threads, you gain more than the same optimisation above 192. This is because if you are below 192 there may be some hazards on register level and device has to wait idle to ensure correct behaviour. With over 192 threads this is well hidden by simply swapping execution to different warp.

Best Practices Guide, page 36, 3.2.6. Registers

For comparison standard floating point operations (e.g. addition) takes 4 cycles per warp to execute. So if you launch 64 threads an addition would probably be executed as follows:

cycle 0-3: warp 0 performs addition

cycle 4-7: warp 1 performs addition

cycle 8-23: SM remains idle

cycle 24-27: warp 0 performs next operation…

So 66% of time is wasted on every instruction! Not to mention time wasted on global memory access…

Number of threads per block has nothing to do with it. There can be several blocks running on the same SM if there is enough resources (registers and shared mem) to do so.

I used to implement this myself whenever I have local memory problems. Every thread accessing the same local variable can always be made coalesced if u arrange the local memory in a particular order… So, I raised this issue in the wish list thread… I remember tmurray or some1 from NVIDIA, conforming that this is already available from some CUDA release. I will see if I can locate that thread… Thanks!

Ok, Thanks to some browsing and pointers from Big_Mac on a previous thread – The programming guide itself has this info. Section Let me quote this for you…

looks like local memory accesses are counted the same way as global accesses are. See my post here.

Upd: Local memory accesses therefore are coalesced in my program:

if doing total of 4096 iterations per block,

with 3249 reads per iteration, 1809 writes per iteration,

8 byte values:


4096 * 3249 * 8 = 106’463’232 bytes read per block.

If coalescing to 128 bytes, total of 106’463’232 / 128 = 831744 128-byte load transactions is made per block.

Cudaprof load counter (per block) is 828416.


4096 * 1809 * 8 = 59’277’312 bytes written per block.

If coalescing to 128 bytes, total of (59’277’312 / 128 ) * 8 = 3704832 128-byte store transactions is made per block.

Cudaprof store counter (per block) is 3704832.

Thus, I have to admit that local memory is well-coalesced and therefore there is no need to use global memory instead, because results will be the same.

So, Sarnath, it can be a proof of what you said once about local memory coalescing.

Dear avidday, you sound wrong here:

And I wonder if there is any way to reduce partition camping here…