Constant memory accesses by threads of a half warp

I need to minimize accesses to global memory by using constant memory. I want to know the total access time required when, within a half warp, 12 threads access the same constant memory location and the other 4 threads access some other constant memory location. Please explain it in terms of the global memory access time for a single-precision floating-point number.

Side remark: to my recollection, half-warp is only a meaningful concept for sm_1x devices; for sm_20 and later architectures the warp is the relevant unit of execution.

Constant memory is basically a mapped section of global memory. An access to constant memory that misses the (small) constant cache will be equivalent to a global memory access. The constant cache is primarily designed for broadcast traffic where each thread in a warp receives the same data.

If different constant memory locations are accessed across the threads in a warp, the constant cache must be accessed as many times as there are different addresses across the warp. In your example of two different addresses, the cache must be accessed twice.

So there is serialization in such a case, and parallelism is lost. As long as a sufficient number of threads are running to cover basic latencies, a constant memory access that hits the constant cache has a cost similar to a register access. Based on experiments I performed years ago, constant cache accesses that hit in the cache and were serialized up to degree 3 or 4 were well tolerated from a performance standpoint (compared to putting the data in shared memory, for example).
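To make the broadcast vs. serialized pattern concrete, here is a minimal sketch (the coeffs array and the 12/4 split are only illustrative, not taken from the original question):

__constant__ float coeffs[64];            // filled by the host via cudaMemcpyToSymbol()

__global__ void uniform_access(float *out)
{
    // Every thread in the warp reads the same address: one broadcast
    // from the constant cache, i.e. a single cache access per warp.
    out[threadIdx.x] *= coeffs[0];
}

__global__ void divergent_access(float *out)
{
    // Per half warp, 12 threads read coeffs[0] and 4 threads read coeffs[1]:
    // two distinct addresses, so the constant cache is accessed twice
    // for the warp (2-way serialization).
    int idx = (threadIdx.x % 16 < 12) ? 0 : 1;
    out[threadIdx.x] *= coeffs[idx];
}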

Since cache behavior is typically influenced by dynamic context, it may be best to prototype various design choices as to where to place the data. Other choices could be shared memory, textures, or global memory in conjunction with the LDG instruction on sm_35 or higher (see the __ldg() intrinsic).
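As an example of that last option, a read-only global load through __ldg() might look like the sketch below (kernel and array names are made up; requires sm_35 or higher):

__global__ void scale_with_table(const float * __restrict__ table,
                                 float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __ldg() routes the read through the read-only (texture) data path.
        out[i] *= __ldg(&table[i % 32]);
    }
}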

I know at least on Maxwell that a constant access has the same latency as a register access (none), so long as the value is in the constant cache. Constant cache lines are 256 bytes in size, so as soon as you request any value in a 256-byte block, that block gets loaded from global memory (at high latency), and any subsequent accesses in that block are then free. There may be a sub-cache level of 64 bytes where the accesses are actually free (versus a few extra clocks), but it’s not really that important.
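As a sketch of what the 256-byte line size implies (array name and offsets are hypothetical):

__constant__ float lut[128];              // 512 bytes, spanning two 256-byte cache lines

__global__ void cacheline_demo(float *out)
{
    // lut[0] and lut[32] (byte offsets 0 and 128) sit in the same 256-byte line,
    // so the first read pays the miss latency and the second should hit.
    // lut[64] (byte offset 256) is in the next line and pays its own miss
    // the first time that line is touched.
    out[threadIdx.x] = lut[0] + lut[32] + lut[64];
}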

I forget what the total size of the constant cache is.

So in short: use constants whenever you can to save on register allocations. If the register count matters enough (for occupancy), then having a divergent warp might be worth the cost.
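For completeness, a minimal sketch of how data gets into constant memory in the first place (all names here are hypothetical):

#include <cuda_runtime.h>

__constant__ float filter[16];

// Host side: copy the coefficients into the constant bank before the launch.
void setup_filter(const float *host_coeffs)
{
    cudaMemcpyToSymbol(filter, host_coeffs, 16 * sizeof(float));
}

__global__ void apply_filter(float *data)
{
    // Uniform filter[k] reads hit the constant cache and can often be folded
    // into instruction operands, so no registers are tied up holding them.
    float v = data[threadIdx.x];
    for (int k = 0; k < 16; ++k)
        v *= filter[k];
    data[threadIdx.x] = v;
}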

@njuffa and scottgray, I understand that the access requests to different constant memory locations would get serialized. But I am not getting the effective time (approx.) it would take in the following scenario:
Consider thread-block1, in which warp1 accesses constant memory location FF22, 24 threads of warp2 access the same location FF22, and the other 8 threads of warp2 access location FF23.

At the same time, consider thread-block2, in which warp1 accesses constant memory location FF24, 26 threads of warp2 access the same location FF24, and the other 6 threads of warp2 access location FF25.

What would be the effective access time (considering any serialization) in the above case? Also, please clarify the mapping between the constant cache and any particular thread-block (in terms of the amount of constant cache per block).

I am relatively new to CUDA programming and have almost null exposure to constant memory paradigms.

I believe it’s still 64 KB but the documentation is a bit obtuse on this.

AmanSinha: The constant cache (like the texture cache) I believe is shared between blocks (unlike shared memory). So only one warp will need to suffer the initial high latency of loading a cache line.

As far as what you should be seeing performance wise… it’s really hard to say without looking at the cuobjdump -sass output to see what the compiler is actually doing. My comments about warp divergence were assuming the ability to make assembly level optimizations.
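If you want a concrete number for your particular pattern, timing it on your own device is probably the most reliable route. A rough sketch with CUDA events (block/warp layout and iteration count are placeholders; for a real measurement you may have to stop the compiler from hoisting the constant load out of the loop):

#include <cstdio>
#include <cuda_runtime.h>

__constant__ float cvals[2];

__global__ void mixed_access(float *out, int iters)
{
    // 24 of every 32 threads read cvals[0], the other 8 read cvals[1],
    // mimicking the warp2 pattern in the question.
    int idx = (threadIdx.x % 32 < 24) ? 0 : 1;
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc += cvals[idx];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

int main()
{
    float h_cvals[2] = {1.0f, 2.0f};
    float *d_out;
    cudaMemcpyToSymbol(cvals, h_cvals, sizeof(h_cvals));
    cudaMalloc(&d_out, 2 * 64 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    mixed_access<<<2, 64>>>(d_out, 1 << 20);   // two blocks of two warps each
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %f ms\n", ms);

    cudaFree(d_out);
    return 0;
}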

That would only be true if one block is sufficiently far ahead in execution to cover the entire latency. If they follow each other closely, both will have to wait for the data to arrive from global memory.

Are there any differences between the new Maxwell generation GPUs and Kepler when it comes down to __ldg() loads?

CudaaduC: I’m not aware of any, except that the texture cache is now the L1 cache for Maxwell and serves as the coalescing buffer for global. So that may imply some subtle changes for LDG.

tera: good point. Though there are two different ways in which constants can be loaded. The first is directly as one of the operands to an instruction. This bypasses the need to first store the value in a register, though it will cause a long stall for the warp(s) if the value isn’t in the cache. The data in this format must be 4 bytes in size.

The second is with the LDC instruction, which loads the data into registers first. These requests can be 1 to 16 bytes in size and won’t stall the warp until the loaded registers are needed as operands in other instructions. Here’s info on how that’s coordinated:

[url]https://code.google.com/p/maxas/wiki/ControlCodes[/url]
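At the CUDA C level, the difference roughly corresponds to the two sketches below (my own example; whether the operand form is actually emitted depends on the architecture and compiler version):

__constant__ float poly[4];

__device__ float eval_uniform(float x)
{
    // Uniform, compile-time-indexed accesses: the compiler can encode poly[k]
    // directly as c[bank][offset] operands of the FFMA instructions.
    return ((poly[3] * x + poly[2]) * x + poly[1]) * x + poly[0];
}

__device__ float eval_indexed(float x, int k)
{
    // Run-time index k: the value is typically loaded into a register with LDC
    // first and then used as an ordinary register operand.
    return poly[k] * x;
}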

Re: The data in this format must be 4 bytes in size

I assume you are describing Maxwell (which I have no experience with)? Kepler (sm_3x) also supports constant references for 8-byte data in double-precision instructions. Example from a call to DP exp():

/*0108*/ DFMA R12, R8, c[0x2][0x18], R12;   /* 0x5b803040031c2032 */
/*0110*/ DFMA R12, R12, R8, c[0x2][0x20];   /* 0x9b802040041c3032 */
/*0118*/ DFMA R12, R12, R8, c[0x2][0x28];   /* 0x9b802040051c3032 */
/*0120*/ DFMA R12, R12, R8, c[0x2][0x30];   /* 0x9b802040061c3032 */
/*0128*/ DFMA R12, R12, R8, c[0x2][0x38];   /* 0x9b802040071c3032 */
/*0130*/ DFMA R12, R12, R8, c[0x2][0x40];   /* 0x9b802040081c3032 */
/*0138*/ DFMA R12, R12, R8, c[0x2][0x48];   /* 0x9b802040091c3032 */

Right, I should have said whatever size the instruction type is expecting, 4 or 8 bytes. And for the memory loads (LDC), 1 to 16 bytes.