const __restrict__ read faster than __constant__ ?

More than once I have noticed that when I have a fixed-size read-only array which is small enough to fit in constant memory, it instead seems to be read faster when accessed via a const T* __restrict__ Arr kernel parameter.

This is for the Tesla K20c… Would there be any reason why this is the case?

At a minimum, it is not faster to use __constant__ T Arr[] for the same data, at least not on my machine. Odd.

Are your array accesses uniform across the warp, that is, do all threads in a warp access the same address? The constant cache has a single port that broadcasts data to each thread in a warp. If more than one address is accessed across the warp, the cache access needs to be repeated as many times as there are distinct addresses. In other words, these accesses become serialized.
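To illustrate the difference, here is a contrived sketch (made-up names, not your code): the first kernel makes a uniform access that is serviced by a single broadcast, while the second forces the constant cache to replay the access for every distinct address in the warp.

__constant__ long long cPow[18];   // e.g. powers of 7, filled via cudaMemcpyToSymbol

__global__ void uniformRead(long long *out, int idx)
{
	// all threads in a warp read the same element: one broadcast per access
	out[blockIdx.x * blockDim.x + threadIdx.x] = cPow[idx];
}

__global__ void divergentRead(long long *out)
{
	// threads in a warp read different elements: the access is replayed
	// once per distinct address, serializing the warp
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	out[tid] = cPow[tid % 18];
}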

Use of const __restrict__ pointers facilitates (does not guarantee!) the use of the LDG instruction that was introduced in Kepler. The LDG instruction is like a global load, except that data is transported through the texture cache instead of the regular L1/L2 cache hierarchy. Since the texture cache is not coherent, only data that is read-only and is not modified for the entire duration of the kernel execution can be read via LDG.
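For example (hypothetical kernel, just to show the decoration), a read-only parameter declared like this makes the load below eligible for LDG on sm_35:

__global__ void readPowers(const long long * __restrict__ pow7, long long *out, int n)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	if (tid < n)
		out[tid] = pow7[tid % 18];   // compiler may emit this load as LDG
}

On sm_35 one can also request the texture path explicitly with the __ldg() intrinsic, e.g. __ldg(&pow7[tid % 18]).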

You can use cuobjdump --dump-sass to check for the presence of LDG instructions in the machine code (SASS).

Dear aduC,
May I second njuffa. In my experience, constant memory is in practice surprisingly slow and difficult to use. See this previous discussion:
https://devtalk.nvidia.com/default/topic/479633/constant-memory-64kb-only-8kb-usable-/
Bill
http://www.cs.ucl.ac.uk/staff/W.Langdon/

In my current situation there is a small long long array which caches the powers of 7 up to 7^17. Each thread reads from this array multiple times, though each thread reads a subset of the array that is not the same across the warp.

This specific code takes hours to run, so I will test both methods and report my results.

You might want to consider yet another variant where the required powers are computed on the fly. With the maximum power being 17 it should take at most five steps. It is hard to predict whether that would be faster (64-bit integers are emulated), but a general observation is that with modern GPUs, arithmetic is often “too cheap to meter” compared to memory accesses, even when cached.

My first run actually only took 47 minutes, which was less than I expected given the huge number of calculations.

The problem I solved was one where you have a 4x4 grid with numbers from 0 to 6 inclusive. The goal is to determine which board arrangement results in the most rows, columns, and diagonals that sum to exactly ten.

So there are 7^16 possible arrangements; for each I have to generate the board in local memory, determine its score, track the warp-best result, then the block-best result, then reduce, etc.
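Roughly, the per-thread evaluation works like this (a simplified sketch, not the actual code in the pastebin below; the real version decodes the arrangement using the cached power-of-7 array rather than repeated division):

__device__ int scoreBoard(long long idx)
{
	int b[16];
	for (int i = 0; i < 16; ++i) {   // base-7 decode of the arrangement index
		b[i] = (int)(idx % 7LL);
		idx /= 7LL;
	}
	int score = 0;
	for (int r = 0; r < 4; ++r)      // rows
		score += (b[4*r] + b[4*r+1] + b[4*r+2] + b[4*r+3] == 10);
	for (int c = 0; c < 4; ++c)      // columns
		score += (b[c] + b[c+4] + b[c+8] + b[c+12] == 10);
	score += (b[0] + b[5] + b[10] + b[15] == 10);   // main diagonal
	score += (b[3] + b[6] + b[9]  + b[12] == 10);   // anti-diagonal
	return score;
}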

This was my result board:

0 0 5 5
1 6 0 3
4 0 4 2
5 4 1 0

And here is a pastebin of the beta-version first try code:

http://pastebin.com/vETn9cPY

Note: it will only stay posted for 24 hours.

I used techniques discussed by many of the members of this board, and think it is correct code, but who knows?

My next version will do exactly as njuffa recommends: calculate the powers of 7 in the thread rather than reading them from the cached array.

Any recommendations for further speedup would be appreciated, as this is my first attempt (just got the problem yesterday).

Are all load/store indices into “int Arr[16];” constants? If so then is NVCC correctly representing the entire array in 16 registers or is it spilling to local memory (bad!)?

I assume what was meant is “are all indexes into this array compile-time constants”, which is a necessary but not sufficient condition for a local array to be mapped to registers. Another consideration is the size of such an array. One could reasonably assume that sixteen 64-bit long longs (equivalent to 32 registers) would be OK on sm_35 in that regard, but I do not know the details of the heuristic used by the compiler. Best to check the machine code, as allanmac suggests.

Side remark: A local array stored in local memory is not the same as spilling. The compiler may decide to leave a local array in local memory, pulling it into registers is considered an optimization. If however, it first decides to pull the data into registers, then later finds that register pressure is too high and starts temporarily storing some of the data back into local memory, that would be spilling. For an array of read-only data, the difference should be readily discernible: Spilling would involve both local stores and local loads, while there would be only local loads if the array is simply allocated in local memory for the duration of the kernel.

Yes, I understand it’s not the same as spilling but I didn’t have a better word for, “… variables that you really think should be held in registers but are being unexpectedly demoted to local memory by the compiler.” :)

I would claim perspective is important: The default storage for a local array is local memory, where “local” means “thread-local”. The compiler may, as an optimization, promote the local array to register storage.

Whether such promotion takes place is subject to certain requirements (e.g. indexes must be compile-time constants) as well as heuristics (e.g. register pressure considerations, which in turn affects occupancy and thus performance). Like any set of heuristics, they may not lead to optimal results in each and every case. If there is strong evidence that the heuristics deliver poor results across a considerable portion of the code universe, they are subject to review and possible tweaking. This is where relevant bug reports come into play.
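To illustrate with a contrived example (hypothetical kernels, not from any real code): in the first case every index into the local array is a compile-time constant after unrolling, so the array can be promoted to registers; in the second case a runtime index keeps it in thread-local memory.

__global__ void constantIndices(long long *out)
{
	long long p[4];
	#pragma unroll
	for (int i = 0; i < 4; ++i)
		p[i] = 7LL * (i + 1);          // unrolled: all indices compile-time constant
	out[threadIdx.x] = p[0] + p[1] + p[2] + p[3];   // eligible for register promotion
}

__global__ void runtimeIndex(long long *out, int k)
{
	long long p[4];
	#pragma unroll
	for (int i = 0; i < 4; ++i)
		p[i] = 7LL * (i + 1);
	out[threadIdx.x] = p[k & 3];       // runtime index: p[] stays in local memory
}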

Just trying to figure out the best/fastest way to compute the powers of 7 (inside the kernel) for a long long type. Should I use the double function pow(double, double) from the CUDA math library and cast to long long, or might that cause rounding issues?

Or (since you said at most five steps) maybe this?

__device__ long long BigPow(long long num, int exp){
	// Binary (square-and-multiply) exponentiation: at most five iterations for exp <= 17.
	long long ret = 1LL;
	while (exp) {
		if (exp & 1) ret *= num;   // fold in the current bit of the exponent
		exp >>= 1;
		num *= num;                // square the base for the next bit
	}
	return ret;
}

I would do it like your BigPow() function, but using unsigned long long num and unsigned int exp. You might want to check whether guarding the squaring of num by if (exp) provides additional performance as this skips a final, unnecessary squaring of num.
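Something along these lines (untested sketch of the suggested changes):

__device__ unsigned long long BigPowU(unsigned long long num, unsigned int exp)
{
	unsigned long long ret = 1ULL;
	while (exp) {
		if (exp & 1) ret *= num;
		exp >>= 1;
		if (exp) num *= num;   // guard skips the final, unneeded squaring
	}
	return ret;
}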

On second thought, given that your platform is a K20c with fast double-precision support, you could also try calling pow(double, int), as results up to 7**17 are exactly representable in a double. This approach adds cost for conversion to and from long long, though. pow(double, double) has too much overhead (many special cases need to be sorted out) and the computation itself is too expensive (high accuracy across the full input domain requires elaborate computation) to be worth considering.

After some testing I found that storing the power array in constant memory is the fastest for this problem, at about 43 minutes of running time to find the best board. With that same array passed as const __restrict__ it takes about 47 minutes.

When I tried computing the powers internally it took longer than an hour, so I gave up on that approach.

Will try computing the powers internally via double and casting to long long next. It is kind of an interesting problem, so I am going to play with it and see what works best.

Thanks.

Not really knowing anything about the code I would guess much in terms of performance depends on the distribution of the exponent values used in exponentiation, and so experimentation is a good way to find the best solution for a particular GPU. The optimal variant could differ between GPU families, so ultimately you may want to contemplate some sort of auto-tuning framework.

Someone had contacted me online to write a CUDA version of the Matrix-Sum (4x4) game, and that is what led to the question in this thread.

I actually posted this project to Github:

https://github.com/OlegKonings/CUDA_Matrix_Sum_Game

and indeed found that for this example using constant memory was the fastest option when compared to using const __restrict__ or manually calculating each power-of-7 value in each thread.

What is a mystery is why this problem with 33,232,930,569,601 possible configurations only takes 44.14 minutes to calculate, while my permutation code for 16!, with 20,922,789,888,000 possible configurations, takes 28 hours to complete?

hmm… I know the ‘evaluation’ step for the permutation version is a bit longer, but still…

Did you try putting the array into shared memory? The only annoyance is initializing it at the start of the block.
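Something like this (sketch; assumes the table holds the 18 entries 7^0 through 7^17 and the block has at least 18 threads):

__global__ void kernelWithSharedPowers(const long long *gPow, long long *out)
{
	__shared__ long long sPow[18];
	if (threadIdx.x < 18)
		sPow[threadIdx.x] = gPow[threadIdx.x];   // one-time init per block
	__syncthreads();

	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	out[tid] = sPow[tid % 18];   // actual per-thread work would go here
}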

Will try that today, since it actually is a small array.

Wonder if Nvidia ever plans on increasing the amount of constant memory available. A great deal of my work ends up using constant memory and I would love to have twice as much…