const __restrict__ read faster than __constant__ ?

More than once I have noticed that when I have a fixed-size read-only array which is small enough to fit in constant memory, it instead seems to be read faster when accessed via a const T* __restrict__ Arr kernel parameter.

This is for the Tesla K20c… Would there be any reason why this is the case?

At a minimum, it is not faster to use __constant__ T Arr[] for the same data, at least not on my machine. Odd.

Are your array accesses uniform across the warp, that is, do all threads in a warp access the same address? The constant cache has a single port that broadcasts data to each thread in a warp. If more than one address is accessed across the warp, the cache access needs to be repeated as many times as there are distinct addresses. In other words, these accesses become serialized.
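To illustrate the difference, here is a contrived sketch (made-up names, not your code): the first kernel makes a uniform access that is serviced by a single broadcast, while the second forces the constant cache to replay the access for every distinct address in the warp.

__constant__ long long cPow[18];   // e.g. powers of 7, filled via cudaMemcpyToSymbol

__global__ void uniformRead(long long *out, int idx)
{
	// all threads in a warp read the same element: one broadcast per access
	out[blockIdx.x * blockDim.x + threadIdx.x] = cPow[idx];
}

__global__ void divergentRead(long long *out)
{
	// threads in a warp read different elements: the access is replayed
	// once per distinct address, serializing the warp
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	out[tid] = cPow[tid % 18];
}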

Use of const __restrict__ pointers facilitates (does not guarantee!) the use of the LDG instruction that was introduced in Kepler. The LDG instruction is like a global load, except that data is transported through the texture cache instead of the regular L1/L2 cache hierarchy. Since the texture cache is not coherent, only data that is read-only and is not modified for the entire duration of the kernel execution can be read via LDG.
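For example (hypothetical kernel, just to show the decoration), a read-only parameter declared like this makes the load below eligible for LDG on sm_35:

__global__ void readPowers(const long long * __restrict__ pow7, long long *out, int n)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	if (tid < n)
		out[tid] = pow7[tid % 18];   // compiler may emit this load as LDG
}

On sm_35 one can also request the texture path explicitly with the __ldg() intrinsic, e.g. __ldg(&pow7[tid % 18]).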

You can use cuobjdump --dump-sass to check for the presence of LDG instructions in the machine code (SASS).

Dear aduC,
May I second njuffa. In my experience, constant memory is in practice surprisingly slow and difficult to use. See this previous discussion:
https://devtalk.nvidia.com/default/topic/479633/constant-memory-64kb-only-8kb-usable-/
Bill
http://www.cs.ucl.ac.uk/staff/W.Langdon/

In my current situation there is a small long long array which caches the powers of 7 up to 7^17. Each thread reads from this array multiple times, though each thread reads a subset of the array that is not the same across the warp.

This specific code takes hours to run, so I will test both methods and report my results.

You might want to consider yet another variant where the required powers are computed on the fly. With the maximum power being 17 it should take at most five steps. It is hard to predict whether that would be faster (64-bit integers are emulated), but a general observation is that with modern GPUs, arithmetic is often “too cheap to meter” compared to memory accesses, even when cached.

My first run actually only took 47 minutes, which was less than I expected given the huge number of calculations.

The problem I solved was one where you have a 4x4 grid with numbers from 0 to 6 inclusive. The goal is to determine which board arrangement results in the most rows, columns, and diagonals that sum to exactly ten.

So there are 7^16 possible arrangements; for each I have to generate the board in local memory, determine its score, track the warp-best result, then the block-best result, then reduce, etc.
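Roughly, the per-thread evaluation works like this (a simplified sketch, not the actual code in the pastebin below; the real version decodes the arrangement using the cached power-of-7 array rather than repeated division):

__device__ int scoreBoard(long long idx)
{
	int b[16];
	for (int i = 0; i < 16; ++i) {   // base-7 decode of the arrangement index
		b[i] = (int)(idx % 7LL);
		idx /= 7LL;
	}
	int score = 0;
	for (int r = 0; r < 4; ++r)      // rows
		score += (b[4*r] + b[4*r+1] + b[4*r+2] + b[4*r+3] == 10);
	for (int c = 0; c < 4; ++c)      // columns
		score += (b[c] + b[c+4] + b[c+8] + b[c+12] == 10);
	score += (b[0] + b[5] + b[10] + b[15] == 10);   // main diagonal
	score += (b[3] + b[6] + b[9]  + b[12] == 10);   // anti-diagonal
	return score;
}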

This was my result board:

0 0 5 5
1 6 0 3
4 0 4 2
5 4 1 0

And here is a pastebin of the beta-version first try code:

http://pastebin.com/vETn9cPY

Note: it will only stay posted for 24 hours.

I used techniques discussed by many of the members of this board, and think it is correct code, but who knows?

My next version will do exactly as njuffa recommends: calculate the powers of 7 in the thread rather than reading them from the cached array.

Any recommendations for further speedup would be appreciated, as this is my first attempt (just got the problem yesterday).

Are all load/store indices into “int Arr[16];” constants? If so then is NVCC correctly representing the entire array in 16 registers or is it spilling to local memory (bad!)?

I assume what was meant is “are all indexes into this array compile-time constants”, which is a necessary but not sufficient condition for a local array to be mapped to registers. Another consideration is the size of such an array. One could reasonably assume that sixteen 64-bit long longs (equivalent to 32 registers) would be OK on sm_35 in that regard, but I do not know the details of the heuristic used by the compiler. Best to check the machine code, as allanmac suggests.

Side remark: A local array stored in local memory is not the same as spilling. The compiler may decide to leave a local array in local memory, pulling it into registers is considered an optimization. If however, it first decides to pull the data into registers, then later finds that register pressure is too high and starts temporarily storing some of the data back into local memory, that would be spilling. For an array of read-only data, the difference should be readily discernible: Spilling would involve both local stores and local loads, while there would be only local loads if the array is simply allocated in local memory for the duration of the kernel.

Yes, I understand it’s not the same as spilling but I didn’t have a better word for, “… variables that you really think should be held in registers but are being unexpectedly demoted to local memory by the compiler.” :)

I would claim perspective is important: The default storage for a local array is local memory, where “local” means “thread-local”. The compiler may, as an optimization, promote the local array to register storage.

Whether such promotion takes place is subject to certain requirements (e.g. indexes must be compile-time constants) as well as heuristics (e.g. register pressure considerations, which in turn affects occupancy and thus performance). Like any set of heuristics, they may not lead to optimal results in each and every case. If there is strong evidence that the heuristics deliver poor results across a considerable portion of the code universe, they are subject to review and possible tweaking. This is where relevant bug reports come into play.
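To illustrate with a contrived example (hypothetical kernels, not from any real code): in the first case every index into the local array is a compile-time constant after unrolling, so the array can be promoted to registers; in the second case a runtime index keeps it in thread-local memory.

__global__ void constantIndices(long long *out)
{
	long long p[4];
	#pragma unroll
	for (int i = 0; i < 4; ++i)
		p[i] = 7LL * (i + 1);          // unrolled: all indices compile-time constant
	out[threadIdx.x] = p[0] + p[1] + p[2] + p[3];   // eligible for register promotion
}

__global__ void runtimeIndex(long long *out, int k)
{
	long long p[4];
	#pragma unroll
	for (int i = 0; i < 4; ++i)
		p[i] = 7LL * (i + 1);
	out[threadIdx.x] = p[k & 3];       // runtime index: p[] stays in local memory
}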

Just trying to figure out the best/fastest way to compute the powers of 7 (inside the kernel) for a long long type. Should I use the double function pow(double, double) from the CUDA math library and cast to long long, or might that cause rounding issues?

Or (since you said at most five steps) maybe this?

__device__ long long BigPow(long long num, int exp){
	// Binary (square-and-multiply) exponentiation: at most five iterations for exp <= 17.
	long long ret = 1LL;
	while (exp) {
		if (exp & 1) ret *= num;   // fold in the current bit of the exponent
		exp >>= 1;
		num *= num;                // square the base for the next bit
	}
	return ret;
}

I would do it like your BigPow() function, but using unsigned long long num and unsigned int exp. You might want to check whether guarding the squaring of num by if (exp) provides additional performance as this skips a final, unnecessary squaring of num.
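Something along these lines (untested sketch of the suggested changes):

__device__ unsigned long long BigPowU(unsigned long long num, unsigned int exp)
{
	unsigned long long ret = 1ULL;
	while (exp) {
		if (exp & 1) ret *= num;
		exp >>= 1;
		if (exp) num *= num;   // guard skips the final, unneeded squaring
	}
	return ret;
}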

On second thought, given that your platform is a K20c with fast double-precision support, you could also try calling pow(double, int), as results up to 7**17 are exactly representable in a double. This approach adds cost for conversion to and from long long, though. pow(double, double) has too much overhead (many special cases need to be sorted out) and the computation itself is too expensive (high accuracy across the full input domain requires elaborate computation) to be worth considering.

After some testing I found that storing the power array in constant memory is the fastest for this problem, at about 43 minutes of running time to find the best board. With that same array passed as const __restrict__ it takes about 47 minutes.

When I tried computing the powers internally it took longer than an hour, so I gave up on that approach.

Will try computing the powers internally via double and casting to long long next. It is kind of an interesting problem, so I am going to play with it and see what works best.

Thanks.

Not really knowing anything about the code I would guess much in terms of performance depends on the distribution of the exponent values used in exponentiation, and so experimentation is a good way to find the best solution for a particular GPU. The optimal variant could differ between GPU families, so ultimately you may want to contemplate some sort of auto-tuning framework.

Someone had contacted me online to write a CUDA version of the Matrix-Sum (4x4) game, and that is what led to the question in this thread.

I actually posted this project to Github:

https://github.com/OlegKonings/CUDA_Matrix_Sum_Game

and indeed found that for this example using constant memory was the fastest option when compared to using const __restrict__ or manually calculating each power-of-7 value in each thread.

What is a mystery is why this problem with 33,232,930,569,601 possible configurations only takes 44.14 minutes to calculate, while my permutation code for 16!, with 20,922,789,888,000 possible configurations, takes 28 hours to complete?

hmm… I know the ‘evaluation’ step for the permutation version is a bit longer, but still…

Did you try putting the array into shared memory? The only annoyance is initializing it at the start of the block.
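Something like this (sketch; assumes the table holds the 18 entries 7^0 through 7^17 and the block has at least 18 threads):

__global__ void kernelWithSharedPowers(const long long *gPow, long long *out)
{
	__shared__ long long sPow[18];
	if (threadIdx.x < 18)
		sPow[threadIdx.x] = gPow[threadIdx.x];   // one-time init per block
	__syncthreads();

	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	out[tid] = sPow[tid % 18];   // actual per-thread work would go here
}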

Will try that today, since it actually is a small array.

Wonder if Nvidia ever plans on increasing the amount of constant memory available. A great deal of my work ends up using constant memory and I would love to have twice as much…