Reducing register usage of a quaternion class


I am trying to reduce the register usage of the following quaternion class. Especially I need to reduce the register usage of the rotateVector method. From what I have read in the forum, there is no deterministic rule of how the compiler assigns registers. I got stuck with the following code which I tried to optimize via trial and error. Hopefully you can give me some hints how I can further reduce the register usage.

Thanks for your help


the Code:

[codebox]class cuQuat



float4 m_fDat;

__device__ cuQuat(){}

__device__ cuQuat(const float4 &q)


	m_fDat.x = q.x;

	m_fDat.y = q.y;

	m_fDat.z = q.z;

	m_fDat.w = q.w;


__device__ cuQuat(const cuQuat &q)


	m_fDat.x = q.m_fDat.x;

	m_fDat.y = q.m_fDat.y;

	m_fDat.z = q.m_fDat.z;

	m_fDat.w = q.m_fDat.w;


__device__ void set(const volatile float &x, const volatile float &y, const volatile float &z)


	volatile float tmp;

	m_fDat.x = x;

	m_fDat.y = y;

	m_fDat.z = z;


	tmp = 1.0f - x*x - y*y - z*z;

	if(tmp  > 0.000001) 

		m_fDat.w = sqrtf(tmp);



		m_fDat.w = 0.0f;

		tmp = x*x + y*y + z*z;

		tmp = rsqrtf(tmp);

		if(tmp > 0.000001)


			m_fDat.x *= tmp;

			m_fDat.y *= tmp;

			m_fDat.z *= tmp;




// vector v3f has to be normalized, angle is in radian

__device__ void fromAxisAngle(const float3 &v3f, float fAngle)


	fAngle *= 0.5f;

	float sinA = __sinf(fAngle);

	m_fDat.x = v3f.x * sinA;

	m_fDat.y = v3f.y * sinA;

	m_fDat.z = v3f.z * sinA;

	m_fDat.w = __cosf(fAngle);


__device__ void rotateVector(const volatile float &x, 

					   const volatile float &y, 

					   const volatile float &z, 

					   volatile float &xRet,

					   volatile float &yRet,

					   volatile float &zRet)


	volatile float f1, f2, f3, f4;

	f1 = m_fDat.x; f1 *= f1;

	f2 = m_fDat.y; f2 *= f2;

	f3 = m_fDat.z; f3 *= f3;

	f4 = m_fDat.w; f4 *= f4;


	xRet = (f4 + f1 - f2 - f3) * x;

	yRet = (f4 - f1 + f2 - f3) * y;

	zRet = (f4 - f1 - f2 + f3) * z;

	f1 = m_fDat.x;

	f2 = m_fDat.y;

	f3 = m_fDat.z;

	f4 = m_fDat.w;

	f1 *= f2; f3 *= f4;

	f2 = f1 - f3; f2 *= y; xRet += 2.0f * f2;

	f2 = f1 + f3; f2 *= x; yRet += 2.0f * f2;		

	f1 = m_fDat.x;

	f2 = m_fDat.y;

	f3 = m_fDat.z;

	f4 = m_fDat.w;

	f1 *= f3; f2 *= f4;

	f3 = f1 - f2; f3 *= x; zRet += 2.0f * f3;

	f3 = f1 + f2; f3 *= z; xRet += 2.0f * f3;

	f1 = m_fDat.x;

	f2 = m_fDat.y;

	f3 = m_fDat.z;

	f4 = m_fDat.w;

	f1 *= f4; f2 *= f3;

	f3 = f2 - f1; f3 *= z; yRet += 2.0f * f3;

	f3 = f2 + f1; f3 *= y; zRet += 2.0f * f3;


__device__ void conjugate(const cuQuat &q)


	m_fDat.x = -q.m_fDat.x;

	m_fDat.y = -q.m_fDat.y;

	m_fDat.z = -q.m_fDat.z;

	m_fDat.w =  q.m_fDat.w;


__device__  void normalize() 


	float fNorm = rsqrtf(m_fDat.x*m_fDat.x + m_fDat.y*m_fDat.y + 

					    m_fDat.z*m_fDat.z + m_fDat.w*m_fDat.w);

	if(fNorm > 0.000001)


		m_fDat.x *= fNorm;

		m_fDat.y *= fNorm;

		m_fDat.z *= fNorm;

		m_fDat.w *= fNorm;



__device__ friend cuQuat operator*(cuQuat &q1, cuQuat &q2)


	cuQuat res;

	res.m_fDat.x  = q1.m_fDat.x * q2.m_fDat.w;

	res.m_fDat.x += q1.m_fDat.y * q2.m_fDat.z;

	res.m_fDat.x -= q1.m_fDat.z * q2.m_fDat.y;

	res.m_fDat.x += q1.m_fDat.w * q2.m_fDat.x;


	res.m_fDat.y  = q1.m_fDat.y * q2.m_fDat.w;

	res.m_fDat.y -= q1.m_fDat.x * q2.m_fDat.z;

	res.m_fDat.y += q1.m_fDat.z * q2.m_fDat.x; 

	res.m_fDat.y += q1.m_fDat.w * q2.m_fDat.y;


	res.m_fDat.z  = q1.m_fDat.x * q2.m_fDat.y;

	res.m_fDat.z -= q1.m_fDat.y * q2.m_fDat.x;

	res.m_fDat.z += q1.m_fDat.z * q2.m_fDat.w;

	res.m_fDat.z += q1.m_fDat.w * q2.m_fDat.z;


	res.m_fDat.w  = q1.m_fDat.w * q2.m_fDat.w;

	res.m_fDat.w -= q1.m_fDat.x * q2.m_fDat.x;

	res.m_fDat.w -= q1.m_fDat.y * q2.m_fDat.y;

	res.m_fDat.w -= q1.m_fDat.z * q2.m_fDat.z;


	return res;




How many registers are you using?

You don’t even look like you make any accesses to global memory…

Hello, it is true that the methods work on register or shared mem variables only. I use the volatile keyword here as I have read in the forums that it may help to reduce register usage if I force variables into registers. The rotateVector method assigns 7 registers in my code.

7 registers is small.

If you’re not making frequent accesses to global memory, you have to worry about occupancy that much less (which you shouldn’t be worrying much about in the first place).

256 threads per block is a fine amount, which on G200 allows you 64 registers per thread.

Hello I mentioned that wrong, actually my kernel makes excessive use of global memory, I meant it is not used in the quaternion class for computation directly. I will have a 50% occupancy on a g200 as my complete kernel uses 25 registers currently, I would like to bring the overall usage down. and the rotateVector method is a key element in my code. Furthermore I have to wait for my g200 card till christmas ;(, currently I am working with a gtx8800 which gives me only 33% occupancy

50% occupancy on G200, ie 512 threads, is enough to hide latency in even the worst case scenario (constant TLB misses). 33% on G80, ie 256 threads, is enough to completely hide latency in common usage (when the TLB cache does its job).

Do not aim for 100%. It is unfortunate that NVIDIA never articulates this in its documentation.

Unfortunately this would mean I have to wait till christmas to see my code working, this is not a great option, and on the gtx8800 getting the code down to 16 regs has a great impact on performance. Therefore I would appreciate every hint to reduce the register usage.

What do you mean “see your code working”? It is working already. You might say, “working as fast as it could,” but you won’t see that until you get your G200 anyway.

What do you mean by “great impact” on 8800? (What % speedup.) And is this the identical code with simply fewer registers, or some modified code?

And how many GB/s are you doing? It might be that you are already maxing out your bandwidth, so reducing register usage is useless in that case

If I comment the rotation vector code out it gets under 16 registers, i do not know what is optimized out in this situation but my kernel runtime drops from about 27ms to under 10ms. Actually this does not matter a lot as this kernel simply is not useful.

Actually this is a good question, I will check this tomorrow, but because of the performance drop when i get the kernel unter 16 regs i am not yet at or near to bandwith limit.

You don’t know what gets optimized out. Could be a lot. (The compiler is aggressive in identifying dead code.)

I’m confident that you’ll get a very small boost, if any, by running more than 256 threads per MP on G80. A boost that will surely be erased on G200.

If you’re not maxing out your bandwidth, it is either because you’re not issuing perfectly coalesced reads (a huge factor, much more important than occupancy) or because you’re using __syncthreads() which hurts calc-memfetch overlap. In the latter case, running two blocks per MP will help.

I think loads from global memory are also optimized away if the values are never used, so check that also. It might be just that you are reading in less data when commenting things out.

alright i will check for coallesced reads in any case. actually i am using __syncthreads in my kernel a lot, as i have to do a reduce operation which spans a binary tree over shared memory (similar to the reduce examples in the cudpp lib). I thought if i could get the kernel down to 16 regs. In that case I could use two blocks per MP, this is currently limited to 1 block on my gtx8800 due to regs.

Traversing binary trees in shared mem sounds like it could cause massive bank conflicts. This is another very important optimization factor (ie potentially 10x speedup).

It’s much more important than occupancy or running two blocks (potentially 1.5x speedup).

What kind of reduction are you doing on the binary tree? Depending on what kind of operations you are performing, you might want to try to find an old assembly book, or search the web for ‘assembly function optimization branch’. There are lots of neat old assembly tricks that use bit shifting, logic functions and so forth to remove branches from the code, which is a big deal on some architectures. If you’re doing something like "if a > b, c = a, else c = b’ you might be able to remove the branch there with the optimization and save yourself some warp divergence.

Just a thought…

Here’s an example (in x86 assembly…but you could easily redo this in C for use with CUDA): [url=“32/64-Bit 80x86 Assembly Language Architecture - James Leiterman - Google Books”][/url]

EDIT: Does anyone know if nVidia’s compiler already does these type of optimizations for built-in functions like max() and so forth? Since CUDA performance takes a big hit from divergent warps, perhaps they could include a header with some optimized macros for common functions in the 2.1 release of CUDA. It wouldn’t take but a minute to put together, and it could make a big performance difference for some people that are not used to the architecture.

The statement “if a > b, c = a, else c = b” does not cause divergence, it results in predicated instructions. These are similar to masks and bitshifts in principle (which themselves are similar to divergence, in principle), but the implementation has low overhead.

No need, MIN and MAX are native PTX opcodes for both integer and floating point.

Christoph John - Out of curiosity, what sort of system does this class belong to? (If I were to guess, you could be doing some collision detection like me)


thanks for the hint,

but I am just adding values together, I think due to the offset there should not be bank conflicts with this code.


for(unsigned offset = BLOCK_DIM_X>>1; offset > 0; offset >>= 1)


// ensure last summing cycle has been finished for all threads in block


if(threadIdx.x < offset)

	LocalBlock[threadIdx.x] += LocalBlock[threadIdx.x + offset];