Reducing register usage of a quaternion class

Christoph_John · December 11, 2008, 11:36am

Hello, I am currently implementing a particle filter system for upper body tracking in a 3D teleconferencing environment. The particle evaluation should be processed on the GPU.

In my setup I compute a 3D volumetric reconstruction from several camera images on a GPU, which is used as input for the particle filter system.

You can find some example videos on my homepage at

http://www.mi.fh-wiesbaden.de/~cjohn/research.html

The videos just show reconstructions of skin-colored regions. Currently, however, I do this reconstruction on foreground and skin-colored objects, thus reconstructing the whole person.

I have a kinematic model of the upper body of a person, with a quadric attached to each bone. Each particle of the particle filter describes a possible kinematic configuration of the model. The task at hand is to find the body configuration with the highest likelihood, which is given by the body configuration resulting in the highest overlap of body model and occupied reconstructed volume. (It's slightly more involved, but this is the general idea.)
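As a rough illustration of the overlap measure described above, the following CUDA sketch counts how many occupied voxel centres fall inside one quadric per particle. Everything in it is an assumption made for illustration: the `Quadric` layout, the names `insideQuadric`, `countOverlap` and `overlapOnGpu`, the sign convention, and the simplification to a single world-space quadric per particle. The real system attaches a quadric to every bone and is, as noted, more involved.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical layout: symmetric 4x4 quadric matrix, row-major, already in world space.
// Assumed convention: p = (x, y, z, 1) is inside when p^T Q p <= 0.
struct Quadric { float q[16]; };

__host__ __device__ inline bool insideQuadric(const Quadric& Q, float x, float y, float z)
{
    const float p[4] = { x, y, z, 1.0f };
    float value = 0.0f;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            value += p[r] * Q.q[4 * r + c] * p[c];
    return value <= 0.0f;
}

// One block per particle; the threads of a block stride over the occupied voxel centres.
__global__ void countOverlap(const Quadric* quadrics, const float3* voxels,
                             int numVoxels, int* overlap)
{
    const Quadric Q = quadrics[blockIdx.x];
    int local = 0;
    for (int i = threadIdx.x; i < numVoxels; i += blockDim.x)
        if (insideQuadric(Q, voxels[i].x, voxels[i].y, voxels[i].z))
            ++local;
    // Simple rather than fast: one atomic per thread instead of a shared-memory sum.
    atomicAdd(&overlap[blockIdx.x], local);
}

// Hypothetical host wrapper (error checking omitted): one overlap count per particle.
std::vector<int> overlapOnGpu(const std::vector<Quadric>& quadrics,
                              const std::vector<float3>& voxels)
{
    Quadric* dQ = nullptr; float3* dV = nullptr; int* dOut = nullptr;
    cudaMalloc(&dQ, quadrics.size() * sizeof(Quadric));
    cudaMalloc(&dV, voxels.size() * sizeof(float3));
    cudaMalloc(&dOut, quadrics.size() * sizeof(int));
    cudaMemcpy(dQ, quadrics.data(), quadrics.size() * sizeof(Quadric), cudaMemcpyHostToDevice);
    cudaMemcpy(dV, voxels.data(), voxels.size() * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, quadrics.size() * sizeof(int));

    countOverlap<<<static_cast<unsigned>(quadrics.size()), 256>>>(
        dQ, dV, static_cast<int>(voxels.size()), dOut);

    std::vector<int> result(quadrics.size());
    cudaMemcpy(result.data(), dOut, result.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dQ); cudaFree(dV); cudaFree(dOut);
    return result;
}
```

With this convention a unit sphere at the origin is the diagonal matrix (1, 1, 1, -1), so the point (0.5, 0, 0) evaluates to 0.25 - 1 < 0 and counts as inside. The particle with the largest count would then be the most likely configuration under this simplified measure.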

alex_dubinsky · December 11, 2008, 6:43pm

Hello, thanks for the hint, but I am just adding values together. I think that, due to the offset, there should not be bank conflicts with this code.

[codebox]
for(unsigned offset = BLOCK_DIM_X>>1; offset > 0; offset >>= 1)
{
	// ensure last summing cycle has been finished for all threads in block
	__syncthreads();

	if(threadIdx.x < offset)
		LocalBlock[threadIdx.x] += LocalBlock[threadIdx.x + offset];
}
[/codebox]
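For context, the loop above might sit in a kernel like the following sketch. The kernel name `blockSum`, the host helper `sumOnGpu`, the value of `BLOCK_DIM_X` and the input layout are assumptions for illustration, not the poster's actual code, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_DIM_X 256  // assumed block size; this loop needs a power of two

// Each block sums BLOCK_DIM_X consecutive floats and writes one partial sum.
__global__ void blockSum(const float* in, float* blockSums)
{
    __shared__ float LocalBlock[BLOCK_DIM_X];
    LocalBlock[threadIdx.x] = in[blockIdx.x * BLOCK_DIM_X + threadIdx.x];

    for (unsigned offset = BLOCK_DIM_X >> 1; offset > 0; offset >>= 1)
    {
        // ensure last summing cycle (or the initial load) has finished for all threads
        __syncthreads();
        if (threadIdx.x < offset)
            LocalBlock[threadIdx.x] += LocalBlock[threadIdx.x + offset];
    }

    // Thread 0 performed the final addition itself, so no further barrier is needed.
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = LocalBlock[0];
}

// Hypothetical host helper: values.size() must be a multiple of BLOCK_DIM_X.
float sumOnGpu(const std::vector<float>& values)
{
    const int numBlocks = static_cast<int>(values.size() / BLOCK_DIM_X);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, values.size() * sizeof(float));
    cudaMalloc(&dOut, numBlocks * sizeof(float));
    cudaMemcpy(dIn, values.data(), values.size() * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<numBlocks, BLOCK_DIM_X>>>(dIn, dOut);

    std::vector<float> partial(numBlocks);
    cudaMemcpy(partial.data(), dOut, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    // The few per-block results are added on the host.
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

In each step the active threads read from indices at or above `offset` and write to indices below it, so a single barrier per step is enough to keep reads and writes of different steps apart.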

Right, no bank conflicts. Is it possible to perform several of these reductions in parallel? This would be more efficient. Possibly this would also be somewhat better (for 256-thread blocks):

[codebox]
for(unsigned offset = BLOCK_DIM_X>>2; offset > 0; offset >>= 2)
{
	// ensure last summing cycle has been finished for all threads in block
	__syncthreads();

	if(threadIdx.x < offset)
		LocalBlock[threadIdx.x] += LocalBlock[threadIdx.x + offset] + LocalBlock[threadIdx.x + offset*2] + LocalBlock[threadIdx.x + offset*3];
}
[/codebox]
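A minimal sketch of this four-way variant as a complete single-block kernel follows. The names `blockSumRadix4`, `sumRadix4OnGpu` and `isPowerOfFour` are made up for illustration, and error checking is omitted. For 256 threads the loop runs 4 rounds (offsets 64, 16, 4, 1) instead of the 8 rounds of the halving version, so it needs half as many barriers; whether that is actually faster depends on the hardware and would have to be measured.

```cuda
#include <cuda_runtime.h>

#define BLOCK_DIM_X 256  // assumed block size

// The offset is divided by four each round, so the block size must be a power of four.
constexpr bool isPowerOfFour(unsigned n)
{
    return n == 1 ? true : (n >= 4 && n % 4 == 0 ? isPowerOfFour(n / 4) : false);
}
static_assert(isPowerOfFour(BLOCK_DIM_X), "radix-4 reduction needs 4, 16, 64, 256, ...");

// Single block: sums BLOCK_DIM_X floats from `in` into *out.
__global__ void blockSumRadix4(const float* in, float* out)
{
    __shared__ float LocalBlock[BLOCK_DIM_X];
    LocalBlock[threadIdx.x] = in[threadIdx.x];

    for (unsigned offset = BLOCK_DIM_X >> 2; offset > 0; offset >>= 2)
    {
        // ensure last summing cycle has been finished for all threads in block
        __syncthreads();
        if (threadIdx.x < offset)
            LocalBlock[threadIdx.x] += LocalBlock[threadIdx.x + offset]
                                     + LocalBlock[threadIdx.x + offset * 2]
                                     + LocalBlock[threadIdx.x + offset * 3];
    }

    if (threadIdx.x == 0)
        *out = LocalBlock[0];
}

// Hypothetical host helper: hostValues must point to exactly BLOCK_DIM_X floats.
float sumRadix4OnGpu(const float* hostValues)
{
    float *dIn = nullptr, *dOut = nullptr, result = 0.0f;
    cudaMalloc(&dIn, BLOCK_DIM_X * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, hostValues, BLOCK_DIM_X * sizeof(float), cudaMemcpyHostToDevice);
    blockSumRadix4<<<1, BLOCK_DIM_X>>>(dIn, dOut);
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The `static_assert` guards against a silent failure: with 128 threads, for example, the offsets would be 32, 8 and 2, and the loop would stop with two partial sums left in `LocalBlock[0]` and `LocalBlock[1]`.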

This is fascinating. I haven’t studied math as much as I should. Is there a good book you can recommend to learn about such things? (Quaternion math and its applications) Something intuitive and not too dry.

Christoph_John · December 12, 2008, 12:11pm

Hello, for my task at hand this is the most parallel reduction possible, as I only have 256 registers with data to sum up. For quaternion math, a good starting point is

http://www.euclideanspace.com/maths/index.htm

Fugl · December 12, 2008, 1:09pm

Very interesting topic. I’ve seen a few videos of it before, and have always been impressed by the stability of the tracking.

I can see that you are using axis-aligned bounding boxes as bounding volumes - wouldn’t arbitrarily oriented bounding boxes provide you with even more information? Or are you required to keep the degrees of freedom down to a reasonable level due to time constraints?

E.D_Riedijk · December 12, 2008, 1:49pm

I think the bounding boxes are just used as a fast way of generating constraints for the particles (if (x < a || x > b) is very easy to code ;)).

The particle cloud probably contains much more information.
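To make the "easy to code" point concrete, here is a small sketch of such an axis-aligned constraint test. The helper name `insideBox` and the use of `float3` for positions are assumptions; the thread does not show how the constraints are actually implemented.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: true when p lies inside the axis-aligned box [lo, hi].
// Callable from both host and device code.
__host__ __device__ inline bool insideBox(float3 p, float3 lo, float3 hi)
{
    const bool outside = p.x < lo.x || p.x > hi.x
                      || p.y < lo.y || p.y > hi.y
                      || p.z < lo.z || p.z > hi.z;
    return !outside;
}
```

An arbitrarily oriented box would need the point rotated into the box's own frame before the same six comparisons, which is the extra cost the axis-aligned version avoids.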

alex_dubinsky · December 12, 2008, 6:09pm

That is a wonderful site. Thanks!

(If anyone has any more, please post. Esp books.)

Christoph_John · December 12, 2008, 6:30pm

Hello, the hand tracking video just shows some first tests to evaluate image processing algorithms that needed to be aware of foreground objects, which in that case were hands. Currently I am developing an upper body tracking system which has 14 DOF. I will post some videos once it is working (could take a while).

Christoph_John · December 12, 2008, 6:35pm

Actually the particle cloud represents a probability distribution of the current state of hands and head volume. What you see in the videos are 3 interacting particle filters, each with 200 particles (300 for the head); only the modes of the probability distribution are drawn (red boxes). For this, each particle consists of its pose and recent history (Brownian motion and first- and second-order autoregressive models).
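The motion models mentioned here can be written as one generic second-order autoregressive step per pose component. The sketch below is only an assumed form: the struct `ParticleHistory`, the function `predictAr2` and the coefficient names are hypothetical, and the actual models and coefficients used in the videos are not given in this thread.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-component history: the two most recent values of one pose parameter.
struct ParticleHistory
{
    float x1;  // value at time t-1
    float x2;  // value at time t-2
};

// Assumed generic AR(2) step: x_t = a1 * x_{t-1} + a2 * x_{t-2} + sigma * noise.
// `noise` is a standard normal sample supplied by the caller (for example from cuRAND).
__host__ __device__ inline float predictAr2(const ParticleHistory& h,
                                            float a1, float a2,
                                            float sigma, float noise)
{
    return a1 * h.x1 + a2 * h.x2 + sigma * noise;
}
```

With a1 = 1 and a2 = 0 this reduces to a random walk, the discrete counterpart of Brownian motion; with a1 = 2 and a2 = -1 it extrapolates at constant velocity, since x_t = x_{t-1} + (x_{t-1} - x_{t-2}) plus noise.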
