Game Physics experiment with CUDA

Hello all

As a school project I’ve made an experiment - an attempt to speed up game physics calculations using CUDA.
I’d like to share the results and get feedback about this stuff.

In this experiment I used the code of the Cyclone Physics Engine, a physics engine made for learning purposes. It is the
accompanying code for Ian Millington’s book Game Physics Engine Development.

Info can be found here:

I’ve converted the rigid body integration function and the collision detection functions to CUDA kernels and compared the
running times of the functions on GPU and CPU.

In the first test I compared the rigid body integration performance. The workflow involved copying forces and torques
to the device in every frame (two arrays, 4 floats in each array per body), running the kernel, and copying the transform matrices
back to the host for rendering the bodies with GLUT (16 floats per body, single array of matrices).
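
A minimal sketch of that per-frame loop in CUDA might look like this. It is not the project's code: the names `integrateBodies` and `stepFrame`, the float4 layouts (inverse mass kept in `pos[i].w`) and the integration scheme are assumptions for illustration. To keep it short, only linear motion is integrated, so the torque array is uploaded to show the copy pattern but rotation is left out, and each matrix is an identity rotation plus a translation.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per body, semi-implicit Euler, linear motion only.
// pos[i].w is assumed to hold the inverse mass. Rotation is omitted, so torque is unused.
__global__ void integrateBodies(int n, float dt,
                                const float4* force, const float4* torque,
                                float4* pos, float4* vel, float* transform)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    (void)torque;  // a fuller version would integrate orientation from this

    float4 p = pos[i];
    float4 v = vel[i];
    float4 f = force[i];
    float invMass = p.w;

    // Update velocity from force first, then position from the new velocity.
    v.x += f.x * invMass * dt;  v.y += f.y * invMass * dt;  v.z += f.z * invMass * dt;
    p.x += v.x * dt;            p.y += v.y * dt;            p.z += v.z * dt;
    pos[i] = p;
    vel[i] = v;

    // 16 floats per body, column-major (OpenGL convention): identity plus translation.
    float* m = transform + 16 * i;
    for (int k = 0; k < 16; ++k) m[k] = (k % 5 == 0) ? 1.0f : 0.0f;
    m[12] = p.x;  m[13] = p.y;  m[14] = p.z;
}

// One frame: upload forces and torques, run the kernel, download the matrices.
// All d* pointers are device buffers sized for n bodies; h* pointers are host buffers.
cudaError_t stepFrame(int n, float dt,
                      const float4* hForce, const float4* hTorque,
                      float4* dForce, float4* dTorque,
                      float4* dPos, float4* dVel,
                      float* dTransform, float* hTransform)
{
    assert(n > 0);
    size_t vecBytes = n * sizeof(float4);
    cudaError_t err = cudaMemcpy(dForce, hForce, vecBytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(dTorque, hTorque, vecBytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) return err;

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    integrateBodies<<<blocks, threads>>>(n, dt, dForce, dTorque, dPos, dVel, dTransform);
    err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // This blocking copy also waits for the kernel to finish.
    return cudaMemcpy(hTransform, dTransform, n * 16 * sizeof(float), cudaMemcpyDeviceToHost);
}
```

The host would call `stepFrame` once per rendered frame and hand each 16-float block of `hTransform` to the renderer.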

The execution times were as follows:

In the next test I compared the overall performance of a workflow which included rigid body integration as before,
sphere-halfspace collision detection, and contact resolution (which runs on the CPU). The memory copies per frame were forces and
torques as before, the generated contact data (an array of structures, 36 bytes per contact), the body state data copied
to the host (16 floats per body), and then the body state adjustments copied to the device (16 floats per body).
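
For the sphere-halfspace step, a kernel along the following lines is one possibility. This is only a sketch under assumptions: the `Contact` field layout is made up so that it adds up to 36 bytes, the names `sphereHalfSpace` and `detectContacts` are hypothetical, and the atomic counter used to pack contacts into a dense array is just one way to do it. The test itself is the usual signed-distance check of the sphere centre against the plane.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Assumed layout: 9 fields of 4 bytes each = 36 bytes per contact.
struct Contact {
    float point[3];
    float normal[3];
    float penetration;
    int   bodyA;   // index of the sphere
    int   bodyB;   // -1 marks the static half-space
};
static_assert(sizeof(Contact) == 36, "Contact is expected to be 36 bytes");

// sphere[i]: xyz = centre, w = radius. plane: xyz = unit normal, w = offset.
__global__ void sphereHalfSpace(int n, const float4* sphere, float4 plane,
                                Contact* contacts, int* contactCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 s = sphere[i];
    // Signed distance from the sphere surface to the plane; negative means overlap.
    float dist = plane.x * s.x + plane.y * s.y + plane.z * s.z - s.w - plane.w;
    if (dist >= 0.0f) return;

    // At most one contact per sphere, so an array of n contacts cannot overflow.
    int slot = atomicAdd(contactCount, 1);
    float k = dist + s.w;  // centre-to-plane distance
    Contact c;
    c.point[0] = s.x - plane.x * k;
    c.point[1] = s.y - plane.y * k;
    c.point[2] = s.z - plane.z * k;
    c.normal[0] = plane.x;  c.normal[1] = plane.y;  c.normal[2] = plane.z;
    c.penetration = -dist;
    c.bodyA = i;
    c.bodyB = -1;
    contacts[slot] = c;
}

// Uploads the spheres, runs the kernel, and copies only the generated contacts back.
int detectContacts(const std::vector<float4>& spheres, float4 plane, std::vector<Contact>& out)
{
    int n = static_cast<int>(spheres.size());
    out.clear();
    if (n == 0) return 0;

    float4* dSpheres = nullptr;
    Contact* dContacts = nullptr;
    int* dCount = nullptr;
    cudaMalloc(&dSpheres, n * sizeof(float4));
    cudaMalloc(&dContacts, n * sizeof(Contact));
    cudaMalloc(&dCount, sizeof(int));
    cudaMemcpy(dSpheres, spheres.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(int));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    sphereHalfSpace<<<blocks, threads>>>(n, dSpheres, plane, dContacts, dCount);

    int count = 0;
    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    assert(count >= 0 && count <= n);
    out.resize(count);
    if (count > 0)
        cudaMemcpy(out.data(), dContacts, count * sizeof(Contact), cudaMemcpyDeviceToHost);

    cudaFree(dSpheres);
    cudaFree(dContacts);
    cudaFree(dCount);
    return count;
}
```

Copying back only `count` contacts keeps the device-to-host transfer proportional to the number of collisions rather than the number of bodies. Error checking and buffer reuse across frames are left out to keep the sketch short.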

The performance comparison gave the following results:

I am still a CUDA newbie and I know there are still a lot of optimizations and further work to be done, so
all this is just the beginning.

Anyway, I’d like to hear your suggestions, comments and feedback.


Cyclone, that rang a bell. As you said, it’s the physics engine from Ian Millington’s book “Game Physics Engine Development”. I was reading this book just recently.

Are you willing to share some of your CUDA code (now or later on maybe) ?


Not a problem, I’d just like to clean up the mess in the code before that (and before I submit the work); it will take me a week or so. Then I can send it to everyone who’s interested.

My mail is:

Great work!

I know PhysX has yet to get rigid bodies and collision detection on GPU, presently only supporting fluids, cloths and soft-bodies.

Perhaps it would be a good idea to limit CPU<->GPU transfers. It’s gonna be tricky, especially since some parts are still calculated on the CPU and bottom line is, you have to copy back relevant stuff to do game logic, rendering etc. Yet, there might be some room for optimization. Perhaps you’ve already thought of or even implemented this:

  • most of the bodies don’t move all the time and can be effectively put to sleep. You’d need a mechanism to signal which bodies have been updated on the GPU and require copyback and which are asleep.
  • probably there’s no need to copy all the forces and torques every frame - you can keep them within device memory and only update those affected by the player (ex. if he presses “forward”, send a small update that his body just got a forward force vector). You also don’t need to copy forces from GPU to CPU unless specifically needed by game logic and even then you might just copy the necessary ones. Forces are invisible, so are velocities, you’re more concerned about positions and orientations which you will need for rendering.

I’d also be interested in seeing the code and/or some documentation

Indeed, such a sleep system was implemented in the original Cyclone Physics engine.

At this stage, however, I was interested in measuring performance when the CPU and GPU DO have to do the work, so I left it out for now.

The main problem here is that it’s tricky to come up with a mechanism that selects the forces and torques that have changed, copies them in one go, and applies them to the appropriate bodies on the device. Copying a lot of data with a single cudaMemcpy is much faster than copying the same data with several cudaMemcpy calls (which is really slow), so this is something to think about.
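
One possible shape for such a mechanism, as a sketch only: the host packs every changed force into a small array of (body index, force) records, uploads the whole array with a single cudaMemcpy, and a tiny kernel scatters the records into the full force array that stays on the device. The `ForceUpdate` struct and the names `scatterForces` and `applyForceUpdates` are hypothetical, and the sketch assumes each body appears at most once per batch.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical 16-byte record: which body changed and its new force.
struct ForceUpdate {
    int   body;
    float fx, fy, fz;
};

// One thread per update record; writes land at scattered indices in force[].
__global__ void scatterForces(int count, const ForceUpdate* updates, float4* force)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    ForceUpdate u = updates[i];
    force[u.body] = make_float4(u.fx, u.fy, u.fz, 0.0f);
}

// dStaging is a preallocated device buffer with room for stagingCapacity records.
// dForce is the persistent per-body force array on the device.
cudaError_t applyForceUpdates(const std::vector<ForceUpdate>& updates,
                              ForceUpdate* dStaging, size_t stagingCapacity,
                              float4* dForce)
{
    assert(updates.size() <= stagingCapacity);
    if (updates.empty()) return cudaSuccess;  // nothing changed this frame

    int count = static_cast<int>(updates.size());
    // A single transfer for all changed bodies instead of one cudaMemcpy per body.
    cudaError_t err = cudaMemcpy(dStaging, updates.data(),
                                 count * sizeof(ForceUpdate), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) return err;

    int threads = 128;
    int blocks = (count + threads - 1) / threads;
    scatterForces<<<blocks, threads>>>(count, dStaging, dForce);
    return cudaGetLastError();
}
```

The same record format would work for torques. The scattered writes are not coalesced, but the batch is small when only a few bodies change per frame.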

Perhaps zero-copy (aka mapped memory) that was introduced in CUDA 2.2 would be an answer here, at least for those that have G200 cards or newer. I believe you could then simply write

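__global__ void dataCopyBack(int changed[], float4 d_V[], float4 d_P[], float4 hm_V[], float4 hm_P[])
{
	// global thread index (the rest of the expression is elided)
	int idx = threadIdx.x + ...

	// write back only the entries flagged as changed
	if(changed[idx] > 0)
	{
		hm_V[idx] = d_V[idx];
		hm_P[idx] = d_P[idx];
	}
}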

Assuming hm_V and hm_P are pointers to mapped host memory. You could write to such host memory directly from a kernel. The reads are coalesced. PCI logic should take care of coalescing the scattered writes into bursts. No explicit memcpys. This could work the other way around too, perhaps you’d memcpy to device an array of what has changed, and then let threads selectively read in only the changed float4s.

I only have an 8800 and cannot check this.

changed could be a bool array, I went with ints for coalescing.
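
For anyone who wants to try this on a card that supports it, here is a sketch of the host-side setup the kernel needs. It is untested speculation in the same spirit as the post: the helper names `enableMappedMemory`, `allocMapped` and `copyBackChanged` are made up, and the kernel gets a bounds check and the standard one-dimensional global index, which is an assumption about how it would be launched. The runtime calls used are cudaSetDeviceFlags, cudaHostAlloc with cudaHostAllocMapped, and cudaHostGetDevicePointer.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Same idea as the kernel above, with a bounds check and an assumed 1D launch.
__global__ void dataCopyBack(int n, const int* changed,
                             const float4* d_V, const float4* d_P,
                             float4* hm_V, float4* hm_P)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && changed[idx] > 0) {
        hm_V[idx] = d_V[idx];   // these writes go to mapped host memory
        hm_P[idx] = d_P[idx];
    }
}

// Should run before any other CUDA call; older toolkits reject it once a context exists.
cudaError_t enableMappedMemory()
{
    return cudaSetDeviceFlags(cudaDeviceMapHost);
}

// Page-locked host memory that the GPU can address directly. Free with cudaFreeHost.
float4* allocMapped(int n)
{
    assert(n > 0);
    void* p = nullptr;
    if (cudaHostAlloc(&p, n * sizeof(float4), cudaHostAllocMapped) != cudaSuccess)
        return nullptr;
    return static_cast<float4*>(p);
}

// hV and hP must come from allocMapped. dChanged, dV and dP are ordinary device arrays.
cudaError_t copyBackChanged(int n, const int* dChanged,
                            const float4* dV, const float4* dP,
                            float4* hV, float4* hP)
{
    // Ask the runtime for the device-side addresses of the mapped host buffers.
    float4* hmV = nullptr;
    float4* hmP = nullptr;
    cudaError_t err = cudaHostGetDevicePointer((void**)&hmV, hV, 0);
    if (err != cudaSuccess) return err;
    err = cudaHostGetDevicePointer((void**)&hmP, hP, 0);
    if (err != cudaSuccess) return err;

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    dataCopyBack<<<blocks, threads>>>(n, dChanged, dV, dP, hmV, hmP);
    err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // No memcpy, but the host must wait for the kernel before reading hV and hP.
    return cudaDeviceSynchronize();
}
```

Entries whose flag is zero are never touched, so the host buffers keep whatever they held from the previous frame for those bodies.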

Brilliant! Why didn’t I think of that? Here, I’ve been wracking my brain for effective ways to use zero-copy in my app and never came up with this one.

Let’s just hope that the PCIe burst logic is smart enough to deal effectively with “coalesced” reads/writes with gaps like this. If it is, this could mean much better multi-GPU scaling for me :) Unfortunately, I’m on vacation and won’t be able to try it out immediately :(

On vacation and still checking CUDA forums…now that’s dedication :lol:

Nice project, thanks for posting the results!

I was just wondering about this myself: is it worth offloading some of the game physics to the graphics card? After all, if you are doing complex rendering (shadow maps, HDR, etc…), isn’t it natural to be overlapping this work with the collisions, contacts, and rigid body calcs? In which case it would make more sense to keep those on the CPU?

Just a thought. I guess, though, that the Unreal engine uses PhysX (physics on the card), so maybe my thinking is wrong??

I’ve recently been playing the game Cryostasis - The Sleep of Reason, and the power of PhysX is really visible there. The game itself runs really slowly, at least on my PC, but that’s probably because of the cold effects. A friend of mine has a 9800, and he says the game runs fine on his PC. Anyway, it looks like it does make sense to use the GPU for physics; at least in the case of Cryostasis it looks impressive.

And about the suggestion above to use zero-copy memory: unfortunately I can’t try that out either, as I have an 8800 too…

As far as I know, all available physics engines that run on the GPU use it only for non-critical visual physics effects, such as cloth effects, glass shattering, liquid simulations and such, but not for actual in-game physics such as obstacles or stackable crates that interact directly with the player’s character and may influence the player’s progress in the game. (Big Mac also stated this above in this thread.) I believe this was done to avoid any dependency on a fully programmable GPU series which is not available in many gamers’ PCs (many gamers still use older hardware that would not support CUDA, for example).

Also it allows for exact same behavior in multiplayer games where every player’s physics runs in much the same way on the CPU. The GPUs might differ a tiny bit in precision, giving a little competitive edge to some players.

I agree. Has there been any further progress on this project since?

I am very interested in getting some GPU-accelerated physics, because I haven’t been able to simulate more than about 400 coins in real time on a Core 2 Duo using the open-source Bullet physics library. And this simulation still has too many flaws to be usable.

So I am just about to start implementing a little GPU-based physics library myself, following Ian Millington’s book as a guideline. My approach will be entirely based on the Thrust template library. All of the physics state variables will be stored in device_vectors. The C++ physics objects (particles, rigid bodies, etc.) will only contain an index to the respective position in the vector; the vectors themselves will be shared by all objects and grow as needed as new objects are created.

The device_vector storage should allow for coalesced memory access during most physics computations, which is nice. The physics calculations will be implemented as operators, as demonstrated in the “complicated transformations” Thrust code sample. I intend not to implement any dedicated CUDA kernels (no low-level programming). Hopefully Thrust will be versatile enough to do all I need, and if not, I will carefully try to extend Thrust. Thrust might be extended later to also do the computation on the CPU (maybe even supporting an SSE2-accelerated code path), so there would be no strict CUDA dependence.
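
To make the idea concrete, here is a rough sketch of what one such operator could look like, assuming a structure-of-arrays layout with one device_vector per state component. Only a single axis is shown. The names `IntegrateOp` and `integrate`, and the choice of updating velocity before position, are my own assumptions rather than anything prescribed by Thrust or by the book.

```cuda
#include <cassert>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Hypothetical operator applied to one particle at a time.
// Tuple layout: (position, velocity, force, inverse mass), one axis only.
struct IntegrateOp {
    float dt;
    explicit IntegrateOp(float step) : dt(step) {}

    template <typename Tuple>
    __host__ __device__ void operator()(Tuple t) const
    {
        // Velocity first, then position from the updated velocity.
        float v = thrust::get<1>(t) + thrust::get<2>(t) * thrust::get<3>(t) * dt;
        thrust::get<1>(t) = v;
        thrust::get<0>(t) = thrust::get<0>(t) + v * dt;
    }
};

// Advances every particle by dt. No hand-written kernel: Thrust generates the launch.
void integrate(thrust::device_vector<float>& x,
               thrust::device_vector<float>& v,
               const thrust::device_vector<float>& force,
               const thrust::device_vector<float>& invMass,
               float dt)
{
    assert(x.size() == v.size() && x.size() == force.size() && x.size() == invMass.size());
    thrust::for_each(
        thrust::make_zip_iterator(thrust::make_tuple(x.begin(), v.begin(),
                                                     force.begin(), invMass.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(x.end(), v.end(),
                                                     force.end(), invMass.end())),
        IntegrateOp(dt));
}
```

Because each component lives in its own contiguous vector, neighbouring threads read neighbouring floats, which is what makes the accesses coalesce. A physics object on the C++ side would then just store its index into these vectors.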


Just submitted it, and I’m thinking about some ideas to expand it. So far it implements in CUDA only the rigid body integration and fine collision detection from Cyclone.

I can send you my code if you post your email address or send me an email.

Starting from scratch seems like a lot of work. At the very least, you’ll have to implement the fine collision detection between two cylinders (coins) yourself, and keep in mind that resizing vectors (memory allocations) has a serious impact on performance. Also, I’d go for implementing the coarse collision detection phase using BSP trees. That will allow you to group contacts and implement the contact resolution routines on the GPU efficiently, which may be a considerable performance gain!

That’s one of the ideas I’m considering for further work.

Good luck!