Barra, a GPU Simulator to run CUDA apps

Hello everyone,

We are working on a GPU simulator capable of running CUDA applications, named Barra.

Barra simulates CUDA programs at the assembly language level (native ISA, not PTX).
It works directly with CUDA executables; neither source modification nor recompilation is required.

All source code is available under a BSD license, and binaries for GNU/Linux are provided.

More information and download links for release 0.1:

Barra is a work in progress. We only support a (growing) subset of the G80 ISA and the CUDA Driver API, but that is already enough to run many samples from the SDK.
(Unfortunately we don’t support OpenGL interop yet, so no cool video, sorry. ;))

We would appreciate your feedback. Does your favorite CUDA app run on Barra? :)


Does this mean you are having to build upon and/or continue the ISA reverse engineering work done for decuda?

Yes. I have to acknowledge that I relied heavily on decuda/cudasm to write Barra’s instruction decoder. Thanks to wumpus’s work I was able to catch up quickly. But as I progressed, I had to do my own reverse engineering.

Especially as decuda doesn’t tell you the instruction semantics, which are sometimes anything but straightforward (e.g. how FP instructions affect condition codes…)

So currently, there are instructions and encodings that are recognized by Barra and not decuda, as well as the opposite.

Barra’s support is stronger for basic arithmetic operations (MAD especially), while decuda supports more varied instructions (texture sampling, atomics…)

I thought about backporting everything I found in decuda/cudasm, but I just suck at coding in Python :)

(well, I actually have a hacked version of decuda which understands a few extra instructions, but it needs some serious clean-up…)

In the meantime, you can use Barra just to disassemble kernels, although it does not look as nice as decuda (no labels, no coloration…)

Sylvain, thanks so much for your impressive work! It’s a bit overwhelming at first, but I’m not complaining.
Kudos to you and David and David.

I am especially enjoying the details in the technical report right now. It’s a different way of thinking about the problems of simulation!

I’m sure you will get a ton of feature requests. I haven’t even gotten the simulator installed yet, but I’m already dreaming of asking it to simulate my kernels with different hardware configurations (what happens if global memory latency was higher or lower? Device memory bandwidth? PCIe transfer speed?).

I know my current kernels won’t run (I use atomics too much!) but I’m looking forward to such a powerful new tool. Thanks again from everyone in the CUDA community!


That sounds awesome!! Great work!

btw, why the name “Barra”? What does it mean?


Barra is a small island in Scotland:

Or rather, it is just supposed to be a (bad) pun with CUDA. Barra runs CUDA, so BarraCUDA.

Thanks… Heard this name “BarraCUDA” in many places (like Seagate BarraCUDA disk drive…)

Wiki tells me it’s a kind of fearful and dangerous fish…

Living in this side of the world and being a veggie, I have little clue about this fish.

Is the barracuda some kind of fantasy in the West? Seen as a symbol of power?

Nah, it’s just a crazy fish with a temper problem and a great theme song. ;)

Ah… I c. Thanks!

downloading …

Hi Sylvain,

Just thought that I would bring this to your attention since it is related to your simulator. This paper describing another GPU simulator based on PTX like our emulator, but including a full timing model built on SimpleScalar, was presented this week at ISPASS 2009:

For anyone else interested in NVIDIA-like GPU architectures, the paper presents a design exploration into the on-chip interconnect, hardware schemes for handling branch divergence, DRAM request scheduling algorithms, and adding L1/L2 caches to multiprocessors.




Thanks for the link, I didn’t know about this paper.

Their work is certainly impressive. They have a fully working multi-core simulator with timing models for both the cores and the interconnect.

However, I am still not convinced that PTX is the way to go for architectural simulation. I often saw important differences between unoptimized PTX and the corresponding native assembly code. Working at the PTX level also prevents modeling the tradeoff between using more registers and running fewer blocks, or doing more computation / spilling to local memory to save registers and run more blocks. Maybe a PTX-to-PTX optimizing compiler could help fill this gap…
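To make the register tradeoff above concrete, here is a minimal sketch of how per-thread register usage limits the number of blocks resident on one multiprocessor. The per-multiprocessor limits (8192 registers, 768 threads, 8 blocks) are assumed figures for a G80-class GPU and are not taken from this thread; the function name is hypothetical, and shared memory and register allocation granularity are ignored.

```python
# Assumed per-multiprocessor limits for a G80-class GPU (illustrative only,
# not stated anywhere in this thread).
REGISTERS_PER_SM = 8192
MAX_THREADS_PER_SM = 768
MAX_BLOCKS_PER_SM = 8


def resident_blocks(regs_per_thread, threads_per_block):
    """Blocks that fit on one multiprocessor at the same time.

    Hypothetical helper: ignores shared memory usage and the granularity
    at which the hardware allocates registers.
    """
    by_registers = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    return min(by_registers, by_threads, MAX_BLOCKS_PER_SM)


if __name__ == "__main__":
    # Same kernel, compiled to use more or fewer registers per thread.
    for regs in (10, 16, 32):
        blocks = resident_blocks(regs, threads_per_block=256)
        print(f"{regs} regs/thread: {blocks} block(s) of 256 threads per SM")
```

Under these assumptions, going from 10 to 32 registers per thread drops the resident block count from 3 to 1. A simulator that assumes an unlimited register file cannot show what a kernel would cost if it had to spill to stay within such a budget.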
Also, if I read the paper correctly, their simulated memory latency is an order of magnitude lower than the (unloaded) latency measured on NVIDIA GPUs. This makes me wonder how well the discussion of interconnect topology and caches, and the conclusion that latency does not matter, apply to an NVIDIA-like architecture.

Still, they seem to be far ahead of us in terms of features…

Their simulator is based on a project from 2 years ago for simulating GPU-like architectures (dynamic warp formation), which was finished just as CUDA was released. That gave them a head start, as they just had to add a PTX front end.

I agree that as far as architecture simulation goes, simulating the native instruction set is preferable to PTX. But most people are left with no other choice, since they probably do not consider reverse engineering the native ISA feasible. Obviously you have proven that wrong, but I would imagine that many people are still reluctant to try.

As for register usage, from what I understood, it looked like they took the register usage from the CUBIN generated for each kernel and then ran a register allocation phase on the PTX to get down to the same number of registers used by the native binary. Of course, their register spilling scheme could be completely different from what is implemented in the PTX JIT.

Also, where did you see latency numbers for DRAM? To me it looked like they just gave the interconnect latency to get to DRAM and then modelled GDDR3 latency with a DRAM simulator that takes the generic memory timing parameters (tRC, tRAS, etc.) as inputs.

I had the impression that they don’t perform register allocation at all and simulate an infinite number of registers, the register usage from ptxas being only used for block scheduling.

This is fine as long as ptxas does not spill registers, as they justify in footnote 4. This approach just does not allow simulating benchmarks that need register spills.

This was my understanding too. If they proceed this way, the total unloaded memory latency should be around 30 to 60 ns, whereas on NVIDIA GPUs the global memory latency (both as documented and as measured) is roughly 500 ns in the best case (no contention, no TLB miss…)

On the other hand, their simulated warps block as soon as they encounter a memory operation, whereas NVIDIA GPUs allow up to 6 pending transactions per warp. So they should have much less latency tolerance. I would expect both effects to more or less compensate each other.
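As a rough check of the "compensate each other" estimate, the figures above can be plugged into Little's law: the rate at which one warp can complete memory requests is bounded by the number of requests it keeps in flight divided by the latency. This is only a back-of-the-envelope sketch. It assumes a blocking warp has exactly one request in flight and ignores bandwidth limits and contention; the function name is hypothetical.

```python
def requests_per_us(latency_ns, pending_per_warp):
    """Upper bound on memory requests one warp completes per microsecond
    when it keeps `pending_per_warp` requests in flight (Little's law)."""
    return pending_per_warp * 1000.0 / latency_ns


# Figures quoted in the posts above.
hardware = requests_per_us(latency_ns=500, pending_per_warp=6)
# Assumption: a warp that blocks on each memory operation has 1 request in flight.
sim_low_latency = requests_per_us(latency_ns=30, pending_per_warp=1)
sim_high_latency = requests_per_us(latency_ns=60, pending_per_warp=1)

if __name__ == "__main__":
    print(f"hardware, 500 ns, 6 pending: {hardware:.1f} requests/us per warp")
    print(f"simulated, 30 ns, blocking:  {sim_low_latency:.1f} requests/us per warp")
    print(f"simulated, 60 ns, blocking:  {sim_high_latency:.1f} requests/us per warp")
```

With these numbers the per-warp bounds stay within a factor of about three of each other, which fits the expectation that the two effects roughly cancel out.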

Of course, simulating an architecture that is different from NVIDIA’s is fine, but it makes it more difficult to support claims that it is representative of industry practice.

In an attempt to catch up…

Version 0.2 is available, still from

Changes include:

  • Much faster simulation, thanks to many parts having been rewritten.
  • More instructions supported (64-bit and 128-bit mov, break, conversions…)
  • Statistics (number of instructions executed, branching efficiency, instruction mix, register file bandwidth) can be generated for each kernel instruction and imported into any spreadsheet application.

Edit: and also, a tutorial to show it’s not that hard.