CUDA/PTX Emulator: Would Anyone Be Interested?

So my research group has developed a GPU emulator that implements all of the PTX 1.3 specification as well as the CUDA high-level API as of version 2.1 (we do not implement the driver-level API). Unlike the emulator that ships with the CUDA SDK, which compiles CUDA programs into native code, our emulator takes the PTX representation of a program generated by the NVIDIA compiler and emulates it directly, one instruction at a time. So far, we can link our emulator-backed implementation of the CUDA runtime against most of the CUDA SDK that ships with 2.1 and run about 20% of the applications to completion.
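The instruction-at-a-time approach can be illustrated with a toy sketch (purely hypothetical and far simpler than the real emulator, which interprets actual PTX): a "kernel" is a list of three-address instructions over virtual registers, and the emulator steps through them in a fetch/decode/execute loop rather than compiling them to native code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy sketch of instruction-at-a-time emulation (illustrative only, not the
// actual implementation). A "kernel" is a list of three-address instructions
// over virtual registers; the emulator executes them one at a time instead of
// translating the kernel into native code first.
enum class Op { Mov, Add, Mul };

struct Insn {
    Op op;
    int dst, a, b;   // register indices; for Mov, 'a' is used as an immediate
};

struct Thread {
    std::vector<int64_t> regs;
};

void emulate(const std::vector<Insn>& kernel, Thread& t) {
    for (const Insn& i : kernel) {           // fetch/decode/execute loop
        switch (i.op) {
            case Op::Mov: t.regs[i.dst] = i.a; break;
            case Op::Add: t.regs[i.dst] = t.regs[i.a] + t.regs[i.b]; break;
            case Op::Mul: t.regs[i.dst] = t.regs[i.a] * t.regs[i.b]; break;
        }
    }
}
```

Interpreting every instruction this way is what makes emulation slow relative to native code, but it also makes it trivial to observe each thread's state at every step, which is exactly what analysis tools need.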

We are using this software internally for analysis of CUDA applications and GPU architecture research. However, it occurs to me that this might be a useful tool for people developing CUDA applications as well (most notably the emulator can handle huge numbers of threads, can detect GPU memory protection violations, executes more or less the exact same instructions that would be executed on a GPU, and has the same concept of a warp as a GPU).

As of now, we are not planning to publicly release this emulator in either open-source or binary form, simply because we do not want to support it. However, if there is enough interest, it would probably be possible to do a public release of the source code under a GPL-like license, or a proprietary release to a company under a BSD-like license (for a fee). The idea would be for us to branch our code base once we feel it is stable enough and pass off development to either the open-source community or a company. It is a fairly lengthy process to package up our code base into something we would want to release, and fairly annoying to secure a license from our legal department, so I am posting here to see if anyone is interested in picking up this project.

If you are interested in seeing our code base posted under an open source license, please respond to this post. If you are interested in branching our code base and continuing development under an open source license, please email me at gregory.diamos@gatech.edu . If you represent a company that might be interested in obtaining an unlimited usage license to distribute our code base for a fee, please contact me at gregory.diamos@gatech.edu as well.

See the attached CUDA source and Linux 64-bit binary for an example of the emulator running in debug mode. Note that the binary was compiled from sequence.cu using the following steps:

nvcc --cuda sequence.cu
g++ sequence.cu.cpp -o Sequence libocelot.a -I /home/normal/cuda/include/

Thanks

Wow, this sounds very impressive! I think a full PTX implementation could be a useful starting point for several research projects:

  • A GPU “valgrind”: You already mention that the code can detect memory violations, but I imagine it could also be instrumented to detect read-after-write hazards, some forms of race conditions, uninitialized memory access, and other common mistakes. This is what people assume the device emulation mode in nvcc will do, and they are surprised when they learn how many programming errors are uncaught by device emulation mode. Device emulation + valgrind can discover some of these, but not the threading errors which are more specific to CUDA.

  • CPU code generation: It sounds like some portion of this code could be leveraged to create the front-end of a PTX to multithreaded-SSE compiler. Current research on this topic has relied on source transformation of the original CUDA C code along with the auto-vectorization features of compilers to generate multicore+SSE code. It may be that directly converting PTX to SSE instructions with blocks distributed over multiple threads is the most efficient and simple way to run CUDA kernels on the host.

  • Semi-accurate performance prediction for different devices?: OK, this one is a long shot since PTX is only an intermediate representation, further optimized by the driver. Nevertheless, it would be awesome if there were a program which contained hardware descriptions of common CUDA devices and could predict the runtime of kernels on these devices. This would allow developers to understand the performance characteristics of their code on hardware they might not have (like mobile GPUs or compute capability 1.1 devices). This is going to be very hard because it will require reverse engineering many undocumented aspects of GPU behavior, and require keeping a catalog of the attributes of NVIDIA’s ever growing (and frequently renamed) product line.

I’m not in a position to do any of this (I’m just a physicist), but hopefully someone out there is…

  • As of now, our emulator can detect synchronization deadlocks and memory violations. Valgrind would probably be much faster since it relies on binary translation and inserts instrumentation instructions to detect errors (although I’m not sure, since Valgrind is much slower than most binary translators), but it would certainly be possible to add equivalent functionality to our emulator.

  • We are actually working on this now, using the emulator as a validation tool to ensure that our implementation is correct. It requires some careful handling to support packing warps into SSE instructions, but we believe that it is possible. We are targeting December 2009 for an implementation that dynamically translates PTX to x86 for all of the CUDA SDK.

  • This would be very difficult to do with all of the hardware nuances of NVIDIA GPUs. It would be possible to use the emulator to generate an instruction trace for any GPU and run it through an architecture simulator. If someone actually developed an architecture simulator for something that resembled an NVIDIA GPU, you might be able to get similar performance estimates.

We are planning on exploring the second two internally, but again do not plan to release any code unless someone else volunteers to continue to develop and maintain it.

This looks like an interesting project.

So we are not the only ones working on CUDA simulation…

We happen to be currently developing a GPU simulator running the native NVIDIA ISA.

It can run CUDA programs directly without any recompilation, at a speed comparable to CUDA emulation (-deviceemu).

Since it runs the exact same instructions as the GPU executes, we can achieve 100% binary-compatible emulation, which is not possible when emulating PTX, due to compiler optimizations.

And the source is readily available under a BSD license :)

http://gpgpu.univ-perp.fr/index.php/Barra

For the moment, our simulator is only functional (like an emulator) but we intend to quickly add a performance model through Transaction-Level Modeling.

That’s great to hear that other groups are also working on CUDA simulation. Our goals for doing PTX emulation are a little different from yours for architecture simulation, since you mention that you run the native NVIDIA ISA. Your simulator would be a much better choice for modeling a specific GPU architecture, whereas our emulator is geared at being compatible with any GPU that supports PTX.

Our end goals are to be able to do PTX level optimizations that are valid across many different architectures and then selectively compile an application to one of many architectures in a system. The emulator gives us a tool for validating any of our optimizations to quickly ensure that they generate correct programs (something that is very hard to do if we run everything on a black box gpu).

Also, it’s great that you release your code open source. I was browsing through a bit of it and I like your choice to implement the driver-level API rather than the high-level API. I am surprised that SDK programs will link against this without libcuda.so, since I thought that the NVIDIA compiler referenced symbols in the high-level API rather than the driver-level API, and that the implementation of libcudart.so used the driver-level API internally. If this is not the case (I have no idea, since I haven’t tried it), it would be much easier for us to use the driver-level API instead (the high-level API uses several implementation-specific functions with obscure meanings: __cudaRegisterVar, anyone?).
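For reference, interposing the high-level API boils down to exporting functions with the same names as libcudart and backing "device" pointers with emulator-owned host memory. A minimal sketch, using stand-in types instead of the real CUDA headers (the memcpy kind argument is simplified to an int here):

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of a replacement CUDA runtime. A real interposer would
// include the actual CUDA headers and match their signatures exactly; here we
// define minimal stand-ins so the sketch is self-contained.
typedef int cudaError_t;
const cudaError_t cudaSuccess = 0;

cudaError_t cudaMalloc(void** devPtr, size_t size) {
    *devPtr = std::malloc(size);             // emulated "device" allocation
    return *devPtr ? cudaSuccess : 1;
}

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, int /*kind*/) {
    std::memcpy(dst, src, count);            // host and "device" share memory
    return cudaSuccess;
}

cudaError_t cudaFree(void* devPtr) {
    std::free(devPtr);
    return cudaSuccess;
}
```

Linking an SDK program against a static or shared library exporting these symbols (instead of the real libcudart) routes every allocation and copy through the emulator, which is also where memory-violation checks can hook in.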

Also, I could not find the source files for your implementation of the architecture simulator engine that actually executes the instructions. Looks like it should be <unisim/component/cxx/processor/tesla/cpu.hh> from the header files but I didn’t see it on the repository. Just curious.

Awesome work btw, I know first hand how hard this kind of implementation is!

Yes, your approach makes much more sense from a compiler developer’s point of view, and ours from a hardware architect’s point of view.

The high-level CUDART API is implemented in libcudart.so, which then calls CUDA Driver API functions in libcuda.so.

For simulation, we replace libcuda.so with our own library, so that either the application or libcudart.so calls our own low-level functions.

This is possible as the simulator (Barra) is a dynamic library.

This should be doable. The key function is cuModuleLoadFatBinary.

If you use the appropriate NVCC flags to embed the PTX code in the CUDA executable you want to emulate, you will be able to emulate the produced executable directly.

It is in unisim_lib/unisim/component/cxx/processor/tesla:

https://guest@unisim.org/svn/devel/unisim_l…rocessor/tesla/

(yes, file organization in Unisim is somewhat… peculiar)

Thanks, that helps me a lot. We use a linked library as well, but replace libcudart.so. I am absolutely going to reimplement it to replace libcuda.so instead after reading this. In hindsight, I should have looked at the symbols in both libraries first…

It is actually interesting to note that the fat binary embedded by NVCC includes the PTX representation of every kernel by default. I have not looked at the other fields in detail, but based on your implementation I assume they include the GPU-specific binary as well.

Preliminary performance results for the emulator running on an Intel® Core™2 Duo CPU T7500 @ 2.20GHz in single-threaded mode:

Black Scholes:
Options count : 8000000
BlackScholesGPU() time : 10379.265625 msec
Effective memory bandwidth: 0.007708 GB/s
Gigaoptions per second : 0.000771
Reading back GPU results…
Checking the results…
…running CPU calculations.
Comparing the results…
L1 norm: 4.127388E-08
Max absolute error: 1.525879E-05
TEST PASSED

Aligned Types:
Allocating memory…
Generating host input data array…
Uploading input data to GPU memory…
Testing misaligned types…
uint8…
Avg. time: 8.706875 ms / Copy throughput: 0.005348 GB/s.
TEST PASSED
uint16…
Avg. time: 6.731156 ms / Copy throughput: 0.006918 GB/s.
TEST PASSED
RGBA8_misaligned…
Avg. time: 6.040906 ms / Copy throughput: 0.007708 GB/s.
TEST PASSED
LA32_misaligned…
Avg. time: 3.652125 ms / Copy throughput: 0.012750 GB/s.
TEST PASSED
RGB32_misaligned…
Avg. time: 3.665812 ms / Copy throughput: 0.012701 GB/s.
TEST PASSED
RGBA32_misaligned…
Avg. time: 3.008281 ms / Copy throughput: 0.015479 GB/s.
TEST PASSED
Testing aligned types…
RGBA8…
Avg. time: 5.261281 ms / Copy throughput: 0.008851 GB/s.
TEST PASSED
I32…
Avg. time: 4.983344 ms / Copy throughput: 0.009344 GB/s.
TEST PASSED
LA32…
Avg. time: 3.626406 ms / Copy throughput: 0.012841 GB/s.
TEST PASSED
RGB32…
Avg. time: 2.792156 ms / Copy throughput: 0.016677 GB/s.
TEST FAILED
RGBA32…
Avg. time: 2.771594 ms / Copy throughput: 0.016801 GB/s.
TEST PASSED
RGBA32_2…
Avg. time: 2.867188 ms / Copy throughput: 0.016236 GB/s.
TEST PASSED
Shutting down…

Bandwidth Test:
Running on…
device 0:PTX Emulator
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1540.9

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1541.0

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3103.3

&&&& Test PASSED

Press ENTER to exit…

Histogram64:
Initializing data…
…allocating CPU memory.
…generating input data
…allocating GPU memory and copying input data
Running GPU histogram (1 iterations)…
histogram64GPU() time (average) : 30576.292969 msec //3.118999 MB/sec
Comparing the results…
…histogramCPU()
histogram64CPU() time : 114.170998 msec //835.303480 MB/sec
Total sum of histogram elements: 100000000
Sum of absolute differences: 0
TEST PASSED
Shutting down…

The emulator has the following characteristics:

  • cudaMemcpy bandwidth is similar to native,
  • PTX JIT compiler overheads are about 20 ms the first time a kernel is called,
  • synchronous kernel launch overhead is about 12 us,
  • 100-300x slowdown over equivalent natively compiled CPU code.

More to come as I validate more of the SDK. Comments and questions are welcome.

Hello Gregory,

Your work has already been discussed in this forum before… but it was about PTX to Cell Broadband Engine or something like that.

I am curious about one thing. You have said that your implementation fully supports the concept of a warp, i.e., 32 threads all executing the same instruction at the same point in time.

So, this means programs linked against your libraries will behave exactly the same way as they would on the GPU. Is that right?
And so, this is quite different from deviceemu, where the results could differ because of incomplete simulation.

Hi Sarnath,

Yes, over fall 2008 we developed a PTX to Cell translator. It was a research effort only where we tried to determine how well architectures other than GPUs could be targeted by the PTX virtual ISA. Based on positive results from that project, we have begun a more in depth implementation with the goal of being able to dynamically move code across different processors based on performance characteristics of an application on a given type of processor. This PTX emulator is a starting point that allows us to run CUDA applications on any processor with a C compiler and validate other back end compilers or translators that we develop.

They will not behave identically, but it will be closer than deviceemu. In short, it will allow you to assume that warps of 32 threads operate in lockstep, as they do in current NVIDIA GPUs.

Our implementation conforms to the machine model specified by PTX. This machine model has a concept of a thread, a warp (SIMD unit of threads), a CTA (unit of warps), and a kernel (unit of CTAs). For performance reasons, the specification does not strictly constrain the order in which warps are scheduled on hardware. This means that a GPU is free to split/recombine warps dynamically, context switch warps to hide memory latency, or even pick a random scheduling order. People can play tricks to avoid synchronizing by sharing data within a warp, assuming that the warp size is 32 (which breaks under deviceemu, which has a warp size of 1). NVIDIA could release a GPU in the future with a warp size other than 32 that would conform to PTX and break those apps. You would need a full architecture simulator that exactly modeled a specific GPU to get the same order (Barra, by Sylvain Collange, would probably have a schedule closer to what is actually run on a GPU).

That being said, our emulator uses a warp size equal to the CTA size (1024) and uses program dominator analysis to determine potential points for splitting and recombining warps. This means that all threads within a CTA operate in lockstep unless there is a branch that is not taken by all threads in the warp, and that warps will automatically be recombined as soon as possible without the need for explicit __syncthreads() calls.
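The practical difference between lockstep execution and deviceemu's warp size of 1 shows up in any warp-synchronous exchange without a barrier. Here is a toy sketch (hypothetical, not the emulator's actual scheduler) where each emulated thread writes its id to shared memory and then reads its neighbor's slot without a __syncthreads-style barrier:

```cpp
#include <cassert>
#include <vector>

// Two ways an emulator can schedule the steps of a warp-synchronous exchange:
//   step 1: shared[tid] = tid;
//   step 2: result[tid] = shared[tid ^ 1];   // read neighbor, no barrier
// Lockstep (GPU-like warp) runs step 1 for every thread before any thread
// runs step 2; run-to-completion (deviceemu-like, warp size 1) runs each
// thread to the end before starting the next.
std::vector<int> run(int nThreads, bool lockstep) {
    std::vector<int> shared(nThreads, -1), result(nThreads, -1);
    if (lockstep) {
        for (int t = 0; t < nThreads; ++t) shared[t] = t;             // step 1
        for (int t = 0; t < nThreads; ++t) result[t] = shared[t ^ 1]; // step 2
    } else {
        for (int t = 0; t < nThreads; ++t) {   // one thread at a time
            shared[t] = t;
            result[t] = shared[t ^ 1];         // may read a stale slot
        }
    }
    return result;
}
```

With lockstep scheduling every thread sees its neighbor's write; with run-to-completion scheduling, thread 0 reads slot 1 before thread 1 has written it, which is exactly the class of bug deviceemu fails to reproduce.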

Hello Gregory,

Is your research group looking @ OpenCL as well?

OpenCL is closer to CUDA. So your work can really solve the heterogeneous computing issues as well.

From the quote above and from what you said after that, it occurs to me that your emulator (or simulation) will give exact results as on hardware, provided the program is correct. How race conditions are exposed might actually differ.

CUDA programmers should NOT make assumptions about the underlying hardware and how it executes, for example the order in which warps are scheduled. So, well-written CUDA programs should work fine and produce correct results on your emulator. Am I right?

THANKS for your time!

Best Regards,

Sarnath

Sarnath,

Yes, we are trying to look at the entire programming stack, from the high-level language down to the processor-specific ISA. The very long term goal is to start with either a CUDA or OpenCL (or another language) program and compile down to an execution model (we have a project working on this: http://www.ece.gatech.edu/research/labs/casl/harmony.html ) that expresses a program as a set of kernels with data flow and control flow, where each kernel is compiled to a virtual instruction set like PTX, or the source is inlined directly as in OpenCL. The application is executed by a runtime that examines the resources in the system and maps kernels onto different cores. The use of a virtual instruction set or inlined source code allows the runtime to dynamically recompile a kernel for any processor in the system. The runtime determines which kernels can be executed in parallel, rather than forcing the application developer to write code to launch kernels on different processors if they exist.
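As a toy sketch of that execution model (hypothetical, not the actual Harmony implementation): a kernel carries implementations for several backends, and the runtime maps it onto whichever processor is actually present in the system.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of kernel dispatch in a heterogeneous runtime: each
// kernel holds one implementation per backend (e.g. compiled from PTX), and
// the runtime picks the first backend from its preference list that the
// kernel supports.
struct Kernel {
    std::map<std::string, std::function<int(int)>> backends; // backend -> code
};

int dispatch(const Kernel& k, const std::vector<std::string>& available,
             int arg) {
    for (const std::string& b : available) {
        auto it = k.backends.find(b);
        if (it != k.backends.end()) return it->second(arg); // run on this core
    }
    throw std::runtime_error("no backend for kernel");
}
```

In the real system the per-backend implementations would be produced on the fly by translating the kernel's PTX, which is what makes the mapping decision deferrable to run time.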

We are using CUDA exclusively now as there have not been any OpenCL compilers released yet.

Yes, as long as there are no race conditions, the emulator should produce identical results to a native GPU program. It will actually have fewer race conditions than a GPU because an entire CTA runs in lock step rather than just a warp of 32 threads. I agree that programmers shouldn’t assume anything about the underlying hardware, but I have seen a couple production libraries (CUDPP for example) that assume a warp size is 32.

Regards,

Greg

Hi Greg,

OpenCL would have been one step forward if it had released an intermediate representation common to all platforms.

That would have made life easier.

Maybe one should consider Java byte code with some parallel extensions to represent a parallel kernel in an intermediate form. I am ignorant enough to make stupid suggestions… but just my 2 cents.

Yeah, that’s pretty good then. Many programs assume a warp size of 32, and I don’t think it is a big issue after all.

But interestingly, one of the authors of CUDPP even advocated that programmers should use the warp size reported by “cudaGetDeviceProperties” and not assume 32.

Thanks for your time,

Good luck on your endeavours!

Best Regards,

Sarnath

Most of the recent research involving virtual instruction sets has been pioneered by LLVM, which has the general philosophy of retaining as much information from the high-level language as possible in the ISA. PTX is extremely similar to LLVM, but it supports the concept of a thread hierarchy, which LLVM does not. Java’s philosophy started out with compiling high-level languages to byte code and executing the byte code via emulation. This was really slow and is where Java got a reputation for performing poorly. Now, most JVMs either translate byte code to native code or dynamically compile Java source down to native code and execute that. So all three approaches have become very similar recently. Out of all of these, we like PTX because it is inherently very highly multithreaded.

Just wanted to clarify that CUDPP uses the DEVICE_EMULATION macro to select between code for a warp size of 1 and a warp size of 32. Compiling without defining DEVICE_EMULATION and linking against an implementation that did not assume a warp size of 32 would introduce race conditions. From what I can tell, they don’t use the cudaGetDeviceProperties function.
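The selection pattern described above looks roughly like this (macro name as in the post; the exact CUDPP code may differ):

```cpp
#include <cassert>

// Sketch of compile-time warp-size selection: device-emulation builds run one
// thread at a time, so warp-synchronous tricks only hold with a "warp" of 1;
// native GPU builds assume the hardware's 32-thread lockstep warps.
#ifdef DEVICE_EMULATION
#define WARP_SIZE 1
#else
#define WARP_SIZE 32
#endif

// Warp-synchronous code then strides by WARP_SIZE; e.g. a reduction that
// skips __syncthreads() within a warp is only safe if WARP_SIZE matches the
// hardware's actual lockstep width.
int warpSizeUsed() { return WARP_SIZE; }
```

This is why linking such code against a runtime whose effective warp size differs (deviceemu's 1, or this emulator's CTA-wide warps) changes which race conditions are exposed.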

But anyway, I agree that most of the time it isn’t a big deal at all.

Glad to help,

Greg

A visual demonstration of the emulator running. It was actually pretty easy to add OpenGL support…

http://www.gdiamos.net/classes/translator/OcelotOpenGL.avi

Left is the emulator, right is the GPU on my laptop.

What codec did you use? It won’t play on my machine (Vista x64).

I used mencoder with the mpeg4 codec (DivX). If that still doesn’t work for you, I have added mpeg2 ( http://www.gdiamos.net/classes/translator/…penGL_mpeg2.avi ) which should be more standard and ( http://www.gdiamos.net/classes/translator/OcelotOpenGL.ogv ) for people using linux.

VLC seems to work on the AVI file. Neat project.

I know many other people have said this but this is incredible work…

I have been wandering from forum to forum, thread to thread, and it usually ends with either a big question mark or an incorrect assessment: equating emulation that reproduces the same resulting calculations with emulation that also reproduces the behavior of the graphics device, complete with memory transfers, resource starvation, wait times, etc.

I am a physics-student-researcher/developer and many of my senior colleagues have been asking me about implementing GPGPU. They of course want something solid before we spend thousands of dollars on a workstation.

I want to flag this next statement for NVIDIA as well this thread, hopefully they will see the immediate value to them in your project…

TO NVIDIA: The main thing preventing purchase of Kepler hardware (especially the high end Tesla) is the inability to prove performance gains in addition to accurate calculations.

This project is EXACTLY what I know I am looking for, and thank you very much for all the work you have already done on it. Especially something this demanding… understanding implementation at the assembly level gets very little emphasis these days. The result is a tremendous burden on a limited number of developers to do highly mission-critical work that allows for almost everything else.