Ocelot 1.1.560 Released: An open-source reimplementation of CUDA for GPUs and CPUs

We are pleased to announce the release of Ocelot 1.1.560, a dynamic compilation framework for PTX and an open-source reimplementation of the CUDA runtime. Ocelot supports emulation of PTX kernels, native execution on CPUs, and native execution on NVIDIA GPUs. It also includes a comprehensive and extendible back-end optimizing compiler for PTX.

Ocelot can be downloaded directly from http://gpuocelot.googlecode.com/files/ocelot-1.1.560.tar.bz2, or you can visit http://code.google.com/p/gpuocelot/ for documentation and source code access.

Version 1.1 includes various bug fixes as well as several new features:


Three target devices


PTX 1.4 Emulator


Memory Checker

 		- out-of-bounds accesses

 		- misaligned accesses

Shared Memory Race Detector
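The kinds of checks the memory checker performs can be sketched in a few lines. This is an illustrative sketch, not Ocelot's actual implementation; the `Allocation` record and `checkAccess` function are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record of one device allocation tracked by the emulator.
struct Allocation {
    uintptr_t base;
    size_t size;
};

// Before each emulated load/store, a memory checker verifies that the
// access lies entirely inside a tracked allocation (bounds check) and
// that the address is a multiple of the access width (alignment check).
bool checkAccess(const Allocation& a, uintptr_t address, size_t width) {
    bool inBounds = address >= a.base && address + width <= a.base + a.size;
    bool aligned  = (address % width) == 0;
    return inBounds && aligned;
}
```

An out-of-bounds or misaligned access would fail the corresponding predicate, letting the emulator report the faulting instruction instead of silently corrupting state.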

PTX 1.4 JIT Compiler and CPU Runtime


Execute CUDA programs natively on CPU targets without emulation

Support for any LLVM target

*Requires LLVM 2.8svn

Can achieve over 80% of theoretical peak FLOPs/OPs on CPU targets



NVIDIA GPU Device

 		- Recompiles PTX kernels using the NVIDIA Driver


Reimplementation of the CUDA Runtime


Device Switching

	- The same host thread can simultaneously control multiple devices.

New Memory Model

	- Device allocations are shared among all host threads




Extendible optimization pass interface for PTX

	- Per-Block, Per-Kernel, Per-Module passes
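The shape of such a pass interface can be sketched as follows. All names here (`Pass`, `runPass`, the toy IR structs) are hypothetical stand-ins; Ocelot's real pass and IR classes are richer:

```cpp
#include <string>
#include <vector>

// Toy IR stand-ins for illustration only.
struct BasicBlock { std::string label; };
struct Kernel { std::string name; std::vector<BasicBlock> blocks; };
struct Module { std::string path; std::vector<Kernel> kernels; };

// A pass declares its granularity; a pass manager dispatches accordingly.
class Pass {
public:
    enum Type { PerBlock, PerKernel, PerModule };
    explicit Pass(Type t) : type(t) {}
    virtual ~Pass() = default;
    const Type type;
    virtual void runOnBlock(BasicBlock&) {}
    virtual void runOnKernel(Kernel&) {}
    virtual void runOnModule(Module&) {}
};

// Applies a pass at the granularity it declares.
void runPass(Pass& p, Module& m) {
    switch (p.type) {
    case Pass::PerModule:
        p.runOnModule(m);
        break;
    case Pass::PerKernel:
        for (auto& k : m.kernels) p.runOnKernel(k);
        break;
    case Pass::PerBlock:
        for (auto& k : m.kernels)
            for (auto& b : k.blocks) p.runOnBlock(b);
        break;
    }
}

// Example per-block pass: counts basic blocks across the module.
struct BlockCounter : Pass {
    BlockCounter() : Pass(PerBlock) {}
    unsigned count = 0;
    void runOnBlock(BasicBlock&) override { ++count; }
};
```

A new optimization plugs in by subclassing `Pass` at the appropriate granularity, so per-block transforms never need to iterate over the module themselves.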

Trace Generator


Extendible interface for instrumenting PTX kernels

Can examine the complete system state after each instruction is executed

	i) Registers

	ii) Memory Accesses

	iii) Last instruction executed

	iv) Thread activity mask
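Conceptually, a trace generator is a callback the emulator invokes after every instruction with a snapshot of the state listed above. The sketch below uses hypothetical names (`TraceEvent`, `TraceGenerator`), not Ocelot's actual API:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical snapshot of emulator state after one instruction.
struct TraceEvent {
    std::string instruction;          // last instruction executed
    std::vector<int64_t> registers;   // register values
    std::vector<uintptr_t> accesses;  // addresses touched by loads/stores
    uint64_t activeMask;              // thread activity mask
};

// Subclass this and register it with the emulator to instrument a kernel.
class TraceGenerator {
public:
    virtual ~TraceGenerator() = default;
    virtual void event(const TraceEvent&) = 0;
};

// Example instrumentation: count instructions and memory accesses.
struct Profiler : TraceGenerator {
    uint64_t instructions = 0, memoryAccesses = 0;
    void event(const TraceEvent& e) override {
        ++instructions;
        memoryAccesses += e.accesses.size();
    }
};
```

Because the callback fires once per emulated instruction, tools like profilers, race detectors, and branch-divergence analyzers can all be written against the same interface without modifying the emulator.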

Open Projects for Ocelot 1.2

  1. Full PTX 2.0 support

  2. AMD GPU Devices

  3. SIMT on CPU vector units

  4. Asynchronous kernel execution

  5. Multi-threaded emulator device

Are you creating one context per device per process, then?

Awesome work, Greg, congrats! The multi-gpu stuff is really exciting.

Are sin() & friends implemented?

Yeah, on the first CUDA call we create one context per device and share each context across all host threads.

Yes, all PTX 1.4 instructions except for trap, call, and pmevent are supported, including transcendentals and textures.


Could you please elaborate on why one would want to share all contexts across all host threads? Why not a 1:1 mapping? What's the added value?



The semantics of most threading models include a flat memory space that is shared among threads in the same process. This is useful for lightweight communication because threads can communicate by simply passing pointers to shared data structures. A 1:1 mapping gives each thread a separate address space, making threads in the same process behave like separate processes. With shared contexts, you can do things like allocating memory with one thread, then passing a pointer to another thread which launches a kernel that modifies it.
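The shared-context behavior mirrors ordinary host threading, where any thread may use a pointer another thread allocated. A plain C++ analogue of the allocate-in-one-thread, use-in-another pattern (no CUDA required; function names are illustrative):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Thread A allocates a buffer in the process-wide heap, just as an
// Ocelot device allocation is created once and is then visible to
// all host threads sharing the context.
std::vector<int>* allocateBuffer(std::size_t n) {
    std::vector<int>* p = nullptr;
    std::thread a([&] { p = new std::vector<int>(n, 0); });
    a.join();
    return p;
}

// Thread B receives only the pointer and modifies the shared data,
// analogous to launching a kernel on memory another thread allocated.
void writeFromOtherThread(std::vector<int>* p, int value) {
    std::thread b([p, value] { (*p)[0] = value; });
    b.join();
}
```

With a 1:1 thread-to-context mapping, the second step would fail: the pointer from thread A would be meaningless in thread B's separate address space.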