Ocelot 1.1.560 Released: An open-source reimplementation of CUDA for GPUs and CPUs

We are pleased to announce the release of Ocelot 1.1.560, a dynamic compilation framework for PTX and an open-source reimplementation of the CUDA runtime. Ocelot supports emulation of PTX kernels, native execution on CPUs, and native execution on NVIDIA GPUs. It also includes a comprehensive and extensible back-end optimizing compiler for PTX.

Ocelot can be downloaded directly from http://gpuocelot.googlecode.com/files/ocelot-1.1.560.tar.bz2, or you can visit http://code.google.com/p/gpuocelot/ for documentation and source code access.

Version 1.1 includes various bug fixes as well as several new features:

1. Three target devices

   a. PTX 1.4 Emulator
      - Memory checker: detects out-of-bounds and misaligned accesses (see the example after this list)
      - Shared memory race detector

   b. PTX 1.4 JIT Compiler and CPU Runtime
      - Executes CUDA programs natively on CPU targets without emulation
      - Supports any LLVM target (requires LLVM 2.8svn)
      - Can achieve over 80% of theoretical peak FLOPs/OPs on CPU targets

   c. NVIDIA GPU JIT
      - Recompiles PTX kernels using the NVIDIA driver
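To make the memory checker concrete, here is a minimal, hypothetical CUDA program (not part of the Ocelot distribution) with an off-by-one bug. Running it under Ocelot's emulator with the memory checker enabled should report the out-of-bounds store, whereas real hardware may silently corrupt memory:

[code]
#include <cstdio>
#include <cuda_runtime.h>

// Off-by-one bug: the <= should be <, so the thread with i == n
// stores one element past the end of the allocation.
__global__ void offByOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n)
        data[i] = 2.0f * i;
}

int main()
{
    const int n = 256;
    float* d = 0;
    cudaMalloc((void**)&d, n * sizeof(float));
    offByOne<<<2, 256>>>(d, n); // 512 threads; thread 256 writes data[256]
    cudaThreadSynchronize();
    cudaFree(d);
    return 0;
}
[/code]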

2. Reimplementation of the CUDA Runtime

   a. Device switching: the same host thread can simultaneously control multiple devices (see the sketch after this section)
   b. New memory model: device allocations are shared among all host threads
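As a rough sketch of what device switching enables, this hypothetical snippet drives every visible device from a single host thread, something the stock CUDA runtime of this generation does not allow once a thread is bound to a context:

[code]
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(int* p, int v) { p[threadIdx.x] = v; }

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev)
    {
        cudaSetDevice(dev); // switching mid-stream is legal under Ocelot
        int* buf = 0;
        cudaMalloc((void**)&buf, 32 * sizeof(int));
        fill<<<1, 32>>>(buf, dev);
        cudaThreadSynchronize();
        cudaFree(buf);
        printf("launched on device %d\n", dev);
    }
    return 0;
}
[/code]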

3. PTXOptimizer

   a. Extensible optimization pass interface for PTX, with per-block, per-kernel, and per-module passes (illustrated below)
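For a sense of what the pass interface looks like, here is an illustrative sketch of a per-kernel pass. The class and method names (PTXKernelPass, runOnKernel, ir::Kernel) are assumptions for illustration only and should be checked against the actual headers in the Ocelot source tree:

[code]
#include <iostream>
#include <string>

namespace ir { class Kernel; } // stand-in for Ocelot's PTX IR kernel type

// Hypothetical per-kernel pass base class; per-block and per-module
// passes would expose analogous hooks at their own granularity.
class PTXKernelPass
{
public:
    virtual ~PTXKernelPass() {}
    virtual std::string name() const = 0;
    virtual void runOnKernel(ir::Kernel& kernel) = 0;
};

// A trivial pass that just announces each kernel it visits; a real
// pass would walk the kernel's basic blocks and rewrite instructions.
class NoOpPass : public PTXKernelPass
{
public:
    std::string name() const { return "NoOpPass"; }
    void runOnKernel(ir::Kernel&) { std::cout << name() << " ran\n"; }
};
[/code]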

4. Trace Generator

   a. Extensible interface for instrumenting PTX kernels (see the sketch after this section)
   b. Can examine the complete system state after each instruction is executed:
      i) Registers
      ii) Memory accesses
      iii) Last instruction executed
      iv) Thread activity mask
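As a sketch of how a trace generator might look, here is an instruction counter that consumes one event per executed instruction. The type names below are stand-ins for illustration, not Ocelot's actual declarations; see the trace interface headers in the source tree for the real API:

[code]
#include <bitset>
#include <cstdio>

// Stand-in for the per-instruction event Ocelot hands to a generator;
// the real structure also exposes registers and memory addresses.
struct TraceEvent
{
    int PC;                  // program counter of the executed instruction
    std::bitset<512> active; // thread activity mask for this instruction
};

// Stand-in base class: one callback per executed instruction, plus a
// hook that fires when the kernel finishes.
struct TraceGenerator
{
    virtual ~TraceGenerator() {}
    virtual void event(const TraceEvent& e) = 0;
    virtual void finish() = 0;
};

// Counts dynamic instructions and total active-thread events.
class InstructionCounter : public TraceGenerator
{
    unsigned long long _instructions;
    unsigned long long _threadEvents;
public:
    InstructionCounter() : _instructions(0), _threadEvents(0) {}
    void event(const TraceEvent& e)
    {
        ++_instructions;
        _threadEvents += e.active.count();
    }
    void finish()
    {
        std::printf("%llu instructions, %llu active-thread events\n",
                    _instructions, _threadEvents);
    }
};
[/code]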

Open Projects for Ocelot 1.2

  1. Full PTX 2.0 support

  2. AMD GPU Devices

  3. SIMT on CPU vector units

  4. Asynchronous kernel execution

  5. Multi-threaded emulator device

Are you creating one context per device per process, then?

Awesome work, Greg, congrats! The multi-gpu stuff is really exciting.

Are sin() & friends implemented?

Yeah, on the first CUDA call we create one context per device and share each context across all host threads.

Yes, all PTX 1.4 instructions except for trap, call, and pmevent are supported, including transcendentals and textures.

Hi,

Could you please elaborate on why one would want to share all contexts across all host threads? Why not a 1:1 mapping? What's the added value?

thanks

eyal

The semantics of most threading models include a flat memory space that is shared among threads in the same process. This is useful for lightweight communication, because threads can communicate by simply passing pointers to shared data structures. A 1:1 mapping gives each thread a separate address space, making threads in the same process behave like separate processes. With shared contexts, you can do things like allocate memory on one thread, then pass the pointer to another thread, which launches a kernel that modifies it.
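For example, a minimal pthreads program under the shared-context model (a hypothetical sketch, assuming Ocelot's shared allocations) might look like this: the main thread allocates device memory and a worker thread launches a kernel on the same pointer, which a 1:1 thread-to-context mapping would forbid:

[code]
#include <cstdio>
#include <cuda_runtime.h>
#include <pthread.h>

__global__ void scale(float* p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

// Runs on a second host thread: uses a pointer allocated by main().
static void* worker(void* arg)
{
    float* d = static_cast<float*>(arg);
    scale<<<1, 128>>>(d, 128);
    cudaThreadSynchronize();
    return 0;
}

int main()
{
    float* d = 0;
    cudaMalloc((void**)&d, 128 * sizeof(float)); // allocated by main thread
    cudaMemset(d, 0, 128 * sizeof(float));

    pthread_t t;
    pthread_create(&t, 0, worker, d); // kernel launch happens on thread t
    pthread_join(t, 0);

    cudaFree(d);
    std::printf("done\n");
    return 0;
}
[/code]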