Ocelot Pre-Release

I’m pleased to announce that we are gearing up for a few new Ocelot releases. The first one is 1.3.967.

Features in the stable 1.3.967 release include:

  • Support for PTX 1.4
  • PTX Emulator
  • PTX to LLVM to x86/ARM JIT
  • Memory Checker
  • Memory Race Detector
  • Interactive Debugger
  • Prototype AMD Device Support
  • A compiler optimization pass framework for PTX
  • Various instrumentation passes
  • A complete reimplementation of the CUDA Runtime
  • Numerous bug fixes and performance improvements

A packaged pre-release can be downloaded from ocelot-1.3.967. I am going to leave it up for a week before migrating it to the main project website. Please post back if you run into any problems with it.

This is the last release that will support devices requiring PTX 1.4; it is intended to be a stable final version that includes all bug fixes up to this point. Rather than maintaining multiple versions, we plan to evolve Ocelot along with the development of PTX, dropping support for older PTX versions as newer ones come online. Support for the older versions will not be rolled into newer releases, which limits the amount of testing that we need to do.

We are also gearing up for a PTX 2.x release, due out by the end of the week, that will carry the more interesting new developments. I’ll post back with more details about it.

The full releases are now available on the main Ocelot website, or directly at 1.3.967 and 2.0.969.

Here is the feature list for 2.0.969:

 - PTX 2.2 and Fermi device support.

   a) Floating point results should be within the ULP limits given in the PTX ISA manual.

   b) Over 500 unit tests verify that the behavior matches NVIDIA devices.

 - Four target device types:

   a) A functional PTX emulator.

   b) A PTX to LLVM to x86/ARM JIT.

   c) A PTX to CAL JIT for AMD devices (beta).

   d) A PTX to PTX JIT for NVIDIA devices.

 - A full-featured PTX 2.2 IR:

   a) An analysis/optimization pass interface over PTX.

     i)   Control flow graph.

     ii)  Dataflow graph.

     iii) Dominator/Postdominator trees.

     iv)  Structured control tree.

   b) Optimizations can be plugged in as modules.

 - Correctness checking tools:

   a) A memory checker (detects unaligned and out of bounds accesses).

   b) A race detector.

   c) An interactive debugger (allows stepping through PTX instructions).
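As an illustration (this example is mine, not part of the release), the memory checker targets bugs like the missing bounds guard in the hypothetical kernel below; on real hardware such an access may silently corrupt memory, while the checker can flag it at the faulting instruction:

```cuda
#include <cstdio>

// Hypothetical kernel: when gridDim.x * blockDim.x exceeds n, the
// highest-numbered threads read and write past the end of the buffer
// because the usual `if (i < n)` guard is missing.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * factor;   // out of bounds for i >= n
}

int main()
{
    const int n = 100;
    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // 4 blocks of 32 threads = 128 threads for only 100 elements:
    // threads 100..127 access memory they do not own.
    scale<<<4, 32>>>(d_data, n, 2.0f);
    cudaThreadSynchronize();

    cudaFree(d_data);
    return 0;
}
```

The same kernel is also the sort of thing the interactive debugger helps with: stepping through the emulated PTX shows exactly which thread issues the offending load or store.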

 - An instruction trace analyzer interface:

   a) Allows user-defined modules to receive callbacks when PTX instructions are executed.

   b) Can be used to compute metrics over applications or perform correctness checks.
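To make the callback model concrete, here is a sketch of a user-defined analyzer. The header paths and names (`trace::TraceGenerator`, `trace::TraceEvent`, `ocelot::addTraceGenerator`) are recalled from the Ocelot sources and may not match this release exactly; treat this as pseudocode for the interface described above:

```cuda
// Sketch only -- class and header names are assumptions, not verified
// against the 2.0.969 tree.
#include <ocelot/api/interface/ocelot.h>
#include <ocelot/trace/interface/TraceGenerator.h>
#include <iostream>

class InstructionCounter : public trace::TraceGenerator
{
public:
    long long count;

    InstructionCounter() : count(0) {}

    // Invoked by the emulator for each dynamic PTX instruction.
    virtual void event(const trace::TraceEvent &e)
    {
        ++count;
    }

    // Invoked when the kernel finishes.
    virtual void finish()
    {
        std::cout << "dynamic PTX instructions: " << count << "\n";
    }
};

// Register the analyzer before launching kernels:
//   InstructionCounter counter;
//   ocelot::addTraceGenerator(counter);
```

From a callback like this you can accumulate any metric you like (instruction mixes, memory footprints) or cross-check results, which is how the bundled correctness tools are built.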

 - A CUDA API frontend:

   a) Existing CUDA programs can be directly linked against Ocelot.

   b) Device pointers can be shared across host threads.  

   c) Multiple devices can be controlled from the same host thread (cudaSetDevice can be called multiple times).
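For example, under Ocelot a single host thread can switch devices mid-stream, which the stock CUDA runtime of this era did not allow once a context had been established. The host code below is a hypothetical sketch; the device numbering is whatever Ocelot enumerates on your machine:

```cuda
#include <cuda_runtime.h>

int main()
{
    float *a = 0, *b = 0;

    cudaSetDevice(0);                 // e.g. the PTX emulator
    cudaMalloc((void **)&a, 1024);

    // A second cudaSetDevice from the same thread would fail under the
    // stock runtime after the first allocation; Ocelot's frontend
    // permits it, so one thread can drive several devices.
    cudaSetDevice(1);                 // e.g. the LLVM x86 JIT
    cudaMalloc((void **)&b, 1024);

    cudaFree(b);
    cudaSetDevice(0);
    cudaFree(a);
    return 0;
}
```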

There are also some interesting features that are coming online in the development branches:

  1. Remote devices (start up an Ocelot server and attached devices will be visible to CUDA applications on client nodes)

  2. Software warp formation (use CUDA to program your SSE/AVX units)

  3. PTX instrumentation (arbitrary code can be inserted into CUDA kernels as they are launched; so far we have recorded hot paths, CTA schedules, and load balance)

  4. The AMD device backend is becoming more stable every day (thanks entirely to Rodrigo Dominguez), and about half of the CUDA SDK examples now execute on an AMD GPU.