I want your CUDA application traces

I recently added the ability to checkpoint CUDA applications and replay individual kernels through Ocelot. Rather than using this for reliability, the intention is to facilitate the automatic creation of CUDA benchmarks and regression tests.

I invite anyone who has an application that they wouldn’t mind contributing to check out Ocelot from here: http://code.google.com/p/gpuocelot/source/checkout

Build and install the trunk, link your application against Ocelot, set up a configure.ocelot file to enable trace capture, run your application, and post the resulting traces to this thread. Ideally you should use the most recent version of NVCC targeting the highest supported shader model (NVCC 4.0, -arch sm_21). Capture and replay should work on any Ocelot device (emulated, llvm, nvidia, or ati).
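As a rough sketch of those steps (the file names, the build command, and the `-locelot` link flag are assumptions for illustration; adapt them to your project and Ocelot's own build instructions):

```shell
# Sketch only: app.cu, build.py, and the flags below are placeholders,
# not a documented recipe.
#
# 1. Check out and build the trunk:
#      svn checkout http://code.google.com/p/gpuocelot/source/checkout
#      (build and install per the project's instructions, e.g. ./build.py --install)
# 2. Compile your application and link against Ocelot instead of libcudart:
#      nvcc -arch=sm_21 -c app.cu -o app.o
#      g++ app.o -locelot -o app
# 3. Run with a configure.ocelot (checkpoint section enabled) in the
#    working directory; traces are written to the configured path:
#      ./app
steps="checkout -> build/install -> link -locelot -> run with configure.ocelot"
echo "$steps"
```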

An example configure.ocelot checkpoint section is as follows:

checkpoint: {
	enabled: true,
	path: "../../tests/traces/ptx2.3/basic/",
	prefix: "UnstructuredMandelbrot_",
	suffix: ".checkpoint"
}
Because these traces snapshot memory before and after kernel execution, they can become extremely large. Fortunately, they are also typically very amenable to compression, so please compress your trace before posting. I have a 900MB trace that compressed down to 64KB using bzip2. If you have something huge that you still want to contribute, feel free to host it yourself and post a link to it.
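To see why the compression ratio can be so extreme, here is a small stand-in experiment (the file name is made up; a real checkpoint of mostly-zeroed device memory behaves similarly):

```shell
# Create a stand-in for a trace full of repetitive memory state: 1 MB of zeros.
dd if=/dev/zero of=demo.checkpoint bs=1024 count=1024 2>/dev/null

# Compress it with bzip2, keeping the original for comparison.
bzip2 -k -f demo.checkpoint

# Compare sizes: the .bz2 file should be a tiny fraction of the original.
wc -c demo.checkpoint demo.checkpoint.bz2
```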

Periodically I will consolidate the traces posted to this thread into regression test/benchmark suites and post them on the Ocelot website. Anyone is free to download and use them.

I’ll start out by posting a trace from a SQL inner-join benchmark. Join Trace

Some Caveats:

  1. Embedded pointers within global memory are not currently supported. If you do this, your application will produce a trace that will silently fail during execution. I don’t plan on adding support for this in the near future.

  2. The trace format captures the memory state before and after kernel execution as well as the PTX assembly code for each launched kernel. If you don’t want to release your source code, but are comfortable releasing a binary and a checkpoint of your memory state, this may be appealing to you.

EDIT: Updated to note that the trace capture branch is now merged with the trunk.

I’d be happy to submit a trace of hoomd on a typical benchmark. How fast are we talking about file size growing here? Given your description, it seems that the file size grows linearly with the number of kernel launches, correct? Even a 1-second long benchmark of hoomd contains 10,000+ kernel calls.

The file size grows as (memory_allocated_on_device * number_of_kernel_launches); one trace file will be created for each kernel launch. For applications that launch the same kernel multiple times on different data, it is probably sufficient to include traces for only the first (few) kernel(s) in order to be able to characterize the behavior of the application.
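Under that model, a quick back-of-the-envelope estimate (the 256 MB figure is hypothetical; the launch count is from the hoomd example above) shows why capturing every launch is impractical:

```shell
# Hypothetical: 256 MB resident on the device, 10000 kernel launches.
mem_mb=256
launches=10000

# Uncompressed trace data grows as memory_allocated * number_of_launches.
total_gb=$(( mem_mb * launches / 1024 ))
echo "estimated uncompressed trace data: ${total_gb} GB"
```

At 2500 GB uncompressed, capturing only the first few launches of each distinct kernel is clearly the practical choice.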

Here are traces from the CUDA SDK and the Parboil benchmarks:


All in all, there are about 1400 kernel traces. Be warned, though: the gzip files are pretty big (100MB-1GB) and my upload bandwidth is not the best. I’ll continue to update that site with additional traces.