CUDA Benchmark Suite: Suggest Apps

I am trying to put together a preliminary benchmark suite of CUDA applications to address the current inability of researchers to evaluate architecture and compiler optimizations for GPUs. The current state of the practice is to run the SDK examples and any hand-coded applications that one has lying around. It is hard to determine how effective a given optimization is, or to verify the results reported by a research group, because no one has access to the same set of applications, and it is not clear whether anyone actually cares about the applications being tested.

The ideal workload should be representative of the most important applications for general purpose computing using CUDA, in the same way that SPEC/PARSEC/SPLASH/MediaBench provide a basis for comparing optimizations for CPUs. It should draw the most representative sections from widely used applications and algorithms written in CUDA. All of the applications should be open source, under a license that allows free commercial use.

Here is a preliminary list of application domains that I propose are important drivers for a benchmark suite for CUDA applications:

1) High Performance Computing
	A) Linear Algebra
	  i) QR decomposition (?)
	  ii) Singular value decomposition (?)
	B) Solvers
	  i) Some PDE solver (?)
	  ii) Graph Partitioning (?)
	  iii) K-Nearest Neighbor (?)
	C) Signal Processing
	  i) FFT/DCT (CUFFT/VSIPL)
	  ii) Convolution/FIR/IIR Filters (VSIPL)

2) Media/Graphics/Image Processing
	A) Video Encoding/Decoding
	  i) H.264 (?)
	B) Graphics
	  i) Ray Tracing (?)
	  ii) others?

3) Financial / Data Processing
	A) Option Pricing / Financial Simulation
	  i) Black-Scholes (many examples)
	  ii) Collateralized Debt Obligation (?)
	  iii) Portfolio Risk Analysis (?)
	B) Compression/Encryption
	  i) File compression (DEDUP?)
	  ii) AES?
	C) Database operations
	  i) Joins (Relation Joins on GPUs)
	  ii) Reductions/Prefix Sum (many examples)
	  iii) Sorting (GPUQuicksort)

4) Machine Learning / AI

5) Simulation
	A) Physics
	  i) Particles/Fluids/Aerodynamics (PhysX maybe)
	B) Digital/Analog Hardware
	  i) VHDL/Verilog/Netlist simulation (?)
	  ii) E&M Wave Propagation (?)
	  iii) SPICE/etc (?)
	C) Chemistry/Biology
	  i) Molecular Dynamics (NAMD)
	D) Astrophysics
	  i) N-body Simulation (?)

If anyone has any suggestions as to domains that I have missed or specific CUDA apps that fit into a particular domain, please let me know. If you are a researcher developing optimizations for CUDA applications, what would make it easier to use such a benchmark suite for any results that you publish? If you are an application developer, would you be willing to contribute either full CUDA applications or representative benchmarks in the hope that future improvements to GPUs or compilers will accelerate your application?

If you know of any open source CUDA applications that fit into these domains, would you please reply with a link?

Do you mean to benchmark GPU vs GPU, or GPU vs CPU?
If you’re doing the latter, you have to spend some effort to make sure the same problem is being solved in both domains, and document what your assumptions are (for example, whether PCIe data transfer time counts).
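
To make the transfer-time assumption explicit, one option is to report both numbers side by side. Here is a rough sketch in Python (not CUDA, so it runs anywhere); the `GpuTiming` class, the `speedups` function, and the timings are all made up for illustration and do not come from any real measurement.

```python
from dataclasses import dataclass


@dataclass
class GpuTiming:
    """Timings for one GPU run, in seconds. All values are supplied by the caller."""
    host_to_device: float  # copy of the inputs over PCIe
    kernel: float          # time spent in the kernel(s) only
    device_to_host: float  # copy of the results back over PCIe

    @property
    def end_to_end(self) -> float:
        return self.host_to_device + self.kernel + self.device_to_host


def speedups(cpu_seconds: float, gpu: GpuTiming) -> dict:
    """Return the speedup over the CPU with and without PCIe transfers counted."""
    return {
        "kernel_only": cpu_seconds / gpu.kernel,
        "with_transfers": cpu_seconds / gpu.end_to_end,
    }


# Hypothetical numbers, chosen only to show how far the two figures can diverge.
example = speedups(cpu_seconds=2.5, gpu=GpuTiming(0.25, 0.125, 0.25))
print(example)
```

Publishing both figures, along with a note on which one the headline number uses, lets readers compare results from groups that made different choices.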

But whatever the goal, it would indeed be interesting to have sample apps in each one of those categories you list.
Your graphics examples could likely be expanded into 40 more types ranging from tessellation to procedural modeling to photon mapping to object simplification.

One idea: you could look through the CUDAZone app list and see what you can fill in from there. There are both apps and published papers that can fill in some examples for you and suggest new ones.

Eventually you may wish to think about OpenCL comparisons as well, though of course that’s way early now.

The idea is to create a CUDA benchmark that can be run on any ISA with a CUDA compiler. My group is working on a back end compiler/translator from CUDA/PTX to x86/others for example. Other projects like Barra and GPGPU-Sim have simulators for architectures that are closer to NVIDIA GPUs, but could easily add additional features or change timing parameters to resemble something new.

Most mature languages (C/C++, Fortran, Java, SQL) have their own benchmarks, and in general comparisons are made between compiler targets, not between the same applications written in different languages; it is just too hard to make sure that each version of the application is equally well-written. My hope is that CUDA will eventually become cross platform, at least across GPUs and x86 CPUs, and in that case the fairest comparison would be between two instances of the same CUDA application running on two different processors.

:) My background in graphics is really not very strong. I would welcome any suggestions that people have.

CUDAZone is a good starting place, and I will see what I can find.

This will also be very useful for benchmarking the relative speeds of different hardware devices. It would be great to have a spreadsheet of runtimes for each of these tests on a variety of CUDA devices as a guide to help developers gauge performance changes moving from one CUDA device to another. Applications with different balances of memory I/O and floating point operations should cover the spectrum of possibilities pretty well.
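
One way to put a number on the balance between memory I/O and floating point work is arithmetic intensity (flops per byte moved). The sketch below, in Python for portability, compares two textbook kernels using idealized counts: it assumes single-precision data and that each array crosses the memory bus exactly once, which real kernels will not achieve. The function names are illustrative only.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Floating point operations per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def saxpy_intensity(n: int, elem_bytes: int = 4) -> float:
    # y = a*x + y: one multiply and one add per element;
    # reads x and y, writes y (three arrays of n elements).
    return arithmetic_intensity(2 * n, 3 * n * elem_bytes)


def matmul_intensity(n: int, elem_bytes: int = 4) -> float:
    # C = A*B for n x n matrices: 2*n^3 flops;
    # best case, A and B are read once and C is written once.
    return arithmetic_intensity(2 * n**3, 3 * n * n * elem_bytes)


for n in (1024, 4096):
    print(n, round(saxpy_intensity(n), 3), round(matmul_intensity(n), 1))
```

Under these assumptions SAXPY stays at about 0.17 flops/byte regardless of size, while matrix multiply grows with n, so a suite containing both kinds of kernel exercises the bandwidth-bound and compute-bound ends of the spectrum.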

I think a good solution for a bunch of these would be to generate a decent amount of test data for each area, store the data in a well-known/documented format, compress it as much as possible (7-zip probably), then make it freely available. You could put it together and then make a .torrent file for it so you weren’t always serving up huge amounts of data for free.

Then, anyone who wanted to benchmark their kernel against the data would just decompress the data they wanted and process it to see what time/accuracy they get.

Obviously, you’d need to pre-compute the solutions to the operations on the data so they’d have something to compare their results against for accuracy. Maybe that would just be a separate DVD image, since some people will already have fairly accurate kernels and just want to time their code.
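
The accuracy check against pre-computed solutions could look something like the following. This is only a sketch in Python: the function names and the default tolerance of 1e-5 are assumptions, and a real suite would probably want a tolerance chosen per benchmark.

```python
def max_relative_error(result, reference, floor=1e-30):
    """Largest element-wise relative error between a result and the reference.

    `floor` guards against division by zero when a reference value is 0.
    """
    if len(result) != len(reference):
        raise ValueError("result and reference must have the same length")
    return max(
        (abs(r - ref) / max(abs(ref), floor) for r, ref in zip(result, reference)),
        default=0.0,
    )


def within_tolerance(result, reference, rel_tol=1e-5):
    """True if every element is within rel_tol of the pre-computed solution."""
    return max_relative_error(result, reference) <= rel_tol


# Hypothetical kernel output compared against a stored reference solution.
print(within_tolerance([1.0, 2.000001], [1.0, 2.0]))  # small rounding difference
print(within_tolerance([1.0, 2.1], [1.0, 2.0]))       # a real discrepancy
```

Reporting the maximum error alongside the runtime makes it visible when a fast kernel is trading accuracy for speed.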

From that point, someone could write some kind of common benchmark with CUBLAS/CUFFT/etc. and let it run, then automatically submit to a web service for compilation and display on a website.

Hi,
You forgot the oil & gas field :) tons and tons of $$$$ there ;)
BTW - when you say database, out of curiosity, does anyone know of people/applications that use GPUs for database enhancements? This is something I’ve been wondering about for some time now.

thanks
eyal

I’ve thought about trying to port SQLite to CUDA for a while, but I’ve never gotten around to it. Now that OpenCL is out though, I think that would be a better choice. That would actually be a pretty cool and very useful programming contest, if NVIDIA ever sponsored another one.

I don’t know of anyone who has released database-related code for CUDA though. I suppose it is possible to port some parts of the database code, but some of it is inherently serial so you’d still be stuck with that bottleneck.

Anyway, as far as the linear algebra/convolution tests go…if I get some free time in the next couple of weeks, I’ll try to put together some kind of dataset and compile it into a DVD image and torrent it (like I wrote above). That way we can run some good benchmarks on various GPUs, various libraries on the same system/GPU, compare to CPU-based libraries and so forth.

Hi,

What parts of a DB would you port? I think searches/joins/data-warehouse activities etc., which might be the most time consuming, would require the relevant records to be copied to the GPU. Wouldn’t that take ages???

thanks

eyal

I wrote a long reply back, but the forum maintenance killed it…

Short version…you could get speedups for aggregate functions (reductions) like SUM(), MAX(), etc. You could also copy parts of the record (e.g. an id and one or two columns) to the GPU to do sorting and joining. I’d bet that you could even implement some/part of this functionality on top of CUBLAS, though obviously if you were really looking for max performance (who isn’t!) you’d want to custom build all of the kernels involved.

I suggested SQLite because the code is public domain (totally free), and very widely used. It’s also used semi-often in web environments as a sort of cache, and the entire database is stored in memory, so if you’re copying records to the GPU, you wouldn’t have to read them from the disk as well.
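
To illustrate why aggregates like SUM() and MAX() map well onto a GPU, here is the pairwise (tree) reduction pattern written out in plain Python. It is a CPU-side sketch of the access pattern only, not CUDA code, and `tree_reduce` is a made-up name. Within each pass, every pair is independent of the others, which is what would let one thread handle each pair.

```python
import operator


def tree_reduce(values, op, identity):
    """Reduce `values` with the associative operator `op` in log2(n) passes."""
    data = list(values)
    if not data:
        return identity
    while len(data) > 1:
        if len(data) % 2:
            data.append(identity)  # pad so the pass splits into two equal halves
        half = len(data) // 2
        # Each of these `half` combinations could run in parallel.
        data = [op(data[i], data[i + half]) for i in range(half)]
    return data[0]


column = [3, 1, 4, 1, 5, 9, 2]
print(tree_reduce(column, operator.add, 0))      # SUM(column)
print(tree_reduce(column, max, float("-inf")))   # MAX(column)
```

The same skeleton works for any associative operator with an identity element, so MIN() and COUNT() fit the pattern too.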

Check out the Berkeley View dwarfs:
http://view.eecs.berkeley.edu/wiki/Dwarf_Mine

Thanks for the time :)

Another followup if I may please. Large/Huge databases obviously can’t fit in the RAM (I’m talking about bank tables for example) and data will have to be brought from disk.

Still this would happen anyway, so not much gain there.

Whether or not we have the data in RAM, we’ll have to move some of the data to the GPU over PCIe. Do you think that sum/max etc. are still going to show a performance gain with the GPU?

What else do you think can be done in the DB area besides those functions? (I guess that statistical functions might be relevant if sum/max is. Maybe data warehouse operations? Reports?)

thanks

eyal

I am actually doing a project with a local company to port their back-end database engine to CUDA. I don’t want to get into a whole lot of details here because the work is not published and probably under NDA, but we are seeing very promising speedups on some of the low level query operations. Smaller databases that fit into memory are obviously faster, but there are some neat tricks you can do that take advantage of the higher bandwidth GPU DRAM by partitioning databases into segments that can fit in GPU memory and processing them one at a time. This paper provides some background reading that is in the public domain: http://gpgpu.org/tag/papers/page/4 . I will hopefully be able to publish something like a tech report at the end of the summer and contribute an open source version of the TPC-H benchmark suite (http://www.tpc.org/tpch/) doing some or all operations using CUDA, but that is very optimistic.

In any case, I think that it is interesting enough to be represented in this benchmark suite.
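
The general idea of partitioning a table into segments that fit in GPU memory can be sketched as below. To be clear, this is not the unpublished engine described above; it is a generic illustration in plain Python, with hypothetical names, where the per-segment work stands in for a copy to the device followed by a reduction kernel.

```python
def segments(column, max_rows):
    """Yield consecutive slices of `column`, each with at most `max_rows` rows."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    for start in range(0, len(column), max_rows):
        yield column[start:start + max_rows]


def partitioned_sum_max(column, max_rows):
    """Compute SUM and MAX of a column one segment at a time."""
    total = 0
    largest = None
    for seg in segments(column, max_rows):
        # Stand-in for: copy `seg` to device memory, then reduce it there.
        seg_sum, seg_max = sum(seg), max(seg)
        # Only the small partial results need to be combined on the host.
        total += seg_sum
        largest = seg_max if largest is None else max(largest, seg_max)
    return total, largest


# max_rows plays the role of "how many rows fit in GPU memory" (assumed value).
print(partitioned_sum_max(list(range(10)), max_rows=3))
```

Because each segment produces only a partial sum and a partial max, the amount of data coming back from the device stays tiny no matter how large the table is.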

Gregory, Thanks for the link.

The article is a bit old; I guess the performance numbers are higher now :)

Databases are an area where I didn’t think it was possible to gain performance with the GPU. I’m glad I was mistaken.

thanks

eyal

Hi Greg,

Your description of the current state of the practice is so accurate… :)
This is an excellent idea! Be sure I will use this benchmark suite in my own projects.

I am currently looking at applications in bioinformatics (DNA sequencing, matching…), such as GPU-HMMER. I think this could appear in your list.

Perhaps we could also write some benchmarks for this:

http://shootout.alioth.debian.org/

It would be a good way for people to visualize CUDA’s strengths and weaknesses…

I just found this today…I wonder if we could port some of the benchmarks to CUDA (especially when the FORTRAN compiler comes out):

http://www.spec.org/cpu2006/CFP2006/

EDIT: I just saw that Gregory mentioned SPEC in his original post. Oh well…maybe it is still a relevant link :/