Thrust v1.0 release: a high-level C++ template library for CUDA

We are pleased to announce the release of version 1.0.0 of Thrust, an open-source template library for data-parallel CUDA applications featuring an interface similar to the C++ Standard Template Library (STL). Thrust provides a flexible high-level interface for GPU programming that greatly enhances developer productivity while delivering high performance. Note that Thrust supersedes Komrade, the initial release of the library, and all future development will proceed under this name.

The thrust::host_vector and thrust::device_vector containers simplify memory management and transfers between host and device. Thrust provides efficient algorithms for:

    sorting - thrust::sort and thrust::sort_by_key

    transformations - thrust::transform

    reductions - thrust::reduce and thrust::transform_reduce

    scans - thrust::inclusive_scan and thrust::transform_inclusive_scan

    And many more!

Please refer to http://code.google.com/p/thrust/wiki/Documentation for a complete listing and join us on the thrust-users discussion group.

As the following code example shows, Thrust programs are concise and readable.

[codebox]
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;
    thrust::sort(d_vec.begin(), d_vec.end());

    return 0;
}
[/codebox]

Thrust provides high-level primitives for composing interesting computations.

This example computes the norm of a vector.

[codebox]
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <cmath>
#include <iostream>

// square computes the square of a number f(x) -> x*x
template <typename T>
struct square
{
    __host__ __device__
    T operator()(const T& x) const
    {
        return x * x;
    }
};

int main(void)
{
    // initialize host array
    float x[4] = {1.0, 2.0, 3.0, 4.0};

    // transfer to device
    thrust::device_vector<float> d_x(x, x + 4);

    // setup arguments
    square<float>       unary_op;
    thrust::plus<float> binary_op;
    float init = 0;

    // compute norm
    float norm = std::sqrt( thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op) );

    std::cout << norm << std::endl;

    return 0;
}
[/codebox]

Download Thrust here, and check out the tutorial to get started.

Thrust is open source under the Apache 2.0 license and available now at http://thrust.googlecode.com

How about multi-GPU programming? How easy is that with Thrust?

Well… Congratulations on this. Good to see this.

Congratulations guys–coincidentally, I used Komrade (er, Thrust I guess now) for the first time today :)

Coincidentally, for me as well, I was just pondering over designing a library that has these functionalities as a part of it. Good to see I am not alone.

On browsing through your doc, I realize that multi-GPU is a layer above what you have done. So, this should work cool.

Btw,
Do you support vector of vectors and N-D arrays?

–edit–
I think the vector of vectors, operator-overloaded data access, and other C++ gymnastics are the real hurdles to using C++ directly in HPC.
These methods hide the physical representation of the data, making memory copies to HPC devices a nightmare.
That’s probably a reason why OpenCL stuck to ‘C’.
But if you could get over this barrier with Thrust (I don’t know if we can), I think it will be useful.
Am I right in my thinking? What do you think?

We do plan to support vectors spanning multiple GPUs and plan to extend our algorithms to operate transparently on such vectors (e.g. reduce(), sort(), etc.). I’m not sure whether we can do this with CUDA right now, or whether it will require a more flexible cudaSetDevice() (i.e. one that can be interleaved with kernel calls). My understanding is that the driver API supports this already but the higher-level runtime API does not (yet).

You can use Thrust to implement multi-GPU algorithms, but Thrust does not provide any additional support for that usage right now.

Glad to have you aboard! :)

Currently, there’s no support for vector of vectors or multi-dimensional arrays. However, Thrust should help you build such data structures (N-D arrays at least). For example, I’m currently developing a small library for sparse linear algebra that uses Thrust extensively.

Regarding your second point, I think complex data structures with multiple levels of indirection (pointers) present a challenge for any high-performance parallel architecture. In many cases I think the solution will be to replace the more “natural” data structure or algorithm with one more amenable to parallelism. For example, the standard usage of the STL set and map data structures, which are based on binary trees, is not a good match for many-core processors that have hundreds or thousands of threads that need to modify the data structure simultaneously. A more “parallel friendly” implementation might instead store the elements in flat arrays (thrust::device_vector) and use sorting (thrust::sort) and a specialized compaction operation (thrust::unique) to achieve the same result. In general, there are many techniques to “flatten” irregularity out of algorithms so that they can be efficiently executed on wide parallel processors (e.g. segmented scan).

I don’t mean to suggest that all irregularity can be removed from these algorithms or data structures. It’s just that there are often efficient methods to cope with such irregularity in a way that exposes a high degree of fine-grained parallelism that maps well to architectures like the GPU.

Nbell,

Thanks for your replies!

I somehow think thrust can be compared to “brooks++” (or brooksGPU). In the sense that – it completely abstracts the GPU details and exposes device-vector operations.

Also regarding vectors spanning multiple devices – I don’t think that will be needed. Thrust in its current form can be used in multi-GPU code without any problem, I think.

Thanks!

Best Regards,
Sarnath

Congrats from me too.
Multi-GPU code will be a big boon in our work.

Here are some performance results for thrust::sort() with various data types. Thrust uses radix sort [1] for sorting all standard C data types (char, int, float, double, etc.). Although radix sort only sorts unsigned integer types, one can apply some bit-twiddling tricks to make it work for signed integers (e.g. negative ints) and floating-point values. Our radix sort code was derived from Mark Harris’ implementation, which should be among the fastest GPU sorting codes available.

The results in the attached image (thrust_sort.png) are for sorting 23.7M keys on a GTX 280. Since the cost of radix sort depends on the size of the data type, we expect performance to be fastest for char and slowest for double (and 64-bit integers, not shown).

The previous sorting results are for keys with random values. In practice, we often sort keys with values in a predefined range. For example, if we have a set of 32-bit unsigned integers whose values are all less than 1M, then we can avoid some work inside radix sort. Our radix sort implementation uses 4-bit histograms, so if all keys are less than 1M then we only need to sort the lower 20 bits of the keys. This saves a considerable amount of work (20 bits vs. 32 bits), so it is worthwhile to exploit this optimization.

When you call thrust::sort() on any integer type, it will quickly compute the maximum key in the array first and determine how many bits need to be sorted. The cost of this operation (which is just a maximum reduction) is negligible compared to even the fastest sort, so the overhead is minimal.

The attached figure (thrust_key_bits.png) shows sorting performance for varying numbers of key bits. When KeyBits <= 4 (i.e. all values are less than 16) the sorting rate is approximately 1275 MKey/s, which is 7x faster than the full 32-bit sorting rate of approximately 175 MKey/s. If the maximum value is less than 1M (KeyBits <= 20) then the sorting rate is 275 MKey/s, or 57% faster than the full 32-bit sort.

Anyway, the purpose of this exercise is to show that Thrust implements many optimizations behind the scenes to provide better performance. All this happens without any programmer intervention; you just call thrust::sort() and you’re done.

When I get a chance I’ll show similar results for thrust::reduce() and thrust::fill(), which work around suboptimal G80/GT200 memory coalescing behavior.

[1] http://code.google.com/p/thrust/wiki/FurtherReading
sort_key_bits.png
thrust_sort.png

Thanks a lot, I will be checking out this library to see if I can use it to accelerate some physics simulations with it. I plan to implement my own physics code based on Ian Millington’s book “Game Physics Engine Development”. Originally I intended to write my own GPU vector classes, but Thrust looks like it might be better than anything I could come up with.

One wish of mine: an alternative CPU based Thrust code path (SSE, SSE2 anyone?) so programs can also run without a GPU.

Christian

Very cool! Let us know how that goes.

Yep, that’s definitely something we’re considering for Thrust v2.0. The CPU backend could also support multicore systems using OpenMP or the like.