Thrust v1.2 release: A high-level C++ template library for CUDA

We are pleased to announce the release of Thrust v1.2, an open-source template library for developing CUDA applications. Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing.

Version 1.2 adds several new features.

As the following code example shows, Thrust programs are concise and readable.

#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/reduce.h>

#include <string>

// This example computes a run-length code for an array
// of characters using a key-value (or segmented) reduction
int main(void)
{
  // input data on the host
  std::string data = "aaabbbbbcddeeeeeeeeeff";
  size_t N = data.size();

  // copy input data to the device
  thrust::device_vector<char> input(data.begin(), data.end());

  // allocate storage for output data and run lengths
  thrust::device_vector<char> output(N);
  thrust::device_vector<int>  lengths(N);

  // compute run lengths
  size_t num_runs =
    thrust::reduce_by_key(input.begin(), input.end(),        // input key sequence
                          thrust::constant_iterator<int>(1), // input value sequence
                          output.begin(),                    // output key sequence
                          lengths.begin()                    // output value sequence
                          ).first - output.begin();          // compute the output size

  // output  is now [a,b,c,d,e,f]
  // lengths is now [3,5,1,2,9,2]

  return 0;
}
Get started with Thrust today! First download Thrust v1.2 and then follow the online quick-start guide. Refer to the online documentation for a complete list of features. Many concrete examples and a set of introductory slides are also available.

Thrust is open-source software distributed under the OSI-approved Apache License v2.0.


    Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection

    Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust’s unit tests and examples against Ocelot

    Thanks to Tom Bradley for contributing an implementation of normal_distribution

    Thanks to Joseph Rhoads for contributing the example summary_statistics

Nice work guys!

Hi, congratulations on this work. I really love the design of Thrust and I use it intensively for prototyping.
However, I have found some performance problems in Thrust 1.1. Compaction, for instance, appears to be slower than CUDPP in most cases, and this seems to be due to on-the-fly allocation of temporary buffers. Is this problem fixed in the new release?


Scan performance is about 25% faster than v1.1, so the stream compaction functions will be noticeably faster. In v1.3 we’ll optimize functions like copy_if to use much less temporary memory and fewer read/write passes (it currently does a complete scan, which is unnecessary). So if v1.2 isn’t fast enough for you, then v1.3 should be. If you find specific instances where the performance is lacking, let us know and we’ll work on it for v1.3.

This release was mainly about adding features and making things as robust as possible. For example, you can use large data types or complex functors without worrying whether Thrust algorithms will run out of registers and shared memory. There shouldn’t be any artificial limits on input sizes either, so you can scan/sort/reduce huge data sets.

Thank you! But why did you choose OpenMP rather than OpenCL?

That’s great. I rewrote Thrust’s copy_if so that I can pass a pre-allocated temporary buffer, and it gives me a very good performance improvement.

Do you have a schedule for the release of Thrust 1.3?

There’s an entry in our FAQ that answers that:

    When will Thrust support OpenCL?
      The primary barrier to OpenCL support is the lack of an OpenCL compiler and runtime with support for C++ templates (e.g. something similar to nvcc and the CUDA Runtime).

These features are necessary to achieve close coupling of the host and device codes.

We can support OpenMP because standard C++ compilers allow us to combine OpenMP and C++ templates. To our knowledge there is no compiler that provides us with a way to do the same with OpenCL. User-defined types and operators would also be difficult to support using a “driver”-style API (like OpenCL or the CUDA driver API). For example, if someone writes the following code,

struct MyStruct
{
  int x; float y; char z;
};

thrust::device_vector<MyStruct> v(10);

then the definition of ‘MyStruct’ needs to be available to all the kernels that operate on those types. With nvcc and the CUDA Runtime API this happens naturally, just as it would in a normal C/C++ program. However, with a driver API we’d have to know to insert the definition of ‘MyStruct’ into any code snippet we wanted to execute on the device.

While there are workarounds to this problem, none of them allow us to present the same “pure” interface of Thrust for CUDA and OpenMP.

We don’t have a pre-defined schedule for the next release, but if I had to guess I would say 4-6 months from now. However, it’s likely that we’ll improve copy_if before we make the next release, so if you’re willing to use a (potentially unstable) development version of Thrust, then you can get it sooner.

Anyway, I’ll post an update here when we’ve improved copy_if performance.

Is there any advantage in using the CUDA 3.0 toolkit over staying with the 2.3 toolkit?
When do you think Thrust will drop support for older toolkits, such as 2.3?


nvcc 3.0 supports C++ much better than any previous compiler release. To see what I mean, you can grep through the Thrust source for “WAR” to see the places where we’ve had to work around a bug.

We intend to drop support for nvcc 2.3 in Thrust v1.3, and as soon as possible in the development version.


Thank you for your response. I got it. I use the CUDA Runtime API rather than the Driver API, and I’m not familiar with OpenCL, so my perspective is limited.