Thrust v1.1 release: a high-level C++ template library for CUDA

We are pleased to announce the release of Thrust v1.1, an open-source template library for developing CUDA applications. Modeled after the C++ Standard Template Library (STL), Thrust brings a familiar abstraction layer to the realm of GPU computing.

Version 1.1 adds several new features, including support for pinned memory.

As the following code example shows, Thrust programs are concise and readable.


#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib> // for rand

int main(void)
{
    // generate twenty random numbers on the host
    thrust::host_vector<int> h_vec(20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device
    thrust::sort(d_vec.begin(), d_vec.end());

    return 0;
}



Get started with Thrust today! First download Thrust v1.1 and then follow the online tutorial. Refer to the online documentation for a complete list of features. Many concrete examples and a set of introductory slides are also available.

Thrust is open-source software distributed under the OSI-approved Apache License v2.0.

I’m pretty happy about pinned memory support. I would like nothing better than to be able to drop my own custom array wrapper.

I’m so seriously impressed with Thrust! Primarily the clean and elegant design, which was lacking from CUDPP (though they had their reasons).

Sadly we’re not ever going to be using it in our production environment, due to it being restricted to the Runtime API - but our research team is looking into using it internally to speed up some of their own tools, along with cublas and our own internal cuda libs. :)

Great work guys!

Thanks for the feedback!

Can you think of any features or changes (either to Thrust, the CUDA runtime API, or the CUDA driver API) that would enable use of Thrust in your production environment?

I should clarify, the #1 reason we don’t use the Runtime API (and anything that depends on it) is that previously (though this is less so the case right now) it lagged behind the Driver API, but primarily - it’s extra work to delay load cudart…

I haven’t looked into the problem of delay loading cudart lately, but I’m guessing it still requires one to abstract all calls to it into your own DLL, and then delay load that DLL instead… Unless I missed something in the release notes of the past few revisions where you guys implemented delay loading into cudart? (but I’m fairly certain I wouldn’t have missed something as huge as that.)

Thrust (and I’m guessing cublas? I can’t recall - I’ve never used it personally) also directly depends on cudart, meaning any calls to Thrust/etc. we want to make would also have to be abstracted into their respective shared libraries as well… (it’s viral, you see :P)

100% of our CUDA-accelerated apps don’t assume CUDA is installed, don’t assume any nVidia product of any kind is installed, and as such we can’t load any CUDA DLLs until the very last millisecond - when the user or our application has determined they ‘do’ indeed want to use CUDA, and they have everything installed to do so…

By contrast, the Driver API is exceedingly simple to delay load… and despite losing ‘features’ like cuda-gdb (unstable… difficult to use… etc.) and device emulation (serial thread execution, which means anything that requires warp/block synchronization is instantly going to break… requiring one to maintain a second codebase for those kernels that do require synchronization to work… ugh!) - we had no choice but to stick with the Driver API because of that.

The Driver API also draws a much clearer line between host and device code, making it simpler for regular devs (with no CUDA experience) not to get confused, and it simplifies the build process as well (which is important for people using custom build systems) - maintaining a nice happy development ecosystem for everyone, regardless of experience :P

As for the solution (now I’ve had my rant ;)) - it’s been pretty simple in my eyes for quite some time (read: years…) now… The solution depends on how cudart is implemented, though… I’ve always assumed it was built on top of the Driver API (no other magic nVidia driver calls, a pure Driver API wrapper), so I’m going to maintain that assumption for now.

If this is the case, the ‘best’ solution to me would be to directly implement delay-loading either into the Driver API (preferable, if cudart relies exclusively on the Driver API) or the Runtime API (if the Runtime API has other trickery besides Driver API calls)…

As it stands, when linking against the Driver API (and thus cudart), the first call into the Driver API (cuInit) - and thus the first call into cudart (and thus the first call into Thrust…?) - will force-load nvcuda.dll (or the equivalent, depending on OS) when loading your executable (eg: before the executable’s entry point is executed? I haven’t tested this recently, but this is what I recall) - which is simply unacceptable for any production environment… instant crash on any machine without nVidia CUDA drivers installed.

Edit: Fixed misinformation (re-read this last night at home… noticed this error)

Windows/msvc users have it nice and easy; they can use delayimp.lib to delay load the Driver API. *nix-based systems, though, require one to manually delay load (read: write your own cuda.lib/cuda.a library which dynamically loads the real one and, at runtime, checks for the existence of each function and redirects to it if it exists… otherwise throws a nice “invalid cuda version” exception).

(Ironically as I was writing this, my CEO came up and asked me specifically what was taking so long - and I just explained exactly what I’m explaining here… but with an emphasis on lack of debugging tools… except Nexus…)

Anyway this has been a widely known (well I thought so… maybe not?) issue for quite some time now, see below:…125186021?pli=1

I could find a few more I’m sure, but I think my rant explains the problem well enough for you guys to see the value of delay loading things on your end - and not ours…

Smokey – thanks for the very thorough explanation of this important issue. This is the first time I’ve come across this issue personally, though I’m sure others are aware of it. We’ll keep these concerns in mind going forward.

google: “Delay loading NVIDIA CUDA site:” - I am feeling lucky