Which is more efficient...

Suppose I wanted to chain a bunch of Thrust library calls: say, transform, then reduce, then sort_by_key, and then another transform or something like that.

I’m imagining that this would be less efficient than one kernel that could magically do all the things we needed it to.

Am I right in thinking that calling kernels sequentially from the host would be slower than just calling one magic kernel?

I only ask because there’s a lot of awesome thrust stuff I’d like to chain together but I don’t want my code to be slow either…

In essence, how do I use thrust more efficiently?

Yes, all other things being equal, it is more efficient to call one kernel than to call several.

For the most part, each call to a Thrust algorithm results in at least one kernel launch (some algorithms, such as sorts, may launch several).

The canonical advice is to fuse operations. This can’t be done ad infinitum or in every case, but fusion, primarily through the use of Thrust fancy iterators, lets you fold multiple operations into a single Thrust algorithm, so that those operations are accomplished in a single Thrust kernel call.

For example, suppose I had a reduction operation where I wanted to sum the squares of every element of an array.

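I could realize this naively as a thrust::transform (to square each element) followed by a thrust::reduce (to sum the squared elements). That costs two algorithm calls, plus an intermediate array that has to be written to global memory and then read back.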
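As a rough sketch of that naive two-step version (the `square` functor, the four-element input, and the `d_sq` temporary are made up for illustration, not taken from any particular codebase):

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <iostream>

// Hypothetical functor for this example: squares its argument.
struct square
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main()
{
    // Illustrative input: 1, 2, 3, 4
    thrust::device_vector<float> d_in(4);
    thrust::sequence(d_in.begin(), d_in.end(), 1.0f);

    // Step 1: square every element into a temporary array.
    thrust::device_vector<float> d_sq(d_in.size());
    thrust::transform(d_in.begin(), d_in.end(), d_sq.begin(), square());

    // Step 2: sum the temporary array in a separate algorithm call.
    float sum = thrust::reduce(d_sq.begin(), d_sq.end(), 0.0f, thrust::plus<float>());

    std::cout << sum << std::endl; // 1 + 4 + 9 + 16 = 30
    return 0;
}
```

The temporary `d_sq` is the part that fusion gets rid of.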

Instead, I can wrap the input in a transform iterator and pass that to thrust::reduce; the elements are then squared on the fly as they are being reduced.
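A minimal sketch of the transform iterator approach for the sum of squares, assuming a hypothetical `square` functor and a small made-up input:

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <iostream>

// Hypothetical functor for this example: squares its argument.
struct square
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main()
{
    // Illustrative input: 1, 2, 3, 4
    thrust::device_vector<float> d_in(4);
    thrust::sequence(d_in.begin(), d_in.end(), 1.0f);

    // Wrap the input iterators; dereferencing them applies square() lazily,
    // so no intermediate array of squared values is ever stored.
    auto first = thrust::make_transform_iterator(d_in.begin(), square());
    auto last  = thrust::make_transform_iterator(d_in.end(), square());

    // A single reduce call both squares and sums.
    float sum = thrust::reduce(first, last, 0.0f, thrust::plus<float>());

    std::cout << sum << std::endl; // 1 + 4 + 9 + 16 = 30
    return 0;
}
```

The same wrapping trick works with other algorithms that take input iterators, which is what makes fancy iterators the general tool for fusion.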

Since combinations of transforms followed by reductions are so common, Thrust provides the combined algorithm thrust::transform_reduce for exactly this use case.
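For the sum-of-squares case, a thrust::transform_reduce call might look something like this (again, the `square` functor and the input values are only illustrative):

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/transform_reduce.h>
#include <iostream>

// Hypothetical functor for this example: squares its argument.
struct square
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main()
{
    // Illustrative input: 1, 2, 3, 4
    thrust::device_vector<float> d_in(4);
    thrust::sequence(d_in.begin(), d_in.end(), 1.0f);

    // Arguments: input range, unary transform, initial value, binary reduction.
    float sum = thrust::transform_reduce(d_in.begin(), d_in.end(),
                                         square(),
                                         0.0f,
                                         thrust::plus<float>());

    std::cout << sum << std::endl; // 1 + 4 + 9 + 16 = 30
    return 0;
}
```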

There are whole presentations on fusion of operations in Thrust. Take a look at the presentations linked from the Thrust GitHub site.

This presentation:


begins to discuss thrust “best practices” including fusion around slide 23.

Hey, thanks!

I’m also starting to think that any algorithm that supports lazy evaluation would do very well in CUDA.