Thrust: Concurrency and Kernels

I tried finding this information elsewhere but have struggled to turn up anything useful.

  1. Does Thrust use concurrent copy/execute where possible? I.e., if I am doing a thrust::transform from two host vectors, will Thrust copy and execute in staggered segments, as in the example pictured below?
    [image: diagram of copies and kernel execution overlapping in staggered segments]
    My assumption is no, since the = operator that moves data from a host vector to a device vector doesn't appear to be linked to the algorithm execution. Even if it isn't automatic, can you use streams to manually segment your Thrust executions and copies to replicate this behavior?

  2. You can use something like:

thrust::raw_pointer_cast(d_vec.data())

to get access to a pointer for use in a CUDA kernel. Is there a way of doing the reverse, i.e. allocating memory with cudaMalloc and later wrapping that memory in a thrust::device_vector so you can use Thrust algorithms on it?

Any explanations and/or resource pages on either would be welcome.


If you do a thrust::transform on two host vectors, Thrust will dispatch the algorithm to the host backend, i.e. it will run that transform operation on the CPU. Thrust doesn't, under any circumstances, do what you have pictured; that has to be constructed yourself out of Thrust primitives. Although the linked example has useful components, in modern Thrust it's no longer necessary to use the experimental pinned allocator.
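As a rough illustration, a staged pipeline along those lines could look something like the sketch below. This is a minimal sketch, not a tuned implementation: the chunk count, stream count, pinned-buffer setup, and the thrust::plus functor are all illustrative assumptions on my part, not something from your original question.

// A minimal sketch of manually staged copy/compute overlap built from Thrust
// primitives. Assumptions (mine, not from the question): pinned host buffers,
// two streams, four equal chunks, and a simple vector add via thrust::plus.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

int main() {
  const int N = 1 << 20;
  const int chunks = 4;
  const int chunk = N / chunks;  // assumes N % chunks == 0

  // Pinned host memory is required for cudaMemcpyAsync to overlap with kernels.
  float *h_a, *h_b, *h_c;
  cudaHostAlloc(&h_a, N * sizeof(float), cudaHostAllocDefault);
  cudaHostAlloc(&h_b, N * sizeof(float), cudaHostAllocDefault);
  cudaHostAlloc(&h_c, N * sizeof(float), cudaHostAllocDefault);
  for (int i = 0; i < N; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

  thrust::device_vector<float> d_a(N), d_b(N), d_c(N);
  float* pa = thrust::raw_pointer_cast(d_a.data());
  float* pb = thrust::raw_pointer_cast(d_b.data());
  float* pc = thrust::raw_pointer_cast(d_c.data());

  cudaStream_t streams[2];
  cudaStreamCreate(&streams[0]);
  cudaStreamCreate(&streams[1]);

  for (int i = 0; i < chunks; ++i) {
    cudaStream_t s = streams[i % 2];
    const int off = i * chunk;

    // Stage this chunk's inputs on the chosen stream.
    cudaMemcpyAsync(pa + off, h_a + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s);
    cudaMemcpyAsync(pb + off, h_b + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s);

    // Queue the transform on the same stream, behind the copies. Note that
    // depending on the Thrust version this call may still synchronize the
    // stream before returning; Thrust 1.16+ offers thrust::cuda::par_nosync.
    thrust::transform(thrust::cuda::par.on(s),
                      d_a.begin() + off, d_a.begin() + off + chunk,
                      d_b.begin() + off, d_c.begin() + off,
                      thrust::plus<float>());

    // Copy this chunk's result back, still on the same stream.
    cudaMemcpyAsync(h_c + off, pc + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, s);
  }
  cudaDeviceSynchronize();

  cudaStreamDestroy(streams[0]);
  cudaStreamDestroy(streams[1]);
  cudaFreeHost(h_a); cudaFreeHost(h_b); cudaFreeHost(h_c);
  return 0;
}

The alternating streams are what allow one chunk's copies to overlap with the previous chunk's work; with a single stream everything would serialize.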

You cannot wrap a thrust::device_vector around a pre-existing device allocation (at least not at the level of this discussion; customizing the thrust::device_vector class yourself is a different topic, and in practice I never assume people are asking that sort of question unless they say so explicitly). However, you can take a "raw" pointer, like one returned from cudaMalloc, and wrap a thrust::device_ptr around it. That will likely give you what you need: the ability to use that allocation in Thrust algorithms. There are numerous questions on various forums demonstrating the use of thrust::device_ptr; here is one.
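For instance, a minimal sketch (the sizes and the fill/reduce calls here are just placeholders of my choosing, to show the pattern):

// A minimal sketch of wrapping a cudaMalloc'ed allocation in a
// thrust::device_ptr so that Thrust algorithms can operate on it directly.
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const int N = 1000;
  float* raw = nullptr;
  cudaMalloc(&raw, N * sizeof(float));

  // Wrap the raw device pointer; no allocation or copy happens here.
  thrust::device_ptr<float> dp = thrust::device_pointer_cast(raw);

  // device_ptr behaves like an iterator in Thrust algorithms.
  thrust::fill(dp, dp + N, 1.0f);
  float sum = thrust::reduce(dp, dp + N);
  printf("sum = %f\n", sum);  // expect 1000.0

  cudaFree(raw);
  return 0;
}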

I meant the case where you have two host vectors that you're trying to transfer to the device and run device algorithms on in a staggered fashion, à la CUDA copy/compute overlap. I suspected this was not an option out of the box.

Wasn’t aware of device_ptr. That helps.

Correct, it's not automatic or built-in.

You need to schedule the activity yourself using Thrust primitives, as the linked example shows. Each of the individual steps will require an API call, just as it would if you were doing it with "ordinary CUDA".