Thrust: Concurrency and Kernels

I tried finding this information elsewhere but have struggled to turn up anything useful.

  1. Does Thrust use concurrent copy/execute where possible? I.e., if I am doing a thrust::transform from two host vectors, will Thrust copy and execute in staggered segments, as in the example pictured below?
    [image: diagram of copies and kernel execution overlapping in staggered segments]
    My assumption is no, since the = operator that moves data from a host vector to a device vector doesn't appear to be linked to the algorithm execution. Even if it isn't automatic, can you use streams to manually segment your Thrust executions and copies to replicate this behavior?

  2. You can use something like:

thrust::raw_pointer_cast(d_vec.data())

to get access to a pointer for use in a CUDA kernel. Is there a way of doing the reverse, i.e. allocating memory with cudaMalloc and later wrapping that memory in a thrust::device_vector so you can use Thrust algorithms on it?

Any explanations and/or resource pages on either would be welcome.


If you do a thrust::transform on two host vectors, Thrust will dispatch the algorithm to the host backend, i.e. it will run that transform operation on the CPU. Thrust doesn't, under any circumstances, do what you have pictured; that has to be constructed yourself out of Thrust primitives. Although the linked example has useful components, in modern Thrust it's no longer necessary to use the experimental pinned allocator.
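As a rough illustration, a staged pipeline along those lines could look something like the sketch below. This is a minimal sketch, not a tuned implementation: the chunk count, stream count, pinned-buffer setup, and the thrust::plus functor are all illustrative assumptions on my part, not something from your original question.

// A minimal sketch of manually staged copy/compute overlap built from Thrust
// primitives. Assumptions (mine, not from the question): pinned host buffers,
// two streams, four equal chunks, and a simple vector add via thrust::plus.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

int main() {
  const int N = 1 << 20;
  const int chunks = 4;
  const int chunk = N / chunks;  // assumes N % chunks == 0

  // Pinned host memory is required for cudaMemcpyAsync to overlap with kernels.
  float *h_a, *h_b, *h_c;
  cudaHostAlloc(&h_a, N * sizeof(float), cudaHostAllocDefault);
  cudaHostAlloc(&h_b, N * sizeof(float), cudaHostAllocDefault);
  cudaHostAlloc(&h_c, N * sizeof(float), cudaHostAllocDefault);
  for (int i = 0; i < N; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

  thrust::device_vector<float> d_a(N), d_b(N), d_c(N);
  float* pa = thrust::raw_pointer_cast(d_a.data());
  float* pb = thrust::raw_pointer_cast(d_b.data());
  float* pc = thrust::raw_pointer_cast(d_c.data());

  cudaStream_t streams[2];
  cudaStreamCreate(&streams[0]);
  cudaStreamCreate(&streams[1]);

  for (int i = 0; i < chunks; ++i) {
    cudaStream_t s = streams[i % 2];
    const int off = i * chunk;

    // Stage this chunk's inputs on the chosen stream.
    cudaMemcpyAsync(pa + off, h_a + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s);
    cudaMemcpyAsync(pb + off, h_b + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s);

    // Queue the transform on the same stream, behind the copies. Note that
    // depending on the Thrust version this call may still synchronize the
    // stream before returning; Thrust 1.16+ offers thrust::cuda::par_nosync.
    thrust::transform(thrust::cuda::par.on(s),
                      d_a.begin() + off, d_a.begin() + off + chunk,
                      d_b.begin() + off, d_c.begin() + off,
                      thrust::plus<float>());

    // Copy this chunk's result back, still on the same stream.
    cudaMemcpyAsync(h_c + off, pc + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, s);
  }
  cudaDeviceSynchronize();

  cudaStreamDestroy(streams[0]);
  cudaStreamDestroy(streams[1]);
  cudaFreeHost(h_a); cudaFreeHost(h_b); cudaFreeHost(h_c);
  return 0;
}

The alternating streams are what allow one chunk's copies to overlap with the previous chunk's work; with a single stream everything would serialize.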

You cannot wrap a thrust::device_vector around a pre-existing device allocation (at least not at the level of this discussion; customizing the thrust::device_vector class yourself is a different topic, and in practice I never assume people are asking that sort of question unless they say so explicitly). However, you can take a "raw" pointer, like one returned from cudaMalloc, and wrap a thrust::device_ptr around it. That will likely give you what you need: the ability to use that allocation in Thrust algorithms. There are numerous questions on various forums demonstrating the use of thrust::device_ptr; here is one.
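For instance, a minimal sketch (the sizes and the fill/reduce calls here are just placeholders of my choosing, to show the pattern):

// A minimal sketch of wrapping a cudaMalloc'ed allocation in a
// thrust::device_ptr so that Thrust algorithms can operate on it directly.
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const int N = 1000;
  float* raw = nullptr;
  cudaMalloc(&raw, N * sizeof(float));

  // Wrap the raw device pointer; no allocation or copy happens here.
  thrust::device_ptr<float> dp = thrust::device_pointer_cast(raw);

  // device_ptr behaves like an iterator in Thrust algorithms.
  thrust::fill(dp, dp + N, 1.0f);
  float sum = thrust::reduce(dp, dp + N);
  printf("sum = %f\n", sum);  // expect 1000.0

  cudaFree(raw);
  return 0;
}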

I meant the case where you have two host vectors that you're trying to transfer to the device and run device algorithms on in a staggered fashion, à la CUDA copy/compute overlap. I suspected this was not an option out of the box.

Wasn’t aware of device_ptr. That helps.

Correct, it's not automatic or built-in.

You need to schedule the activity yourself using Thrust primitives, as the linked example shows. Each of the individual steps will require an API call, just as it would if you were doing it with "ordinary CUDA".