Thrust `__host__` side and `__device__` side behavior


Hi everyone,

I’m working with CUDA and Thrust, and I have a question regarding the behavior of Thrust functions when they are called in different contexts. Specifically, I’m interested in understanding the differences when calling a Thrust function on the host (`__host__`) versus on the device (`__device__`), assuming both use `thrust::device` as the execution policy.

Example: `thrust::set_union`

Background

My current understanding is as follows:

  • On the Host Side: When called from the host, Thrust handles the CUDA kernel launching, and the operations are executed in parallel on the GPU.
  • On the Device Side: I’m unclear about the behavior here. If a Thrust function is called from a device function, what happens? Since the resources are allocated to the specific device thread, does this mean the function will only run on a single thread?
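
For concreteness, here is a minimal sketch of the kind of host-side call I mean (the data and sizes are just illustrative):

```cuda
#include <thrust/device_vector.h>
#include <thrust/set_operations.h>
#include <thrust/execution_policy.h>
#include <vector>

int main()
{
    // Two sorted input ranges, copied to the device
    std::vector<int> ha{1, 3, 5, 7};
    std::vector<int> hb{2, 3, 6};
    thrust::device_vector<int> a(ha.begin(), ha.end());
    thrust::device_vector<int> b(hb.begin(), hb.end());
    thrust::device_vector<int> c(a.size() + b.size());

    // Called from the host with the thrust::device policy:
    // Thrust launches the kernels and the union is computed on the GPU.
    auto end = thrust::set_union(thrust::device,
                                 a.begin(), a.end(),
                                 b.begin(), b.end(),
                                 c.begin());
    c.resize(end - c.begin());  // c now holds {1, 2, 3, 5, 6, 7}
    return 0;
}
```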

Questions

  1. What exactly happens when a Thrust function is called from a device function?
  2. If such a function runs on a single thread when called from the device side, how does it affect performance and parallelism?
  3. Are there any best practices or alternative approaches for using Thrust functions within device code?

Any insights or explanations would be greatly appreciated. I’m looking to deepen my understanding of CUDA and Thrust, particularly for complex parallel computing scenarios.

Thanks in advance!

Thrust has various mechanisms to dispatch work. Work launched from the host can be dispatched either to a host back-end or to a device back-end. Dispatching to the device back-end is (in my view) the “typical” usage of Thrust, and it behaves as you describe: Thrust launches the CUDA kernels, and the operations execute in parallel on the GPU.
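
A sketch of the two host-launched dispatch paths; the policy chosen at the call site picks the back-end (sizes here are arbitrary):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

int main()
{
    thrust::host_vector<int>   h(1000, 1);
    thrust::device_vector<int> d(1000, 1);

    // Host back-end: the sort runs on the CPU.
    thrust::sort(thrust::host, h.begin(), h.end());

    // Device back-end: Thrust launches GPU kernels to do the sort.
    thrust::sort(thrust::device, d.begin(), d.end());
    return 0;
}
```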

Thrust dispatch is mostly resolved at compile time. When you call Thrust functions from device code, the execution policy you use governs the behavior to a large degree.

If you use the `thrust::seq` execution policy, the entire operation executes from the point of view of a single thread. There is no inter-thread cooperation: each thread executes an entire instance of the function you called, working on a separate problem. This kind of work distribution can be useful for many small problems, but in the general case it’s typically not a very efficient use of the GPU, because the usual things we look for in GPU code, such as coalesced access, are not provided for. There is no parallel cooperation among threads, except at a very high level.
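
A common pattern along those lines looks roughly like this (the kernel name and segment layout are made up for illustration): each thread sequentially sorts its own small segment of a larger array.

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Each thread sorts one segment of `data` entirely on its own:
// no cooperation between threads, one whole sort per thread.
__global__ void sort_segments(int *data, int seg_len, int num_segs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_segs) {
        int *seg = data + i * seg_len;
        thrust::sort(thrust::seq, seg, seg + seg_len);
    }
}
```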

When you specify an execution policy of `thrust::device`, the method of dispatch might be as described above (as with `thrust::seq`), or it might be something else, such as leveraging CUDA Dynamic Parallelism (CDP). If Thrust uses CDP, each thread is still processing its own problem, but instead of doing so fully sequentially, Thrust may have that thread launch a child GPU kernel. One indicator of whether that may happen is whether your compilation environment supports CDP, among other factors.
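
In device code the call site looks much the same; whether Thrust falls back to sequential execution or launches a child kernel via CDP depends on the build (for example, compiling with relocatable device code, `nvcc -rdc=true`, on Thrust versions that support CDP dispatch). A sketch:

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

__global__ void sort_on_device(int *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // With thrust::device here, Thrust may launch a child kernel
        // (CUDA Dynamic Parallelism) if the build supports it, or it
        // may fall back to running sequentially in this one thread.
        thrust::sort(thrust::device, data, data + n);
    }
}
```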

This question has been asked elsewhere, and with a bit of searching you can find other material on the topic that may be of interest.

Thrust has been changing quite a bit in the last few years, so it’s possible that some of this material is dated.


Thanks a lot! This is a very detailed answer! Have a nice day!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.