Thrust `__host__` side and `__device__` side behavior


Hi everyone,

I’m working with CUDA and Thrust, and I have a question regarding the behavior of Thrust functions when they are called in different contexts. Specifically, I’m interested in understanding the differences when calling a Thrust function on the host (`__host__`) versus on the device (`__device__`), assuming both use `thrust::device` as the execution policy.

Example: `thrust::set_union`

Background

My current understanding is as follows:

  • On the Host Side: When called from the host, Thrust handles the CUDA kernel launching, and the operations are executed in parallel on the GPU.
  • On the Device Side: I’m unclear about the behavior here. If a Thrust function is called from a device function, what happens? Since the resources are allocated to the specific device thread, does this mean the function will only run on a single thread?
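
For concreteness, here is a minimal sketch of the kind of host-side call I mean (the data and sizes are just illustrative):

```cuda
#include <thrust/device_vector.h>
#include <thrust/set_operations.h>
#include <thrust/execution_policy.h>
#include <vector>

int main()
{
    // Two sorted input ranges, copied to the device
    std::vector<int> ha{1, 3, 5, 7};
    std::vector<int> hb{2, 3, 6};
    thrust::device_vector<int> a(ha.begin(), ha.end());
    thrust::device_vector<int> b(hb.begin(), hb.end());
    thrust::device_vector<int> c(a.size() + b.size());

    // Called from the host with the thrust::device policy:
    // Thrust launches the kernels and the union is computed on the GPU.
    auto end = thrust::set_union(thrust::device,
                                 a.begin(), a.end(),
                                 b.begin(), b.end(),
                                 c.begin());
    c.resize(end - c.begin());  // c now holds {1, 2, 3, 5, 6, 7}
    return 0;
}
```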

Questions

  1. What exactly happens when a Thrust function is called from a device function?
  2. If such a function runs on a single thread when called from the device side, how does it affect performance and parallelism?
  3. Are there any best practices or alternative approaches for using Thrust functions within device code?

Any insights or explanations would be greatly appreciated. I’m looking to deepen my understanding of CUDA and Thrust, particularly for complex parallel computing scenarios.

Thanks in advance!

Thrust has various mechanisms to dispatch work. Work launched from the host can be dispatched either to a host back-end or to a device back-end. Dispatching to the device back-end is (in my view) the “typical” usage of Thrust, and it behaves as you describe: Thrust launches the CUDA kernels, and the operations execute in parallel on the GPU.
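
A sketch of the two host-launched dispatch paths; the policy chosen at the call site picks the back-end (sizes here are arbitrary):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

int main()
{
    thrust::host_vector<int>   h(1000, 1);
    thrust::device_vector<int> d(1000, 1);

    // Host back-end: the sort runs on the CPU.
    thrust::sort(thrust::host, h.begin(), h.end());

    // Device back-end: Thrust launches GPU kernels to do the sort.
    thrust::sort(thrust::device, d.begin(), d.end());
    return 0;
}
```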

Thrust dispatch is mostly resolved at compile time. When you call Thrust functions from device code, the execution policy you use governs the behavior to a large degree.

If you use the `thrust::seq` execution policy, the entire operation executes from the point of view of a single thread. There is no inter-thread cooperation: each thread executes an entire instance of the function you called, working on a separate problem. This kind of work distribution can be useful for many small problems, but in the general case it’s typically not a very efficient use of the GPU, because the usual things we look for in GPU code, such as coalesced access, are not provided for. There is no parallel cooperation among threads, except at a very high level.
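
A common pattern along those lines looks roughly like this (the kernel name and segment layout are made up for illustration): each thread sequentially sorts its own small segment of a larger array.

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Each thread sorts one segment of `data` entirely on its own:
// no cooperation between threads, one whole sort per thread.
__global__ void sort_segments(int *data, int seg_len, int num_segs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_segs) {
        int *seg = data + i * seg_len;
        thrust::sort(thrust::seq, seg, seg + seg_len);
    }
}
```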

When you specify an execution policy of `thrust::device`, the method of dispatch might be as described above (as with `thrust::seq`), or it might be something else, such as leveraging CUDA Dynamic Parallelism (CDP). If Thrust uses CDP, each thread is still processing its own problem, but instead of doing so fully sequentially, Thrust may have that thread launch a child GPU kernel. One indicator of whether that may happen is whether your compilation environment supports CDP, among other factors.
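
In device code the call site looks much the same; whether Thrust falls back to sequential execution or launches a child kernel via CDP depends on the build (for example, compiling with relocatable device code, `nvcc -rdc=true`, on Thrust versions that support CDP dispatch). A sketch:

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

__global__ void sort_on_device(int *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // With thrust::device here, Thrust may launch a child kernel
        // (CUDA Dynamic Parallelism) if the build supports it, or it
        // may fall back to running sequentially in this one thread.
        thrust::sort(thrust::device, data, data + n);
    }
}
```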

This question has been asked elsewhere, and with a bit of searching you can find other material on the topic that may be of interest.

Thrust has been changing quite a bit in the last few years, so it’s possible that some of this material is dated.


Thanks a lot! This is a very detailed answer! Have a nice day!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.