If you run thrust::inclusive_scan() with device_vector arguments (or other suitable device arguments, such as device_ptr), the algorithm will run on the GPU.
You can profile the app if you are not the trusting type, and you will see kernels launched, etc.
And if you are a raving conspiracy theorist, shut down your PC, pull the GPU out, reboot, and run the same thrust device_vector app again. It will fail miserably.
Here’s what nvprof spits out for me on a thrust inclusive scan program:
$ cat t429.cu
#include <iostream>
#include <thrust/scan.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
  // data lives in device memory, so the scan dispatches to the GPU
  thrust::device_vector<float> vector(4);
  vector[0] = 1.0;
  vector[1] = -3.0;
  vector[2] = 4.0;
  vector[3] = 5.0;
  thrust::inclusive_scan(vector.begin(), vector.end(), vector.begin());
  // copy the results back to the host for printing
  thrust::host_vector<float> h_vector(vector.begin(), vector.end());
  for (size_t i = 0; i < 4; ++i) {
    std::cout << h_vector[i] << std::endl;
  }
  return 0;
}
$ nvprof ./t429
==8822== NVPROF is profiling process 8822, command: ./t429
1
-2
2
7
==8822== Profiling application: ./t429
==8822== Profiling result:
Time(%) Time Calls Avg Min Max Name
26.10% 6.4270us 1 6.4270us 6.4270us 6.4270us void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=224>>>(bool=1)
26.10% 6.4270us 1 6.4270us 6.4270us 6.4270us void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, float*, float*, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=224>>>(bool=1)
16.01% 3.9410us 1 3.9410us 3.9410us 3.9410us void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::commutative_reduce_intervals_closure<thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::system::cuda::detail::tag, thrust::use_default, thrust::use_default>>, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::blocked_thread_array>>(float)
15.86% 3.9040us 4 976ns 960ns 992ns [CUDA memcpy HtoD]
8.26% 2.0340us 1 2.0340us 2.0340us 2.0340us void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::for_each_n_detail::for_each_n_closure<thrust::device_ptr<__int64>, unsigned int, thrust::detail::device_generate_functor<thrust::detail::fill_functor<__int64>>, thrust::system::cuda::detail::detail::blocked_thread_array>>(__int64)
7.67% 1.8880us 1 1.8880us 1.8880us 1.8880us [CUDA memcpy DtoH]
==8822== API calls:
Time(%) Time Calls Avg Min Max Name
99.51% 100.37ms 2 50.185ms 9.0000us 100.36ms cudaMalloc
0.19% 193.00us 83 2.3250us 0ns 73.000us cuDeviceGetAttribute
0.10% 99.000us 2 49.500us 14.000us 85.000us cudaFree
0.06% 57.000us 5 11.400us 6.0000us 20.000us cudaMemcpy
0.05% 47.000us 4 11.750us 8.0000us 19.000us cudaLaunch
0.03% 28.000us 1 28.000us 28.000us 28.000us cuDeviceTotalMem
0.02% 25.000us 1 25.000us 25.000us 25.000us cudaGetDeviceProperties
0.02% 21.000us 1 21.000us 21.000us 21.000us cuDeviceGetName
0.01% 11.000us 2 5.5000us 4.0000us 7.0000us cudaFuncGetAttributes
0.01% 7.0000us 6 1.1660us 1.0000us 2.0000us cudaGetDevice
0.00% 4.0000us 4 1.0000us 1.0000us 1.0000us cudaConfigureCall
0.00% 2.0000us 2 1.0000us 0ns 2.0000us cuDeviceGet
0.00% 2.0000us 2 1.0000us 0ns 2.0000us cuDeviceGetCount
0.00% 2.0000us 4 500ns 0ns 1.0000us cudaSetupArgument
$
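The same GPU dispatch happens if you hand the algorithm raw device memory wrapped in thrust::device_ptr instead of a device_vector. Here is a minimal sketch of that variant (the data values and size here are just illustrative, and error checking is omitted for brevity):

#include <iostream>
#include <thrust/scan.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>

int main()
{
  const int N = 4;
  float h[N] = {1.0f, -3.0f, 4.0f, 5.0f};

  // allocate raw device memory and copy the input up
  float *d_raw = nullptr;
  cudaMalloc(&d_raw, N * sizeof(float));
  cudaMemcpy(d_raw, h, N * sizeof(float), cudaMemcpyHostToDevice);

  // wrapping the raw pointer in thrust::device_ptr tells thrust this is
  // device memory, so the scan dispatches to the GPU just as with device_vector
  thrust::device_ptr<float> d = thrust::device_pointer_cast(d_raw);
  thrust::inclusive_scan(d, d + N, d);

  // copy the scanned results back and print them
  cudaMemcpy(h, d_raw, N * sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < N; ++i) {
    std::cout << h[i] << std::endl;
  }
  cudaFree(d_raw);
  return 0;
}

Profiling this version shows the same kind of scan kernels in the nvprof output as the device_vector case above.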