How do I know if thrust::inclusive_scan is being run in parallel on the GPU?

Hey All,

So I’ve been looking through some of the Thrust documentation, and it claims the algorithms run in parallel.

But how do I actually know it’s running on my GPU and using all the threads it can? Can I really just trust Thrust, assuming I’m using a thrust::device_vector?

Basically, if I type thrust::inclusive_scan(), will it run on the GPU?

If you run thrust::inclusive_scan() with device_vector arguments (or other suitable device arguments, such as device_ptr), the algorithm will run on the GPU.
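For example, the device_ptr path might look like this (a minimal sketch of my own; the cudaMalloc/cudaMemcpy plumbing is just illustration). Wrapping a raw device pointer in thrust::device_ptr tags it with the device system, so the scan dispatches to the GPU exactly as it does for device_vector iterators:

#include <thrust/scan.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>

int main()
{
    float h[4] = {1.0f, -3.0f, 4.0f, 5.0f};
    float *d_raw = nullptr;
    cudaMalloc(&d_raw, 4 * sizeof(float));
    cudaMemcpy(d_raw, h, 4 * sizeof(float), cudaMemcpyHostToDevice);

    // device_ptr tags the raw pointer with Thrust's device system,
    // so this scan dispatches to the GPU, just as with device_vector
    thrust::device_ptr<float> d_begin(d_raw);
    thrust::inclusive_scan(d_begin, d_begin + 4, d_begin);

    cudaMemcpy(h, d_raw, 4 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_raw);
    return 0;
}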

You can profile the app if you are not the trusting type, and you will see kernels launched, etc.

And if you are a raving conspiracy theorist, then shut down your PC, pull the GPU out, reboot, and run the same Thrust device_vector app again. It will fail miserably.

Here’s what nvprof spits out for me on a Thrust inclusive-scan program:

$ cat t429.cu
#include <iostream>
#include <thrust/scan.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
    thrust::device_vector<float> vector(4);
    // each element assignment below is a separate host-to-device transfer
    vector[0] = 1.0;
    vector[1] = -3.0;
    vector[2] = 4.0;
    vector[3] = 5.0;

    // in-place inclusive scan, dispatched to the GPU by the device iterators
    thrust::inclusive_scan(vector.begin(), vector.end(), vector.begin());
    // copying into a host_vector brings the results back to the host
    thrust::host_vector<float> h_vector(vector.begin(), vector.end());

    for (size_t i = 0; i < 4; ++i) {
        std::cout << h_vector[i] << std::endl;
    }
    return 0;
}
$ nvprof ./t429
==8822== NVPROF is profiling process 8822, command: ./t429
1
-2
2
7
==8822== Profiling application: ./t429
==8822== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 26.10%  6.4270us         1  6.4270us  6.4270us  6.4270us  void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=224>>>(bool=1)
 26.10%  6.4270us         1  6.4270us  6.4270us  6.4270us  void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::detail::fast_scan::fast_scan_detail::downsweep_intervals_closure<bool=1, float*, float*, float, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::statically_blocked_thread_array<unsigned int=224>>>(bool=1)
 16.01%  3.9410us         1  3.9410us  3.9410us  3.9410us  void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::commutative_reduce_intervals_closure<thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::system::cuda::detail::tag, thrust::use_default, thrust::use_default>>, thrust::plus<float>, thrust::system::detail::internal::uniform_decomposition<unsigned int>, thrust::system::cuda::detail::detail::blocked_thread_array>>(float)
 15.86%  3.9040us         4     976ns     960ns     992ns  [CUDA memcpy HtoD]
  8.26%  2.0340us         1  2.0340us  2.0340us  2.0340us  void thrust::system::cuda::detail::detail::launch_closure_by_value<thrust::system::cuda::detail::for_each_n_detail::for_each_n_closure<thrust::device_ptr<__int64>, unsigned int, thrust::detail::device_generate_functor<thrust::detail::fill_functor<__int64>>, thrust::system::cuda::detail::detail::blocked_thread_array>>(__int64)
  7.67%  1.8880us         1  1.8880us  1.8880us  1.8880us  [CUDA memcpy DtoH]

==8822== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.51%  100.37ms         2  50.185ms  9.0000us  100.36ms  cudaMalloc
  0.19%  193.00us        83  2.3250us       0ns  73.000us  cuDeviceGetAttribute
  0.10%  99.000us         2  49.500us  14.000us  85.000us  cudaFree
  0.06%  57.000us         5  11.400us  6.0000us  20.000us  cudaMemcpy
  0.05%  47.000us         4  11.750us  8.0000us  19.000us  cudaLaunch
  0.03%  28.000us         1  28.000us  28.000us  28.000us  cuDeviceTotalMem
  0.02%  25.000us         1  25.000us  25.000us  25.000us  cudaGetDeviceProperties
  0.02%  21.000us         1  21.000us  21.000us  21.000us  cuDeviceGetName
  0.01%  11.000us         2  5.5000us  4.0000us  7.0000us  cudaFuncGetAttributes
  0.01%  7.0000us         6  1.1660us  1.0000us  2.0000us  cudaGetDevice
  0.00%  4.0000us         4  1.0000us  1.0000us  1.0000us  cudaConfigureCall
  0.00%  2.0000us         2  1.0000us       0ns  2.0000us  cuDeviceGet
  0.00%  2.0000us         2  1.0000us       0ns  2.0000us  cuDeviceGetCount
  0.00%  2.0000us         4     500ns       0ns  1.0000us  cudaSetupArgument
$

Thank you :)

I just wanted to make absolutely sure, is all. Reading the GitHub documentation, I did find that there are two implementations of the Thrust library, host and device versions, and that using host or device iterators launches the appropriate functions, so thank you.
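For anyone who lands here later, here’s a minimal sketch of that dispatch rule as I understand it (my own untested snippet; thrust::host and thrust::device are the execution policies from <thrust/execution_policy.h>):

#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
    thrust::host_vector<float>   h(4, 1.0f);
    thrust::device_vector<float> d(4, 1.0f);

    // dispatch by iterator type: host iterators run on the CPU,
    // device iterators run on the GPU
    thrust::inclusive_scan(h.begin(), h.end(), h.begin());
    thrust::inclusive_scan(d.begin(), d.end(), d.begin());

    // or state the system explicitly with an execution policy
    thrust::inclusive_scan(thrust::host,   h.begin(), h.end(), h.begin());
    thrust::inclusive_scan(thrust::device, d.begin(), d.end(), d.begin());
    return 0;
}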

I also completely forgot about nvprof lol.