Thrust prefix sum slower than CPU?

I am using Thrust to calculate prefix sum on the GPU.

The function prefix_sum calculates the prefix sum on the GPU and the function foo calculates it on the CPU.

This is my code (note that I am also performing some additional pre-processing before the exclusive scan):

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cuda_runtime.h>
#include "device_launch_parameters.h"
#include <assert.h>
#include <chrono>

template <typename T, typename C>
__global__
void sub(T* output, const C* starter, const C* stopper, int64_t startsoffset, int64_t stopsoffset, int64_t n) {
  int thid = threadIdx.x + blockIdx.x * blockDim.x;
  if (thid < n) {
    C start = starter[thid + startsoffset];
    C stop = stopper[thid + stopsoffset];
    assert(start <= stop);
    output[thid] = stop - start;
  }
}

template <typename T, typename C>
void prefix_sum(T* output, const C* arr, const C* arr2, int64_t startsoffset, int64_t stopsoffset, int64_t length) {
  int block, thread;
  if (length > 1024) {
    block = (length / 1024) + 1;
    thread = 1024;
  }
  else {
    thread = length;
    block = 1;
  }
  T* d_output;
  C* d_arr, * d_arr2;
  cudaMalloc((void**)&d_output, length * sizeof(T));
  cudaMalloc((void**)&d_arr, length * sizeof(C));
  cudaMemcpy(d_arr, arr, length * sizeof(C), cudaMemcpyHostToDevice);
  cudaMalloc((void**)&d_arr2, length * sizeof(C));
  cudaMemcpy(d_arr2, arr2, length * sizeof(C), cudaMemcpyHostToDevice);
  sub<T, C><<<block, thread>>>(d_output, d_arr, d_arr2, startsoffset, stopsoffset, length);
  cudaDeviceSynchronize();
  thrust::device_vector<T> data(d_output, d_output + length);
  thrust::device_vector<T> temp(data.size() + 1);
  thrust::exclusive_scan(data.begin(), data.end(), temp.begin());
  temp[data.size()] = data.back() + temp[data.size() - 1];
  thrust::copy(temp.begin(), temp.end(), output);
  cudaFree(d_output);
  cudaFree(d_arr);
  cudaFree(d_arr2);
}

template <typename C, typename T>
void foo(T* tooffsets, const C* fromstarts, const C* fromstops, int64_t startsoffset, int64_t stopsoffset, int64_t length) {
  tooffsets[0] = 0;
  for (int64_t i = 0; i < length; i++) {
    C start = fromstarts[startsoffset + i];
    C stop = fromstops[stopsoffset + i];
    assert(start <= stop);
    tooffsets[i + 1] = tooffsets[i] + (stop - start);
  }
}

int main() {
  int const size = 70000;
  int starter[size], stopper[size], output[size + 1];
  for (int i = 0; i < size; i++) {
    starter[i] = i;
    stopper[i] = i + 1;
  }
  prefix_sum<int, int>(output, starter, stopper, 0, 0, size); // GPU Warm up
  auto start2 = std::chrono::high_resolution_clock::now();
  foo<int, int>(output, starter, stopper, 0, 0, size);
  auto stop2 = std::chrono::high_resolution_clock::now();
  auto time2 = std::chrono::duration_cast<std::chrono::microseconds>(stop2 - start2);
  std::cout << "Time taken for CPU = " << time2.count() << "\n";
  auto start1 = std::chrono::high_resolution_clock::now();
  prefix_sum<int, int>(output, starter, stopper, 0, 0, size);
  auto stop1 = std::chrono::high_resolution_clock::now();
  auto time1 = std::chrono::duration_cast<std::chrono::microseconds>(stop1 - start1);
  std::cout << "Time taken for GPU = " << time1.count() << "\n";
}

My results (in microseconds) are:

  • Time taken for CPU = 69
  • Time taken for GPU = 2140

Also, I am timing with the CPU timer (std::chrono) instead of the GPU timer (CUDA events) because I feel the additional non-CUDA operations should also be timed, considering that I am comparing the overall speed of two different approaches.
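
For reference, the event-based timing I decided against would only cover the device-side work and would look roughly like this. This is just a sketch reusing the device pointers and launch configuration from prefix_sum; ev_start/ev_stop are placeholder names:

// Sketch: time only the kernel with CUDA events, independent of the host clock.
cudaEvent_t ev_start, ev_stop;
cudaEventCreate(&ev_start);
cudaEventCreate(&ev_stop);

cudaEventRecord(ev_start);
sub<T, C><<<block, thread>>>(d_output, d_arr, d_arr2, startsoffset, stopsoffset, length);
cudaEventRecord(ev_stop);

cudaEventSynchronize(ev_stop);                 // wait until the stop event has completed
float ms = 0.0f;
cudaEventElapsedTime(&ms, ev_start, ev_stop);  // elapsed time in milliseconds
std::cout << "Kernel time (ms) = " << ms << "\n";

cudaEventDestroy(ev_start);
cudaEventDestroy(ev_stop);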

I am deliberately not averaging over several runs, because the gap between the CPU and GPU times was similar every time.

I am not compiling in Debug mode.

cuda-memcheck does not yield any errors.

What could be happening, especially since the GPU time is so much higher than the CPU time?

If you want to use chrono timers, that is fine, but make sure you use cudaDeviceSynchronize() before you start your GPU timing, and right before you end the timing.

With the current setup, you are also counting the time to establish the CUDA context on the device, among other one-time costs, which can be significant.
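
Concretely, the suggested pattern looks something like this (a sketch using the names from your main):

// Sketch: bracket the GPU call with synchronization so the timer covers only this call.
cudaDeviceSynchronize();  // make sure warm-up / context-creation work has finished
auto start1 = std::chrono::high_resolution_clock::now();

prefix_sum<int, int>(output, starter, stopper, 0, 0, size);

cudaDeviceSynchronize();  // make sure all queued GPU work has finished before stopping the clock
auto stop1 = std::chrono::high_resolution_clock::now();
auto time1 = std::chrono::duration_cast<std::chrono::microseconds>(stop1 - start1);
std::cout << "Time taken for GPU = " << time1.count() << "\n";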

Did not make a difference in times.

Increased the GPU time a little.

(Yes, I did both at the same time.)

In the future, providing your software and hardware stack would allow for more insight into your issue.

  1. Your data set is too small to fully utilize a modern GPU; at 70,000 ints it is only about 280 KB per input array. Depending on the resources of your GPU and CPU, it is easily the case that your CPU (especially with -O3) finishes first.

  2. It makes sense to add cudaDeviceSynchronize() after your kernel. That stops the host code from proceeding until all GPU tasks have finished; otherwise the host returns immediately after the kernel is launched, because kernel launches are asynchronous.

  3. While I understand that you want to compare apples-to-apples by including all memory management in your timing, that is not the suggested way to utilize a GPU. For best performance, perform memory allocation and transfers once at the initialization of your program, do all of the compute on the GPU if possible, and then transfer any results back. This may not be ideal for large, complex applications, but it is the model to aim for (see the sketch after the code below).

  4. On my laptop, with an i7-8750H and an RTX 2080 Max-Q, I get the following results.

size = 70000

Time taken for CPU = 308
Time taken for GPU = 363

size = 1000000

Time taken for CPU = 4377
Time taken for GPU = 2685
  5. Now if I look at the compute independently, you can see that it is only a fraction of the total GPU time and nearly 40x faster than the CPU.

size = 1000000

Time taken for CPU = 4344
Time taken for Thrust = 113
Time taken for GPU = 2721
  6. Even if we turn on optimization on the CPU with -O3, the GPU compute is still about 12x faster.

size = 1000000

Time taken for CPU = 1367
Time taken for Thrust = 113
Time taken for GPU = 2682
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cuda_runtime.h>
#include "device_launch_parameters.h"
#include <cassert>
#include <chrono>
#include <vector> 

template <typename T, typename C>
__global__
void sub(T* output, const C* starter, const C* stopper, int64_t startsoffset, int64_t stopsoffset, int64_t n) {
  int thid = threadIdx.x + blockIdx.x * blockDim.x;
  if (thid < n) {
    C start = starter[thid + startsoffset];
    C stop = stopper[thid + stopsoffset];
    assert(start <= stop);
    output[thid] = stop - start;
  }
}

template <typename T, typename C>
void prefix_sum(T* output, const C* arr, const C* arr2, int64_t startsoffset, int64_t stopsoffset, int64_t length) {
  int block, thread;
  if (length > 1024) {
    block = (length / 1024) + 1;
    thread = 1024;
  }
  else {
    thread = length;
    block = 1;
  }
  T* d_output;
  C* d_arr, * d_arr2;
  cudaMalloc((void**)&d_output, length * sizeof(T));
  cudaMalloc((void**)&d_arr, length * sizeof(C));
  cudaMemcpy(d_arr, arr, length * sizeof(C), cudaMemcpyHostToDevice);
  cudaMalloc((void**)&d_arr2, length * sizeof(C));
  cudaMemcpy(d_arr2, arr2, length * sizeof(C), cudaMemcpyHostToDevice);
  
  auto start1 = std::chrono::high_resolution_clock::now();
  sub<T, C><<<block, thread>>>(d_output, d_arr, d_arr2, startsoffset, stopsoffset, length);
  cudaDeviceSynchronize();
  auto stop1 = std::chrono::high_resolution_clock::now();
  auto time1 = std::chrono::duration_cast<std::chrono::microseconds>(stop1 - start1);
  std::cout << "Time taken for Thrust = " << time1.count() << "\n";
  
  thrust::device_vector<T> data(d_output, d_output + length);
  thrust::device_vector<T> temp(data.size() + 1);
  thrust::exclusive_scan(data.begin(), data.end(), temp.begin());
  temp[data.size()] = data.back() + temp[data.size() - 1];
  thrust::copy(temp.begin(), temp.end(), output);
  cudaFree(d_output);
  cudaFree(d_arr);
  cudaFree(d_arr2);
}

template <typename C, typename T>
void foo(T* tooffsets, const C* fromstarts, const C* fromstops, int64_t startsoffset, int64_t stopsoffset, int64_t length) {
  tooffsets[0] = 0;
  for (int64_t i = 0; i < length; i++) {
    C start = fromstarts[startsoffset + i];
    C stop = fromstops[stopsoffset + i];
    assert(start <= stop);
    tooffsets[i + 1] = tooffsets[i] + (stop - start);
  }
}

int main() {
  int const size = 1000000;
//   int starter[size], stopper[size], output[size + 1];
  std::vector<int> starter (size);
  std::vector<int> stopper (size);
  std::vector<int> output (size + 1);
  
  for (int i = 0; i < size; i++) {
    starter[i] = i;
    stopper[i] = i + 1;
  }
  prefix_sum<int>(output.data(), starter.data(), stopper.data(), 0, 0, size); // GPU Warm up
  auto start2 = std::chrono::high_resolution_clock::now();
  foo<int, int>(output.data(), starter.data(), stopper.data(), 0, 0, size);
  auto stop2 = std::chrono::high_resolution_clock::now();
  auto time2 = std::chrono::duration_cast<std::chrono::microseconds>(stop2 - start2);
  std::cout << "Time taken for CPU = " << time2.count() << "\n";
  auto start1 = std::chrono::high_resolution_clock::now();
  prefix_sum<int>(output.data(), starter.data(), stopper.data(), 0, 0, size);
  auto stop1 = std::chrono::high_resolution_clock::now();
  auto time1 = std::chrono::duration_cast<std::chrono::microseconds>(stop1 - start1);
  std::cout << "Time taken for GPU = " << time1.count() << "\n";
}
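
Regarding point 3, here is a rough, untested sketch of how the device-side part of prefix_sum could stay on the GPU end to end, with a single copy of the finished offsets back to the host. It reuses d_output, output, length and T from prefix_sum; d_offsets is a new name, and it assumes an additional #include <thrust/execution_policy.h> for thrust::device and an integral T so the cudaMemset zero-fill is valid:

// Sketch: scan the buffer written by the sub kernel directly on the device.
T* d_offsets;
cudaMalloc((void**)&d_offsets, (length + 1) * sizeof(T));
cudaMemset(d_offsets, 0, sizeof(T));  // offsets[0] = 0

// An inclusive scan written one slot ahead gives offsets[i + 1] = sum of the first
// i + 1 differences, i.e. the same result as the exclusive scan plus the appended
// total, without the extra device_vector copy or per-element host accesses.
thrust::inclusive_scan(thrust::device, d_output, d_output + length, d_offsets + 1);

cudaMemcpy(output, d_offsets, (length + 1) * sizeof(T), cudaMemcpyDeviceToHost);
cudaFree(d_offsets);

With the input allocations and host-to-device copies done once at initialization, outside the timed region, the timed portion is just the kernel and the scan.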

I would highly suggest you look at our Fundamentals of Accelerated Computing with CUDA C/C++ course (https://courses.nvidia.com/courses/course-v1:DLI+C-AC-01+V1/about). It covers a lot of the basics and gives you an understanding of how the GPU works.