Why is the Thrust transform function so slow?

I have done some testing by simply adding two large vectors using different methods: looping over each index on the CPU, Thrust device_vector + thrust::transform, a self-written kernelAdd function, and the CUBLAS axpy function.

I have found that using thrust::transform is really slow. Here are the results; I list only the GPU calculation time, not the CPU<->GPU data transfer time.

vector size 1,000,000:
using CPU looping: 17 ms
using Thrust device_vector and transform function with axpy functor (GPU): 65 ms (calculation time on the GPU)
using self-written kernelAdd function: < 1 ms
using CUBLAS axpy function (GPU): < 1 ms

vector size 4,000,000:
using CPU looping: 70 ms
using Thrust library transform function with axpy functor (GPU): 252 ms
using self-written kernelAdd function: < 1 ms
using CUBLAS axpy function (GPU): < 1 ms
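The kernelAdd and CUBLAS paths are not shown in this post; for reference, a minimal sketch of what they typically look like (the kernel name, launch configuration, and alpha value here are assumptions, not the code that produced the numbers above):

#include <cublas_v2.h>

// Hypothetical hand-written element-wise add: Y[i] += X[i]
__global__ void kernelAdd(const double *X, double *Y, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) Y[i] += X[i];
}

// Launch (dX, dY are device pointers holding COUNT doubles):
//   kernelAdd<<<(COUNT + 255) / 256, 256>>>(dX, dY, COUNT);

// CUBLAS equivalent, Y = alpha * X + Y with alpha = 1.0:
//   cublasHandle_t handle;
//   cublasCreate(&handle);
//   const double alpha = 1.0;
//   cublasDaxpy(handle, COUNT, &alpha, dX, 1, dY, 1);
//   cublasDestroy(handle);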

Following is the testing code for Thrust:
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <vector>
#include <iostream>
#include <ctime>
#include <cstdlib>

#define COUNT 4000000

int main()
{
  std::vector<double> X(COUNT);
  std::vector<double> Y(COUNT);
  std::vector<double> Z(COUNT);
  for (int i = 0; i < COUNT; i++)
  {
    X[i] = (double) ((rand() % COUNT) - 5000);
    Y[i] = (double) ((rand() % COUNT) - 5000);
  }

  for (int i = 0; i < 2; i++)
  {
    // clock() returns ticks; with CLOCKS_PER_SEC == 1000 (e.g. MSVC) the differences below are ms
    long t = clock();
    thrust::device_vector<double> dX = X;   // host -> device copy
    thrust::device_vector<double> dY = Y;
    std::cout << "copy data from CPU to GPU " << clock() - t << " ms" << std::endl;
    t = clock();
    // element-wise dY = dX + dY on the GPU
    thrust::transform(dX.begin(), dX.end(), dY.begin(), dY.begin(), thrust::plus<double>());
    std::cout << "Calculation using Thrust device_vector (GPU): " << clock() - t << " ms" << std::endl;
    t = clock();
    thrust::copy(dY.begin(), dY.end(), Z.begin());  // device -> host copy
    std::cout << "copy result back to CPU: " << clock() - t << " ms" << std::endl;
  }
  return 0;
}

Has anyone experienced the same problem?

By the way, I’m using CUDA 6.5, the latest release.

When I run this code:

#include <time.h>
#include <sys/time.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <vector>
#include <iostream>
#include <cstdlib>
#define COUNT 4000000
#define MS_PER_SEC 1000.0f

int main(){
  std::vector<double> X(COUNT);
  std::vector<double> Y(COUNT);
  std::vector<double> Z(COUNT);
  for (int i = 0; i < COUNT; i++)
  {
    X[i] = (double) ((rand() % COUNT) - 5000);
    Y[i] = (double) ((rand() % COUNT) - 5000);
  }

  for (int i = 0; i < 2; i++)
  {
    long t = clock();
    thrust::device_vector<double> dX = X;
    thrust::device_vector<double> dY = Y;
    std::cout << "copy data from CPU to GPU " << ((clock() - t)*MS_PER_SEC)/CLOCKS_PER_SEC << " ms" << std::endl;
    t = clock();
    thrust::transform(dX.begin(), dX.end(), dY.begin(), dY.begin(), thrust::plus<double>());
    std::cout << "Calculation using Thrust device_vector (GPU): " << ((clock() - t)*MS_PER_SEC)/CLOCKS_PER_SEC << " ms" << std::endl;
    t = clock();
    thrust::copy(dY.begin(), dY.end(), Z.begin());
    std::cout << "copy result back to CPU: " << ((clock() - t)*MS_PER_SEC)/CLOCKS_PER_SEC << " ms" << std::endl;
  
  return 0;
}

on CUDA 6.5, RHEL 5.5, Quadro5000 GPU, I get the following results:

copy data from CPU to GPU 3580 ms
Calculation using Thrust device_vector (GPU): 0 ms
copy result back to CPU: 20 ms
copy data from CPU to GPU 20 ms
Calculation using Thrust device_vector (GPU): 0 ms
copy result back to CPU: 10 ms
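Note that the first iteration's copy time above includes one-time CUDA context initialization, which is why it is so much larger than on the second pass. To isolate just the device-side transform rather than host-side clock() intervals, CUDA events are a common alternative; a minimal, self-contained sketch (vector size and fill values here are placeholders):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <cuda_runtime.h>
#include <iostream>

int main()
{
  const int count = 4000000;
  thrust::device_vector<double> dX(count, 1.0);
  thrust::device_vector<double> dY(count, 2.0);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  // element-wise dY = dX + dY on the GPU
  thrust::transform(dX.begin(), dX.end(), dY.begin(), dY.begin(),
                    thrust::plus<double>());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);   // wait until the transform kernel has finished

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  std::cout << "transform kernel time: " << ms << " ms" << std::endl;

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}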

txbob,

I’ve found that if I run the release version, I get the results you were getting. I was running the debug version earlier. I know the release version is faster than the debug version, but why is the difference so huge with Thrust? I don’t see that much of a difference with the other methods, such as the CUBLAS axpy function or my own kernelAdd function.

Here is what I get when I compile with:

nvcc -arch=sm_20 -G -g -o t58 t58.cu

copy data from CPU to GPU 4450 ms
Calculation using Thrust device_vector (GPU): 0 ms
copy result back to CPU: 90 ms
copy data from CPU to GPU 30 ms
Calculation using Thrust device_vector (GPU): 0 ms
copy result back to CPU: 100 ms

So I don’t see much difference in the computation.

It should be noted that, although the situation is improving with each new release, there may still be issues when running Thrust code that is compiled with -G:

https://github.com/thrust/thrust/wiki/Debugging

Compiling with -G can certainly have an effect on GPU computation time, because it disables most optimizations. In the case of a CUBLAS method, however, you are calling into a library routine where most of the work is done, and compiling with -G has no effect on the functions provided by the CUBLAS library (which are already compiled).
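For comparison, a typical release-style build simply drops -G (device-code optimizations are enabled by default); the -O3 here is optional and only affects host-code optimization:

nvcc -O3 -arch=sm_20 -o t58 t58.cu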

I probably won’t be able to answer your question definitively. Thrust is a header/template library, and so it tends to include a lot of boilerplate code, some of which is optimized out by the compiler. When you disable those optimizations with -G, the effect is probably bigger than on a hand-written kernel that is already quite simple.
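For context, the "axpy functor" mentioned in the results was not posted; it would typically look something like the following (functor name and alpha value are assumed). With optimizations enabled, the operator() body is typically inlined into the kernel that thrust::transform generates, which is exactly the kind of code that -G leaves unoptimized:

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Hypothetical axpy functor: y = a * x + y
struct axpy_functor
{
  const double a;
  axpy_functor(double a_) : a(a_) {}

  __host__ __device__
  double operator()(const double &x, const double &y) const
  {
    return a * x + y;
  }
};

int main()
{
  thrust::device_vector<double> dX(4000000, 1.0);
  thrust::device_vector<double> dY(4000000, 2.0);

  // dY = a * dX + dY, computed in one kernel on the GPU
  thrust::transform(dX.begin(), dX.end(), dY.begin(), dY.begin(),
                    axpy_functor(2.0));
  return 0;
}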