Help requested for a three-vector thrust::transform

Hi all,

I am trying to do something in Thrust akin to s = x*y + z, where all vectors are the same size and all operations are done element-wise. I understand how to program it via two separate thrust::transform() calls, but I am looking to reduce it to a single transform call to try to use the fused multiply-add (FMA) capability of the GPU hardware.

I’ve scrounged around on the internet and it seems like it would need some combination of zip_iterators and transform_iterators to get this to work, but the specifics appear to be eluding me.

Any help would be greatly appreciated.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/complex.h>
#include <chrono>

#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>

using namespace thrust::placeholders;
using namespace thrust;

#include <iostream>
typedef thrust::tuple<thrust::complex<float> const, thrust::complex<float> const, thrust::complex<float> const> triplet;

constexpr int repeat = 100000000;

int zip() {
  thrust::device_vector<thrust::complex<float>>
    d_vec1(repeat, thrust::complex<float>(1.0f, 1.0f)),
    d_vec2(repeat, thrust::complex<float>(1.0f, 1.0f)),
    d_vec3(repeat, thrust::complex<float>(1.0f, 1.0f)),
    d_vec4(repeat, thrust::complex<float>(1.0f, 1.0f));

  std::chrono::time_point<std::chrono::high_resolution_clock> t0 = std::chrono::high_resolution_clock::now();

  auto start_zip = thrust::make_zip_iterator(thrust::make_tuple(d_vec1.begin(), d_vec2.begin(), d_vec3.begin()));

  auto end_zip = thrust::make_zip_iterator(thrust::make_tuple(d_vec1.end(), d_vec2.end(), d_vec3.end()));

  // auto thisf = thrust::make_transform_iterator(start_zip, end_zip, func());

  thrust::transform(start_zip, end_zip, d_vec4, [=] __device__ (triplet t) {
    return thrust::get<0>(t) * thrust::get<1>(t) + thrust::get<2>(t);
  } );

  std::chrono::time_point<std::chrono::high_resolution_clock> t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}


int two_calls(){
  thrust::device_vector<thrust::complex<float>> 
    d_vec1(repeat,thrust::complex<float> (10.0, 2.0)), 
    d_vec2(repeat,thrust::complex<float> (10.0, 2.0)), 
    d_vec3(repeat,thrust::complex<float> (10.0, 2.0)),
    d_vec4(repeat,thrust::complex<float> (10.0, 2.0)); 

  std::chrono::time_point<std::chrono::high_resolution_clock> t0 = std::chrono::high_resolution_clock::now();

  thrust::transform(d_vec1.begin(), d_vec1.begin(), d_vec2.begin(), d_vec2.begin(), thrust::multiplies<complex<float>>());

  thrust::transform(d_vec2.begin(), d_vec2.begin(), d_vec3.begin(), d_vec4.begin(), thrust::plus<complex<float>>());

  std::chrono::time_point<std::chrono::high_resolution_clock> t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();

}

int main(){
    
    // warm-up: run zip() 5 times before the timed runs
    for(int i = 0; i < 5; i++){
        zip();
    }

  std::cout << "zip:" <<  zip() << "ms" << std::endl;
  std::cout << "two_calls:" << two_calls() << "ms" << std::endl;

}


The only problem I see in your code is here:

thrust::transform(start_zip, end_zip, d_vec4, [=] __device__ (triplet t) {
                                         ^^^

thrust::transform expects an iterator in that position. But d_vec4 is the name of a container. If you switch that to d_vec4.begin() you will be off to the races.

Also, don’t forget to compile your code with --extended-lambda
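
That is, the corrected call would look like this (a minimal sketch, using the same start_zip/end_zip iterators and triplet typedef from your code):

thrust::transform(start_zip, end_zip, d_vec4.begin(), [=] __device__ (triplet t) {
  return thrust::get<0>(t) * thrust::get<1>(t) + thrust::get<2>(t);
} );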

Thanks Robert for the solution and alternative ideas.
I like the idea of using _1*_2 + _3. I think that would make the most readable code. Is there a way to write it concisely like that in the end, or do you have to define a lambda for this type of math?

Sorry, I removed that code. I made an error. I suggest using the lambda method or an “ordinary” functor method.
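
For reference, a minimal sketch of the functor method, equivalent to the lambda (and it does not require the --extended-lambda flag):

struct my_func {
  template <typename T>
  __host__ __device__
  thrust::complex<float> operator()(T t){
    // element-wise x*y + z on the zipped triplet
    return thrust::get<0>(t) * thrust::get<1>(t) + thrust::get<2>(t);
  }
};

// used the same way as the lambda:
thrust::transform(start_zip, end_zip, d_vec4.begin(), my_func());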


Ok no worries. I was hoping for something nice and concise but a lambda works.

Would that type of placeholder math (or a three-input-vector transform) be something that a future Thrust release could include?


Hi Robert,

Based on the timing for repeat = 1000000, I get this result:

zip:49ms
two_calls:25ms

Based on your experience with CUDA, does it make sense that two separate thrust calls would be faster than a zipped tuple, or is there a flag I’m missing that would help?

You have various errors in your two_calls function; the thrust::transform calls are not constructed correctly.

Before attempting to confirm a performance comparison between two methods, first establish equivalence of the results.

When doing GPU timing of possibly asynchronous calls, it’s good practice to put a cudaDeviceSynchronize() before closing a host-based timing region (see the sketch below).

Make sure you are not compiling with -G when doing perf analysis.
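
In sketch form, the timed region should look something like this (a minimal sketch; the full code below does the same thing):

auto t0 = std::chrono::high_resolution_clock::now();
// ... thrust::transform call(s): kernel launches are asynchronous ...
cudaDeviceSynchronize();  // wait for device work to finish before reading the host clock
auto t1 = std::chrono::high_resolution_clock::now();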

When I make the following changes to your code and run it on my V100, I get a result that shows that the two calls method is slower than the zip method:

$ cat t2070.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/complex.h>
#include <chrono>

#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>

using namespace thrust::placeholders;
using namespace thrust;
using mt = thrust::complex<float>;

struct my_func {
  template <typename T>
  __host__ __device__
  mt operator()(T t){
    return thrust::get<0>(t) * thrust::get<1>(t) + thrust::get<2>(t);}
};

#include <iostream>
typedef thrust::tuple<mt const, mt const, mt const> triplet;

constexpr int repeat = 100000000;

int zip() {
  thrust::device_vector<mt>
    d_vec1(repeat,mt (10.0, 2.0)),
    d_vec2(repeat,mt (10.0, 2.0)),
    d_vec3(repeat,mt (10.0, 2.0)),
    d_vec4(repeat,mt (10.0, 2.0));

  std::chrono::time_point<std::chrono::high_resolution_clock> t0 = std::chrono::high_resolution_clock::now();

  auto start_zip = thrust::make_zip_iterator(thrust::make_tuple(d_vec1.begin(), d_vec2.begin(), d_vec3.begin()));

  auto end_zip = thrust::make_zip_iterator(thrust::make_tuple(d_vec1.end(), d_vec2.end(), d_vec3.end()));

  // auto thisf = thrust::make_transform_iterator(start_zip, end_zip, func());
#ifdef USE_LAMBDA
  thrust::transform(start_zip, end_zip, d_vec4.begin(), [=] __device__ (triplet t) {
    return thrust::get<0>(t) * thrust::get<1>(t) + thrust::get<2>(t);
  } );
#else
  thrust::transform(start_zip, end_zip, d_vec4.begin(), my_func());
#endif
  cudaDeviceSynchronize();
  std::chrono::time_point<std::chrono::high_resolution_clock> t1 = std::chrono::high_resolution_clock::now();
  mt r = d_vec4[0];
  std::cout << "zip: " << r.real() << std::endl;
  return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}


int two_calls(){
  thrust::device_vector<mt>
    d_vec1(repeat,mt (10.0, 2.0)),
    d_vec2(repeat,mt (10.0, 2.0)),
    d_vec3(repeat,mt (10.0, 2.0)),
    d_vec4(repeat,mt (10.0, 2.0));

  std::chrono::time_point<std::chrono::high_resolution_clock> t0 = std::chrono::high_resolution_clock::now();

  thrust::transform(d_vec1.begin(), d_vec1.end(), d_vec2.begin(), d_vec2.begin(), thrust::multiplies<mt>());

  thrust::transform(d_vec2.begin(), d_vec2.end(), d_vec3.begin(), d_vec4.begin(), thrust::plus<mt>());
  cudaDeviceSynchronize();
  std::chrono::time_point<std::chrono::high_resolution_clock> t1 = std::chrono::high_resolution_clock::now();
  mt r = d_vec4[0];
  std::cout << "two calls: " << r.real() << std::endl;
  return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();

}

int main(){

  std::cout << "zip:" <<  zip() << "ms" << std::endl;
  std::cout << "two_calls:" << two_calls() << "ms" << std::endl;

}
$ nvcc -o t2070 t2070.cu -DUSE_LAMBDA --extended-lambda
$ ./t2070
zip:zip: 106
4ms
two_calls:two calls: 106
6ms
$

Thank you Robert for the help. I did catch the wrong iterator (begin() instead of end()) after I posted but didn’t update the post. Moving forward, I’ll update the posts. Sorry about that.

Thanks for the catch about -G. I didn’t realize the performance penalty it would cause. Removing -G on my end showed comparative results similar to yours when using your posted code (zip being faster; specifics below).

As for playing around with lambda vs. my_func(), on my end they performed the same over multiple runs, so I guess they both compile down to the same instructions. Neat.

Tesla K80

[cat159@node69 testing-zip]$ nvcc -std=c++11 alt.cu --expt-extended-lambda -DUSE_LAMBDA
[cat159@node69 testing-zip]$ ./a.out 
zip: 106
zip:24ms
two calls: 106
two_calls:29ms

NVIDIA GeForce RTX 2070 with Max-Q Design (shared laptop so maybe someone is running something in the background)

➜  nvcc -std=c++11 alt.cu --expt-extended-lambda -DUSE_LAMBDA
➜  ./a.out                                      
zip:zip: 106
205ms
two_calls:two calls: 106
369ms

This is the output I get when I run the same test compiling with -G:

$ nvcc -o t2070 t2070.cu -DUSE_LAMBDA --extended-lambda -G
$ ./t2070
zip:zip: 106
2226ms
two_calls:two calls: 106
1127ms
$

The zip/lambda duration increases by a factor of ~500x (4 ms → 2226 ms), while the two-calls duration changes by a noticeably different factor (~190x, 6 ms → 1127 ms).

The vagaries of -G.
