Help requested for a three-vector thrust::transform

Hi all,

I am trying to do something in Thrust akin to s = x*y + z, where all vectors are the same size and all operations are done element-wise. I understand how to program it via two separate thrust::transform() calls, but I am looking to reduce it to a single transform call to try to use the fused multiply-add (FMA) capability of the GPU hardware.

I’ve scrounged around on the internet and it seems like it would need some combination of zip_iterators and transform_iterators to get this to work, but the specifics appear to be eluding me.

Any help would be greatly appreciated.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/complex.h>
#include <chrono>

#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>

using namespace thrust::placeholders;
using namespace thrust;

#include <iostream>
typedef thrust::tuple<thrust::complex<float> const, thrust::complex<float> const, thrust::complex<float> const> triplet;

constexpr int repeat = 100000000;

int zip() {
  thrust::device_vector<thrust::complex<float>>
    d_vec1(repeat, thrust::complex<float>(1.0f, 1.0f)),
    d_vec2(repeat, thrust::complex<float>(1.0f, 1.0f)),
    d_vec3(repeat, thrust::complex<float>(1.0f, 1.0f)),
    d_vec4(repeat, thrust::complex<float>(1.0f, 1.0f));

  std::chrono::time_point<std::chrono::high_resolution_clock> t0 = std::chrono::high_resolution_clock::now();

  auto start_zip = thrust::make_zip_iterator(thrust::make_tuple(d_vec1.begin(), d_vec2.begin(), d_vec3.begin()));

  auto end_zip = thrust::make_zip_iterator(thrust::make_tuple(d_vec1.end(), d_vec2.end(), d_vec3.end()));

  // auto thisf = thrust::make_transform_iterator(start_zip, end_zip, func());

  thrust::transform(start_zip, end_zip, d_vec4, [=] __device__ (triplet t) {
    return thrust::get<0>(t) * thrust::get<1>(t) + thrust::get<2>(t);
  } );

  std::chrono::time_point<std::chrono::high_resolution_clock> t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}


int two_calls(){
  thrust::device_vector<thrust::complex<float>> 
    d_vec1(repeat,thrust::complex<float> (10.0, 2.0)), 
    d_vec2(repeat,thrust::complex<float> (10.0, 2.0)), 
    d_vec3(repeat,thrust::complex<float> (10.0, 2.0)),
    d_vec4(repeat,thrust::complex<float> (10.0, 2.0)); 

  std::chrono::time_point<std::chrono::high_resolution_clock> t0 = std::chrono::high_resolution_clock::now();

  thrust::transform(d_vec1.begin(), d_vec1.begin(), d_vec2.begin(), d_vec2.begin(), thrust::multiplies<complex<float>>());

  thrust::transform(d_vec2.begin(), d_vec2.begin(), d_vec3.begin(), d_vec4.begin(), thrust::plus<complex<float>>());

  std::chrono::time_point<std::chrono::high_resolution_clock> t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();

}

int main(){
    
    // warm-up: run zip() 5 times before the timed runs
    for(int i = 0; i < 5; i++){
        zip();
    }

  std::cout << "zip:" <<  zip() << "ms" << std::endl;
  std::cout << "two_calls:" << two_calls() << "ms" << std::endl;

}


The only problem I see in your code is here:

thrust::transform(start_zip, end_zip, d_vec4, [=] __device__ (triplet t) {
                                         ^^^

thrust::transform expects an iterator in that position. But d_vec4 is the name of a container. If you switch that to d_vec4.begin() you will be off to the races.

Also, don’t forget to compile your code with --extended-lambda
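
That is, the corrected call would look like this (a minimal sketch, using the same start_zip/end_zip iterators and triplet typedef from your code):

thrust::transform(start_zip, end_zip, d_vec4.begin(), [=] __device__ (triplet t) {
  return thrust::get<0>(t) * thrust::get<1>(t) + thrust::get<2>(t);
} );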

Thanks Robert for the solution and alternative ideas.
I like the idea of using _1*_2 + _3. I think that would make the most readable code. Is there a way to write it concisely like that in the end, or do you have to define a lambda for this type of math?

Sorry, I removed that code. I made an error. I suggest using the lambda method or an “ordinary” functor method.
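
For reference, a minimal sketch of the functor method, equivalent to the lambda (and it does not require the --extended-lambda flag):

struct my_func {
  template <typename T>
  __host__ __device__
  thrust::complex<float> operator()(T t){
    // element-wise x*y + z on the zipped triplet
    return thrust::get<0>(t) * thrust::get<1>(t) + thrust::get<2>(t);
  }
};

// used the same way as the lambda:
thrust::transform(start_zip, end_zip, d_vec4.begin(), my_func());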


Ok no worries. I was hoping for something nice and concise but a lambda works.

Would that type of placeholder math (or a three-input-vector transform) be something that a future Thrust release could include?


Hi Robert,

Based on the timing for repeat = 1000000, I get this result:

zip:49ms
two_calls:25ms

Based on your experience with CUDA, does it make sense that two separate thrust calls would be faster than a zipped tuple, or is there a flag I’m missing that would help?

You have various errors in your two_calls function; the thrust::transform calls are not constructed correctly.

Before attempting to confirm a performance comparison between two methods, first establish equivalence of the results.

When doing GPU timing of possibly asynchronous calls, it’s good practice to put a cudaDeviceSynchronize() before closing a host-based timing region (see the sketch below).

Make sure you are not compiling with -G when doing perf analysis.
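
In sketch form, the timed region should look something like this (a minimal sketch; the full code below does the same thing):

auto t0 = std::chrono::high_resolution_clock::now();
// ... thrust::transform call(s): kernel launches are asynchronous ...
cudaDeviceSynchronize();  // wait for device work to finish before reading the host clock
auto t1 = std::chrono::high_resolution_clock::now();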

When I make the following changes to your code and run it on my V100, I get a result that shows that the two calls method is slower than the zip method:

$ cat t2070.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/complex.h>
#include <chrono>

#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>

using namespace thrust::placeholders;
using namespace thrust;
using mt = thrust::complex<float>;

struct my_func {
  template <typename T>
  __host__ __device__
  mt operator()(T t){
    return thrust::get<0>(t) * thrust::get<1>(t) + thrust::get<2>(t);}
};

#include <iostream>
typedef thrust::tuple<mt const, mt const, mt const> triplet;

constexpr int repeat = 100000000;

int zip() {
  thrust::device_vector<mt>
    d_vec1(repeat,mt (10.0, 2.0)),
    d_vec2(repeat,mt (10.0, 2.0)),
    d_vec3(repeat,mt (10.0, 2.0)),
    d_vec4(repeat,mt (10.0, 2.0));

  std::chrono::time_point<std::chrono::high_resolution_clock> t0 = std::chrono::high_resolution_clock::now();

  auto start_zip = thrust::make_zip_iterator(thrust::make_tuple(d_vec1.begin(), d_vec2.begin(), d_vec3.begin()));

  auto end_zip = thrust::make_zip_iterator(thrust::make_tuple(d_vec1.end(), d_vec2.end(), d_vec3.end()));

  // auto thisf = thrust::make_transform_iterator(start_zip, end_zip, func());
#ifdef USE_LAMBDA
  thrust::transform(start_zip, end_zip, d_vec4.begin(), [=] __device__ (triplet t) {
    return thrust::get<0>(t) * thrust::get<1>(t) + thrust::get<2>(t);
  } );
#else
  thrust::transform(start_zip, end_zip, d_vec4.begin(), my_func());
#endif
  cudaDeviceSynchronize();
  std::chrono::time_point<std::chrono::high_resolution_clock> t1 = std::chrono::high_resolution_clock::now();
  mt r = d_vec4[0];
  std::cout << "zip: " << r.real() << std::endl;
  return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}


int two_calls(){
  thrust::device_vector<mt>
    d_vec1(repeat,mt (10.0, 2.0)),
    d_vec2(repeat,mt (10.0, 2.0)),
    d_vec3(repeat,mt (10.0, 2.0)),
    d_vec4(repeat,mt (10.0, 2.0));

  std::chrono::time_point<std::chrono::high_resolution_clock> t0 = std::chrono::high_resolution_clock::now();

  thrust::transform(d_vec1.begin(), d_vec1.end(), d_vec2.begin(), d_vec2.begin(), thrust::multiplies<mt>());

  thrust::transform(d_vec2.begin(), d_vec2.end(), d_vec3.begin(), d_vec4.begin(), thrust::plus<mt>());
  cudaDeviceSynchronize();
  std::chrono::time_point<std::chrono::high_resolution_clock> t1 = std::chrono::high_resolution_clock::now();
  mt r = d_vec4[0];
  std::cout << "two calls: " << r.real() << std::endl;
  return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();

}

int main(){

  std::cout << "zip:" <<  zip() << "ms" << std::endl;
  std::cout << "two_calls:" << two_calls() << "ms" << std::endl;

}
$ nvcc -o t2070 t2070.cu -DUSE_LAMBDA --extended-lambda
$ ./t2070
zip:zip: 106
4ms
two_calls:two calls: 106
6ms
$

Thank you Robert for the help. I did catch the wrong iterator (begin() instead of end()) after I posted but didn’t update the post. Moving forward, I’ll update the posts. Sorry about that.

Thanks for the catch about -G. I didn’t realize the performance penalty it would cause. Removing -G on my end showed comparative results similar to yours when using your posted code (zip being faster; specifics below).

As for playing around with lambda vs. my_func(), on my end they performed the same over multiple runs, so I guess they both compile down to the same instructions. Neat.

Tesla K80

[cat159@node69 testing-zip]$ nvcc -std=c++11 alt.cu --expt-extended-lambda -DUSE_LAMBDA
[cat159@node69 testing-zip]$ ./a.out 
zip: 106
zip:24ms
two calls: 106
two_calls:29ms

NVIDIA GeForce RTX 2070 with Max-Q Design (shared laptop so maybe someone is running something in the background)

➜  nvcc -std=c++11 alt.cu --expt-extended-lambda -DUSE_LAMBDA
➜  ./a.out                                      
zip:zip: 106
205ms
two_calls:two calls: 106
369ms

This is the output I get when I run the same test compiling with -G:

$ nvcc -o t2070 t2070.cu -DUSE_LAMBDA --extended-lambda -G
$ ./t2070
zip:zip: 106
2226ms
two_calls:two calls: 106
1127ms
$

The zip/lambda duration increases by a factor of ~500x (4 ms → 2226 ms), while the two-calls duration changes by a noticeably different factor (~190x, 6 ms → 1127 ms).

The vagaries of -G.
