How to optimize functor for transform iterators

ululi1970 · April 17, 2019, 6:03pm

We are porting a CFD code to GPU using thrust. our fields are defined on 4D lattices, and often we need to operate on subset of the lattice. For this, we combine a transform iterator with a permutation iterator. However, we take a significant performance hit when we do that, up toa factor 10 slower on CPU, and a factor 2 slower on GPU. The performance degradation in CPU is particularly worrisome since not all of our user base has access to GPU. That is why we use thrust in first place.
The permutation iterator has the structure

auto It = thrust::make_permutation_iterator(Iterator,
                                                thrust::make_transform_iterator(
                                                thrust::counting_iterator<unsigned int>(0), functor));

where Iterator is the standard iterator associated to host or device vectors, and functor is an instantiation of this struct

/*Functor that transforms a sequence of integers 0 <= idx < N into the position
indices of a subvolume of a volume. The volume is specified as [0, nx)x[0,ny)x[0,nz)x[0,nq),
while the subvolume  is given in numpy notation as [lx:-hx, ly:-hy, lz:-hz, lq:-hq]. 
N=(nx-(lx+hx))*(ny-(ly+hy))*etc. It is used to make transform iterators that are fed to 
permutation iterators to limit calculations to the subvolume.
*/
template <typename IndType>
struct subvolume_map
{
    IndType nx, ny, nz, nq, lx, ly, lz, lq, dx, dy, dz, dq;
    
    subvolume_map(std::array<IndType, 3> _n, std::array<IndType, 3> _l, std::array<IndType, 3> _h,
                  IndType _nq, IndType _lq, IndType _hq) : nx(_n[0]), ny(_n[1]), nz(_n[2]), nq(_nq),
                                                           lx(_l[0]), ly(_l[1]), lz(_l[2]), lq(_lq),
                                                           dx(_n[0] - (_l[0] + _h[0])),
                                                           dy(_n[1] - (_l[1] + _h[1])),
                                                           dz(_n[2] - (_l[2] + _h[2])) {}
                                                           
   

   
    __host__ __device__
        IndType
        operator()(IndType idx)
    {

        
        auto rx = idx / dx;
#ifndef __CUDA_ARCH__
        auto ix = idx % dx + lx;
#else
        auto ix = idx - (rx * dx) + lx;
#endif
        auto ry = rx / dy;
#ifndef __CUDA_ARCH__
        auto iy = rx % dy + ly;
#else
        auto iy = rx - (ry * dy) + ly;
#endif
        auto rz = ry / dz;
#ifndef __CUDA_ARCH__
        auto iz = ry % dz + lz;
#else
        auto iz = ry - (rz * dz) + lz;
#endif
        auto iq = rz + lq;

        return (((iq)*nz + iz) * ny + iy) * nx + ix;
        
        
        // }
    }
};

Note by setting _l, _h, _lq and _lh to zero, this functor is equivalent to thrust::identity. Even in this case, we observe large performance drops (as compared to using thrust::identity as the functor), which suggests that the problem is not with memory bandwidth issues (e.g., cache misses in CPU), but with the efficiency (or lack thereof) of the operator(). Any suggestion on how to improve on this state of affairs is welcome.

Robert_Crovella · April 17, 2019, 6:45pm

Using thrust with 6 threads on CPU
BandWidth (MB/s) using subvolume functor on array iterator 763
BandWidth (MB/s) using identity functor on array iterator 9822

Using thrust on GPU
BandWidth (MB/s) using subvolume functor 107131
BandWidth (MB/s) using identity functor 174938

In this test, the access pattern is the same between the two cases, so the difference (we think!) must be due to the cost of invoking subvolume_map(n) vs n.

Robert_Crovella · April 17, 2019, 7:33pm

At least on the GPU, I would expect the functor to be memory bandwidth bound for operations like filling memory, but that may not be the case. The functor does 4 integer divisions (the most expensive of the various operations it does), and this may be converting from memory bound to compute (integer) bound.

I’m not sure why you partitioned the functor behavior between CPU and GPU. I would expect modulo arithmetic to be more expensive than the realization I created. Have you compared both cases on the CPU?

The behavior of the functor (and the GPU in general) will be best (i.e. best amortized) if you do complex operations, not trivial operations like filling memory. That might be a worst case measurement.

ululi1970 · April 18, 2019, 2:46am

The partitioning of the functor produced very modest gains, but we are talking a few percentage points at the most, and mostly on CPU, where x86 provides a single instruction to calculate / and remainder.
However, our concern is mostly with the severe degradation in performance on the CPU side (as I mentioned, not all our users
have access to GPUs). I agree that using fill exposes the worst possible scenario, but we observe similar behavior for more complicated operations, e.g. a+=a*b type of operations.

Interestingly, if we modify the functor operator() to check if nx==dx, ny==dy, nz==dz and if so to return
idx+(idx/(nxnynz)+lq)nxny*nz
we obtain a considerable speedup, in the sense that even on CPU we are within a factor 3 relative to using thrust::identity.
If indeed we set the problem so that the last offset is zero, and we return idx, then there is no degradation relative to using
thrust::identity. At least this shows that, despite the fact that the default functor copy constructor is invoked at every iteration (
we base this assertion on tests in which we had an explicit destructor that logged a call to it) it is not that part that slows down

njuffa · April 18, 2019, 3:04am

As for the integer divisions and modulo operations with shared divisor on the GPU, generations of CUDA compilers have gone back and forth in their ability to optimize that into a division followed by backmultiply-subtraction to compute the modulo.

If you are concerned about the performance of this computation, I would suggest double-checking the generated machine code (SASS) to make sure you don’t get separate division / modulo operations, or as a conservative approach simply do the backmultiply-subtract for the modulo manually at HLL level.

Since there is no hardware support for division on the GPU, 64-bit integer divisions in particular are expensive, while 32-bit integer divisions are tolerable in cost (~20 instructions, if I recall correctly).

The CUDA toolchain uses standard optimizations for divisions by compile-time constants, so making some of those divisors template parameters would further speed up this computation.

Any discussion about the CPU and GPU performance should probably include information what CPUs and GPUs you are measuring on, as there is quite a bit of performance variability in GPU and CPU memory systems (in GPUs, mostly the type of memory i.e. HBM2 vs GDDR5, in CPUs mostly the number of DDR4 channels).

Robert_Crovella · April 18, 2019, 5:04am

The original code (which I wrote) explicitly does backmultiply-subtract instead of modulo, since it follows the requisite division operation.

The claim here apparently on the CPU side is that the modulo operation is faster than backmultiply-subtract.

njuffa · April 18, 2019, 6:36am

I see. I didn’t realize the code in the OP’s #1 had been written by you. It is true that the x86 IDIV instruction returns both quotient and remainder in one feel swoop. But it still requires a compiler optimization to recognize that a division and a modulo operation use the same divisor and thus can be accomplished with one IDIV.

The CUDA toolchain has that capability in principle but seems to regularly lose it for unknown reasons. I don’t know what the current status is.

Given that IDIV on x86 (~25 cycles) is much slower than IMUL (4 cycles), the backmultiply approach will be a bit slower compared to a single IDIV delivering both quotient and remainder, but not dramatically so.

ululi1970 · April 18, 2019, 4:15pm

Thank you both for your comments. I am going to post here the code we use for testing

#include <cmath>
#include <omp.h>
#include <cusp/array1d.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/fill.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <chrono>
/* simple timer utility based on std::chrono
 Example of Usage is 
 tictoc<milliseconds> T();
 T.tic();
 { Code to be timed}
 std::cout << " Time elapsed since last call to tic in milliseconds is " << T.toc() << std::endl;

 T.reset() // reset internal elapsed to 0.

*/
template <typename DurationUnits>
class tictoc
{
private:
    double m_elapsed;
    std::chrono::time_point<std::chrono::steady_clock> m_start;

public:
    tictoc(double _a = (double)0) : m_elapsed(_a) {}

    void tic()
    {
        m_start = std::chrono::steady_clock::now();
    }
    double toc()
    {
        auto stop = std::chrono::steady_clock::now();
        m_elapsed += (double)std::chrono::duration_cast<std::chrono::duration<DurationUnits>>(stop - m_start).count();
        return m_elapsed;
    }

    void reset()
    {
        m_elapsed = (double)0;
    }
};

/*Functor that transforms a sequence of integers 0 <= idx < N into the position
indices of a subvolume of a volume. The volume is specified as [0, nx)x[0,ny)x[0,nz)x[0,nq),
while the subvolume  is given in numpy notation as [lx:-hx, ly:-hy, lz:-hz, lq:-hq]. 
N=(nx-(lx+hx))*(ny-(ly+hy))*etc. It is used to make transform iterators that are fed to 
permutation iterators to limit calculations to the subvolume.
*/
template <typename IndType>
struct subvolume_map
{
    IndType nx, ny, nz, nq, lx, ly, lz, lq, dx, dy, dz, dq;

    subvolume_map(std::array<IndType, 3> _n, std::array<IndType, 3> _l, std::array<IndType, 3> _h,
                  IndType _nq, IndType _lq, IndType _hq) : nx(_n[0]), ny(_n[1]), nz(_n[2]), nq(_nq),
                                                           lx(_l[0]), ly(_l[1]), lz(_l[2]), lq(_lq),
                                                           dx(_n[0] - (_l[0] + _h[0])),
                                                           dy(_n[1] - (_l[1] + _h[1])),
                                                           dz(_n[2] - (_l[2] + _h[2])) {}

    __host__ __device__
        IndType
        operator()(IndType idx)
    {

#if 0
        auto rx = idx / dx;
        auto ix = idx % dx + lx;
        auto ry = rx / dy;
        auto iy = rx % dy + ly;
        auto rz = ry / dz;
        auto iz = ry % dz + lz;
        auto iq = rz + lq;
        return (((iq)*nz + iz) * ny + iy) * nx + ix;
#else
        auto rx = idx / dx;
        auto ix = idx - (rx * dx);
        auto ry = rx / dy;
        auto iy = rx - (ry * dy);
        auto rz = ry / dz;
        auto iz = ry - (rz * dz);
        auto iq = rz;
        return (((iq + lq) * nz + iz + lz) * ny + iy + ly) * nx + ix + lx;
#endif
    }
};

template <typename T, typename F>
decltype(tictoc<double>().toc()) testFunctor(T it, F &f, unsigned int Ni)
{
    auto It = thrust::make_permutation_iterator(it,
                                                thrust::make_transform_iterator(
                                                    thrust::counting_iterator<unsigned int>(0), f));
    tictoc<double> Tind;
    Tind.reset();
    cudaDeviceSynchronize();
    Tind.tic();
    thrust::fill(It, It + Ni, 0.);
    cudaDeviceSynchronize();
    return Tind.toc();
}

template <typename T>
decltype(tictoc<double>().toc()) testFunctor(T it, unsigned int Ni)
{
    auto It = thrust::make_permutation_iterator(it,
                                                thrust::make_transform_iterator(
                                                    thrust::counting_iterator<unsigned int>(0), thrust::identity<unsigned int>()));
    tictoc<double> Tind;
    Tind.reset();
    cudaDeviceSynchronize();
    Tind.tic();
    thrust::fill(It, It + Ni, 0.);
    cudaDeviceSynchronize();
    return Tind.toc();
}
int main()
{
    int nt = 1;

    unsigned int nx = 256, ny = 512, nz = 392;
    //unsigned int nx = 2, ny = 2, nz = 2;
    // for these values, the functor subvolume_map behaves like the identity.
    std::array<unsigned int, 3> n = {nx, ny, nz};
    std::array<unsigned int, 3> l = {0, 0, 0};
    std::array<unsigned int, 3> h = {0, 0, 0};
    unsigned int nq = 1;
    unsigned int lq = 0;
    unsigned int hq = 0;
    unsigned int Ni = (nx - l[0] - h[0]) * (ny - l[1] - h[1]) * (nz - l[2] - h[2]) * (nq - lq - hq);
    unsigned int Nt = (nx) * (ny) * (nz) * (nq);
    thrust::host_vector<double> V(Nt);
    thrust::device_vector<double> Vd(Nt);
    subvolume_map<unsigned int> SV(n, l, h, nq, lq, hq);
    while (nt < 13)
    {
        omp_set_num_threads(nt);
        std::cout << " --------------------- " << std::endl;
        std::cout << " Using thrust with " << omp_get_max_threads() << " threads on CPU " << std::endl;

        auto T = testFunctor(V.begin(), SV, Ni);

        std::cout << " BandWidth (MB/s) using subvolume functor on actual iterator " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;
        T = testFunctor(V.begin(), Ni);
        std::cout << " BandWidth (MB/s) using identity functor on actual iterator " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;
        nt += 1;
        
    }

    std::cout << " --------------------- " << std::endl;
    std::cout << " Using thrust  on GPU " << std::endl;

    auto T = testFunctor(Vd.begin(), SV, Ni);
    std::cout << " BandWidth (MB/s) using subvolume functor " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;
    T = testFunctor(Vd.begin(), Ni);
    std::cout << " BandWidth (MB/s) using identity functor " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;

    std::cout << " testing the functor " << std::endl;

    tictoc<double> Tind;
    Tind.reset();
    Tind.tic();
    unsigned int i = 0;
    
    #pragma omp parallel
    #pragma omp for
    for (i=0; i < Ni; ++i)
    {
        V[SV(i)]=0.;
    }
    T = Tind.toc();
    std::cout << V[0] << std::endl;
    std::cout << " Bandwidth (MB/s) for subvolume functor in while loop " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;

    Tind.reset();
    Tind.tic();
    i = 0;
    
     #pragma omp parallel
    #pragma omp for
    for (i = 0; i <Ni; ++i)
    {
        V[thrust::identity<unsigned int>()(i)]=0.;
    }
    T = Tind.toc();
    std::cout << V[0] << std::endl; // to avoid the compiler optimizing out the loop
    std::cout << " Bandwidth (MB/s) for identity functor in while loop " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;
    return 0;
}

The difference between using the remainder vs backmultiply is actually pretty small. This is what we get using backmultiply

Using thrust with 1 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 650
BandWidth (MB/s) using identity functor on actual iterator 8704

Using thrust with 2 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 1290
BandWidth (MB/s) using identity functor on actual iterator 13111

Using thrust with 3 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 1943
BandWidth (MB/s) using identity functor on actual iterator 13310

Using thrust with 4 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 2586
BandWidth (MB/s) using identity functor on actual iterator 13248

Using thrust with 5 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3214
BandWidth (MB/s) using identity functor on actual iterator 13177

Using thrust with 6 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3841
BandWidth (MB/s) using identity functor on actual iterator 13131

Using thrust with 7 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 2812
BandWidth (MB/s) using identity functor on actual iterator 12853

Using thrust with 8 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3170
BandWidth (MB/s) using identity functor on actual iterator 13035

Using thrust with 9 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3374
BandWidth (MB/s) using identity functor on actual iterator 12938

Using thrust with 10 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3695
BandWidth (MB/s) using identity functor on actual iterator 13014

Using thrust with 11 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 4040
BandWidth (MB/s) using identity functor on actual iterator 13011

Using thrust with 12 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 4310
BandWidth (MB/s) using identity functor on actual iterator 13000

Using thrust on GPU
BandWidth (MB/s) using subvolume functor 94887
BandWidth (MB/s) using identity functor 157701
testing the functor
0
Bandwidth (MB/s) for subvolume functor in while loop 4449
0
Bandwidth (MB/s) for identity functor in while loop 12907

Whereas this is what we get using the remainder

Using thrust with 1 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 696
BandWidth (MB/s) using identity functor on actual iterator 8632

Using thrust with 2 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 1378
BandWidth (MB/s) using identity functor on actual iterator 13153

Using thrust with 3 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 2077
BandWidth (MB/s) using identity functor on actual iterator 13142

Using thrust with 4 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 2754
BandWidth (MB/s) using identity functor on actual iterator 13256

Using thrust with 5 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3450
BandWidth (MB/s) using identity functor on actual iterator 13178

Using thrust with 6 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 4116
BandWidth (MB/s) using identity functor on actual iterator 13202

Using thrust with 7 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 2898
BandWidth (MB/s) using identity functor on actual iterator 12563

Using thrust with 8 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3008
BandWidth (MB/s) using identity functor on actual iterator 13008

Using thrust with 9 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3382
BandWidth (MB/s) using identity functor on actual iterator 12903

Using thrust with 10 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3757
BandWidth (MB/s) using identity functor on actual iterator 13074

Using thrust with 11 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 4107
BandWidth (MB/s) using identity functor on actual iterator 13013

Using thrust with 12 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 4393
BandWidth (MB/s) using identity functor on actual iterator 13045

Using thrust on GPU
BandWidth (MB/s) using subvolume functor 105597
BandWidth (MB/s) using identity functor 156678
testing the functor
0
Bandwidth (MB/s) for subvolume functor in while loop 4478
0
Bandwidth (MB/s) for identity functor in while loop 13073

The difference is minimal, compared to the difference between functor and identity. These tests were performed on a (oldish) machine with a Intel(R) Core™ i7 CPU X 990 @ 3.47GHz (6 physical cores, 12 virtual cores). Memory is DDR2. The GPU is a vintage GeForce GTX 680, 2GB ot total memory. nvcc V9.0, thrust V1.9.0, gcc V7.3.1, and compiled with -O3 optimization.

If I read the data correctly, on CPU the process seems CPU bound using the functor, since bandwidth saturates when we hit 6 threads, whereas using the identity functor it saturates at 2 threads. I guess the next step would be to inspect the assembly for the functor…

njuffa · April 18, 2019, 4:40pm

A bit older than “oldish”, I would claim :-) Intel’s ARK database says the memory subsystem of the CPU is based on DDR3-1066, providing a theoretical bandwidth of 25.6 GB/sec. The GTX 680 is the same vintage as the CPU (around 2012), with theoretical bandwidth of 192 GB/sec.

In light of this your measured performance numbers make a lot more sense than they did before. The small difference between using the modulo operation (optimized away by the compiler) and explicit backmultiply also makes perfect sense.

ululi1970 · April 18, 2019, 4:55pm

Still, enough to explain the factor 3 to 10 difference on CPU between functor and identity? On the GPU side, the identity approaches the theoretical limit, the functor is still almost half as slow…
And yes, I will use your comment to make a case for getting new machines with my boss :-)

njuffa · April 18, 2019, 5:05pm

I haven’t done any serious optimization work on x86 machines for 15+ years, so I cannot really comment on why the difference between functor and identity is so large for the CPU version. Generally speaking, I would say “the profiler is your friend”. VTUNE should be able to tell you quite a bit about bottlenecks in your code.

I am well aware that the universal lot of all software developers is to have ancient hardware furnished to them by their employers :-)

Robert_Crovella · April 22, 2019, 6:47pm

Mostly my focus has been on the GPU side of things. I believe that the observations here are basically correct, that the functor with the four division operations is cutting the achieved bandwidth approximately in half (on some GPUs - see below). I attribute this to the amount of integer pressure it is creating.

On a tesla P100, for example, I witness this:

Using thrust  on GPU
 BandWidth (MB/s) using subvolume functor 268889
 BandWidth (MB/s) using identity functor 575854

Tesla V100 has improved integer handling, and I observe this:

Using thrust  on GPU
 BandWidth (MB/s) using subvolume functor 848096
 BandWidth (MB/s) using identity functor 859069

I’ve done some GPU profiling to confirm that global store efficiency is 100% in both cases (division and identity functor) and that the total number of transactions is the same. The GPU profiler also reports approximately half dram bandwidth achieved (on the P100) for the division functor case.

So I would conclude that this 1D->4D->1D mapping functor is “expensive” in terms of integer pressure, and a possible workaround would be to use a GPU that has improved integer throughput (volta/turing). If you had an expensive enough operation to perform, the cost of the functor might be more readily amortized over more reads and writes.

If you could ensure that your subvolume always had dimensions that were powers of 2, you might be able to improve the functor by converting divisions to shift operations.

Since CUDA has the ability to launch 3 dimensional grids, much of the 1D->4D->1D index generation associated with thrust could be dispensed with, by converting to ordinary CUDA. It should still be possible to create basic templated operations that get passed to a “generic” kernel, such as with thrust. A simple transform kernel would be trivial to demonstrate in CUDA.

ululi1970 · April 23, 2019, 6:08pm

Thanks Robert. The numbers on the Tesla are impressive!

ululi1970 · June 19, 2019, 2:28pm

Mostly for the benefit of those who may land on this thread, I can confirm that Robert’s suggestion of using bitwise shift operators when the sublattice dimensions are a power of 2 produces a speed up
in CPU of a factor 2. On GPU, it depends: on our old GTX 680, the speedup is a factor 1.5.
We recently got a Titan V (thank you boss!), and there the speedup is negligible. However,
on the Titan V the original code was already very close to the identity functor, so not much room to
improve.

Topic		Replies	Views
Accuracy of 1D linear interpolation by CUDA texture interpolation CUDA Programming and Performance	25	14598	January 29, 2013
advice needed by a PhD student CUDA Programming and Performance	26	2847	December 4, 2011
Possible to use the CUDA math API integer intrinsics to find the nth unset bit in a 32 bit int CUDA Programming and Performance	37	8375	March 1, 2015
My simple but speedy reduction code (runs 106.4GB/s on GTX 295) 106.4/111.9=95.1% to the peak bandwi CUDA Programming and Performance	32	28214	August 15, 2010
Memory problem? ...incredible slowdown CUDA Programming and Performance	29	16290	January 30, 2011
Performance Issues with CUDA and Python/R CUDA Programming and Performance	16	6233	August 26, 2016
cuda fortran questions Legacy PGI Compilers	10	10942	July 27, 2012
Modern GPU CUDA Programming and Performance	30	5655	April 11, 2016
CUDA very slow performance CUDA Programming and Performance	21	16462	March 6, 2020
Efficient use of shared memory CUDA Programming and Performance	29	4221	December 2, 2019

How to optimize functor for transform iterators

Using thrust with 6 threads on CPU BandWidth (MB/s) using subvolume functor on array iterator 763 BandWidth (MB/s) using identity functor on array iterator 9822

Using thrust with 1 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 650 BandWidth (MB/s) using identity functor on actual iterator 8704

Using thrust with 2 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 1290 BandWidth (MB/s) using identity functor on actual iterator 13111

Using thrust with 3 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 1943 BandWidth (MB/s) using identity functor on actual iterator 13310

Using thrust with 4 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 2586 BandWidth (MB/s) using identity functor on actual iterator 13248

Using thrust with 5 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 3214 BandWidth (MB/s) using identity functor on actual iterator 13177

Using thrust with 6 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 3841 BandWidth (MB/s) using identity functor on actual iterator 13131

Using thrust with 7 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 2812 BandWidth (MB/s) using identity functor on actual iterator 12853

Using thrust with 8 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 3170 BandWidth (MB/s) using identity functor on actual iterator 13035

Using thrust with 9 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 3374 BandWidth (MB/s) using identity functor on actual iterator 12938

Using thrust with 10 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 3695 BandWidth (MB/s) using identity functor on actual iterator 13014

Using thrust with 11 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 4040 BandWidth (MB/s) using identity functor on actual iterator 13011

Using thrust with 12 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 4310 BandWidth (MB/s) using identity functor on actual iterator 13000

Using thrust with 1 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 696 BandWidth (MB/s) using identity functor on actual iterator 8632

Using thrust with 2 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 1378 BandWidth (MB/s) using identity functor on actual iterator 13153

Using thrust with 3 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 2077 BandWidth (MB/s) using identity functor on actual iterator 13142

Using thrust with 4 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 2754 BandWidth (MB/s) using identity functor on actual iterator 13256

Using thrust with 5 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 3450 BandWidth (MB/s) using identity functor on actual iterator 13178

Using thrust with 6 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 4116 BandWidth (MB/s) using identity functor on actual iterator 13202

Using thrust with 7 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 2898 BandWidth (MB/s) using identity functor on actual iterator 12563

Using thrust with 8 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 3008 BandWidth (MB/s) using identity functor on actual iterator 13008

Using thrust with 9 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 3382 BandWidth (MB/s) using identity functor on actual iterator 12903

Using thrust with 10 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 3757 BandWidth (MB/s) using identity functor on actual iterator 13074

Using thrust with 11 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 4107 BandWidth (MB/s) using identity functor on actual iterator 13013

Using thrust with 12 threads on CPU BandWidth (MB/s) using subvolume functor on actual iterator 4393 BandWidth (MB/s) using identity functor on actual iterator 13045

Related topics