How to optimize functor for transform iterators

We are porting a CFD code to GPU using thrust. our fields are defined on 4D lattices, and often we need to operate on subset of the lattice. For this, we combine a transform iterator with a permutation iterator. However, we take a significant performance hit when we do that, up toa factor 10 slower on CPU, and a factor 2 slower on GPU. The performance degradation in CPU is particularly worrisome since not all of our user base has access to GPU. That is why we use thrust in first place.
The permutation iterator has the structure

auto It = thrust::make_permutation_iterator(Iterator,
                                                thrust::make_transform_iterator(
                                                thrust::counting_iterator<unsigned int>(0), functor));

where Iterator is the standard iterator associated to host or device vectors, and functor is an instantiation of this struct

/*Functor that transforms a sequence of integers 0 <= idx < N into the position
indices of a subvolume of a volume. The volume is specified as [0, nx)x[0,ny)x[0,nz)x[0,nq),
while the subvolume  is given in numpy notation as [lx:-hx, ly:-hy, lz:-hz, lq:-hq]. 
N=(nx-(lx+hx))*(ny-(ly+hy))*etc. It is used to make transform iterators that are fed to 
permutation iterators to limit calculations to the subvolume.
*/
template <typename IndType>
struct subvolume_map
{
    IndType nx, ny, nz, nq, lx, ly, lz, lq, dx, dy, dz, dq;
    
    subvolume_map(std::array<IndType, 3> _n, std::array<IndType, 3> _l, std::array<IndType, 3> _h,
                  IndType _nq, IndType _lq, IndType _hq) : nx(_n[0]), ny(_n[1]), nz(_n[2]), nq(_nq),
                                                           lx(_l[0]), ly(_l[1]), lz(_l[2]), lq(_lq),
                                                           dx(_n[0] - (_l[0] + _h[0])),
                                                           dy(_n[1] - (_l[1] + _h[1])),
                                                           dz(_n[2] - (_l[2] + _h[2])) {}
                                                           
   

   
    __host__ __device__
        IndType
        operator()(IndType idx)
    {

        
        auto rx = idx / dx;
#ifndef __CUDA_ARCH__
        auto ix = idx % dx + lx;
#else
        auto ix = idx - (rx * dx) + lx;
#endif
        auto ry = rx / dy;
#ifndef __CUDA_ARCH__
        auto iy = rx % dy + ly;
#else
        auto iy = rx - (ry * dy) + ly;
#endif
        auto rz = ry / dz;
#ifndef __CUDA_ARCH__
        auto iz = ry % dz + lz;
#else
        auto iz = ry - (rz * dz) + lz;
#endif
        auto iq = rz + lq;

        return (((iq)*nz + iz) * ny + iy) * nx + ix;
        
        
        // }
    }
};

Note by setting _l, _h, _lq and _lh to zero, this functor is equivalent to thrust::identity. Even in this case, we observe large performance drops (as compared to using thrust::identity as the functor), which suggests that the problem is not with memory bandwidth issues (e.g., cache misses in CPU), but with the efficiency (or lack thereof) of the operator(). Any suggestion on how to improve on this state of affairs is welcome.

related question here:

https://stackoverflow.com/questions/54661513/what-is-the-most-efficient-way-to-apply-a-functor-to-a-subset-of-a-device-array

I don’t think this is a response to your question, but I want to point out that it is still possible to get bad performance with this.

Data access patterns on the GPU matter for performance. If your subvolume has a significant (say, 32 or larger) width in the x dimension, the performance of whatever you do will be noticeably better than if it does not. As you increase the width in the x dimension beyond 32, there will probably be additional significant gains.

Conversely, if you slice the volume only in one of the other directions, performance on the GPU will be markedly worse. This is a data layout/data access pattern problem, and there is no simple solution if you want to slice data in arbitrary subvolumes. Some slice patterns will be much much better than others. The trick is to organize your underlying data layout so that the most commonly used or performance-sensitive activities involve slicing the data preferentially along the x direction.

My focus is GPU here, but to some extent it will affect CPU usage as well. If you go outside of cache dimensions/capability, CPU performance will suffer greatly.

As an example, we use thrust::fill to fill an array of size 256x512x392x1 with zeros.
We set the parameter of subvolume_map so that the resulting permutation iterator is functionally identical to the standard iterator (i.e., subvolume_map(n)=n). These are the results we get

Using thrust with 6 threads on CPU
BandWidth (MB/s) using subvolume functor on array iterator 763
BandWidth (MB/s) using identity functor on array iterator 9822

Using thrust on GPU
BandWidth (MB/s) using subvolume functor 107131
BandWidth (MB/s) using identity functor 174938

In this test, the access pattern is the same between the two cases, so the difference (we think!) must be due to the cost of invoking subvolume_map(n) vs n.

At least on the GPU, I would expect the functor to be memory bandwidth bound for operations like filling memory, but that may not be the case. The functor does 4 integer divisions (the most expensive of the various operations it does), and this may be converting from memory bound to compute (integer) bound.

I’m not sure why you partitioned the functor behavior between CPU and GPU. I would expect modulo arithmetic to be more expensive than the realization I created. Have you compared both cases on the CPU?

The behavior of the functor (and the GPU in general) will be best (i.e. best amortized) if you do complex operations, not trivial operations like filling memory. That might be a worst case measurement.

The partitioning of the functor produced very modest gains, but we are talking a few percentage points at the most, and mostly on CPU, where x86 provides a single instruction to calculate / and remainder.
However, our concern is mostly with the severe degradation in performance on the CPU side (as I mentioned, not all our users
have access to GPUs). I agree that using fill exposes the worst possible scenario, but we observe similar behavior for more complicated operations, e.g. a+=a*b type of operations.

Interestingly, if we modify the functor operator() to check if nx==dx, ny==dy, nz==dz and if so to return
idx+(idx/(nxnynz)+lq)nxny*nz
we obtain a considerable speedup, in the sense that even on CPU we are within a factor 3 relative to using thrust::identity.
If indeed we set the problem so that the last offset is zero, and we return idx, then there is no degradation relative to using
thrust::identity. At least this shows that, despite the fact that the default functor copy constructor is invoked at every iteration (
we base this assertion on tests in which we had an explicit destructor that logged a call to it) it is not that part that slows down

As for the integer divisions and modulo operations with shared divisor on the GPU, generations of CUDA compilers have gone back and forth in their ability to optimize that into a division followed by backmultiply-subtraction to compute the modulo.

If you are concerned about the performance of this computation, I would suggest double-checking the generated machine code (SASS) to make sure you don’t get separate division / modulo operations, or as a conservative approach simply do the backmultiply-subtract for the modulo manually at HLL level.

Since there is no hardware support for division on the GPU, 64-bit integer divisions in particular are expensive, while 32-bit integer divisions are tolerable in cost (~20 instructions, if I recall correctly).

The CUDA toolchain uses standard optimizations for divisions by compile-time constants, so making some of those divisors template parameters would further speed up this computation.

Any discussion about the CPU and GPU performance should probably include information what CPUs and GPUs you are measuring on, as there is quite a bit of performance variability in GPU and CPU memory systems (in GPUs, mostly the type of memory i.e. HBM2 vs GDDR5, in CPUs mostly the number of DDR4 channels).

The original code (which I wrote) explicitly does backmultiply-subtract instead of modulo, since it follows the requisite division operation.

The claim here apparently on the CPU side is that the modulo operation is faster than backmultiply-subtract.

I see. I didn’t realize the code in the OP’s #1 had been written by you. It is true that the x86 IDIV instruction returns both quotient and remainder in one feel swoop. But it still requires a compiler optimization to recognize that a division and a modulo operation use the same divisor and thus can be accomplished with one IDIV.

The CUDA toolchain has that capability in principle but seems to regularly lose it for unknown reasons. I don’t know what the current status is.

Given that IDIV on x86 (~25 cycles) is much slower than IMUL (4 cycles), the backmultiply approach will be a bit slower compared to a single IDIV delivering both quotient and remainder, but not dramatically so.

Thank you both for your comments. I am going to post here the code we use for testing

#include <cmath>
#include <omp.h>
#include <cusp/array1d.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/fill.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <chrono>
/* simple timer utility based on std::chrono
 Example of Usage is 
 tictoc<milliseconds> T();
 T.tic();
 { Code to be timed}
 std::cout << " Time elapsed since last call to tic in milliseconds is " << T.toc() << std::endl;

 T.reset() // reset internal elapsed to 0.

*/
template <typename DurationUnits>
class tictoc
{
private:
    double m_elapsed;
    std::chrono::time_point<std::chrono::steady_clock> m_start;

public:
    tictoc(double _a = (double)0) : m_elapsed(_a) {}

    void tic()
    {
        m_start = std::chrono::steady_clock::now();
    }
    double toc()
    {
        auto stop = std::chrono::steady_clock::now();
        m_elapsed += (double)std::chrono::duration_cast<std::chrono::duration<DurationUnits>>(stop - m_start).count();
        return m_elapsed;
    }

    void reset()
    {
        m_elapsed = (double)0;
    }
};

/*Functor that transforms a sequence of integers 0 <= idx < N into the position
indices of a subvolume of a volume. The volume is specified as [0, nx)x[0,ny)x[0,nz)x[0,nq),
while the subvolume  is given in numpy notation as [lx:-hx, ly:-hy, lz:-hz, lq:-hq]. 
N=(nx-(lx+hx))*(ny-(ly+hy))*etc. It is used to make transform iterators that are fed to 
permutation iterators to limit calculations to the subvolume.
*/
template <typename IndType>
struct subvolume_map
{
    IndType nx, ny, nz, nq, lx, ly, lz, lq, dx, dy, dz, dq;

    subvolume_map(std::array<IndType, 3> _n, std::array<IndType, 3> _l, std::array<IndType, 3> _h,
                  IndType _nq, IndType _lq, IndType _hq) : nx(_n[0]), ny(_n[1]), nz(_n[2]), nq(_nq),
                                                           lx(_l[0]), ly(_l[1]), lz(_l[2]), lq(_lq),
                                                           dx(_n[0] - (_l[0] + _h[0])),
                                                           dy(_n[1] - (_l[1] + _h[1])),
                                                           dz(_n[2] - (_l[2] + _h[2])) {}

    __host__ __device__
        IndType
        operator()(IndType idx)
    {

#if 0
        auto rx = idx / dx;
        auto ix = idx % dx + lx;
        auto ry = rx / dy;
        auto iy = rx % dy + ly;
        auto rz = ry / dz;
        auto iz = ry % dz + lz;
        auto iq = rz + lq;
        return (((iq)*nz + iz) * ny + iy) * nx + ix;
#else
        auto rx = idx / dx;
        auto ix = idx - (rx * dx);
        auto ry = rx / dy;
        auto iy = rx - (ry * dy);
        auto rz = ry / dz;
        auto iz = ry - (rz * dz);
        auto iq = rz;
        return (((iq + lq) * nz + iz + lz) * ny + iy + ly) * nx + ix + lx;
#endif
    }
};

template <typename T, typename F>
decltype(tictoc<double>().toc()) testFunctor(T it, F &f, unsigned int Ni)
{
    auto It = thrust::make_permutation_iterator(it,
                                                thrust::make_transform_iterator(
                                                    thrust::counting_iterator<unsigned int>(0), f));
    tictoc<double> Tind;
    Tind.reset();
    cudaDeviceSynchronize();
    Tind.tic();
    thrust::fill(It, It + Ni, 0.);
    cudaDeviceSynchronize();
    return Tind.toc();
}

template <typename T>
decltype(tictoc<double>().toc()) testFunctor(T it, unsigned int Ni)
{
    auto It = thrust::make_permutation_iterator(it,
                                                thrust::make_transform_iterator(
                                                    thrust::counting_iterator<unsigned int>(0), thrust::identity<unsigned int>()));
    tictoc<double> Tind;
    Tind.reset();
    cudaDeviceSynchronize();
    Tind.tic();
    thrust::fill(It, It + Ni, 0.);
    cudaDeviceSynchronize();
    return Tind.toc();
}
int main()
{
    int nt = 1;

    unsigned int nx = 256, ny = 512, nz = 392;
    //unsigned int nx = 2, ny = 2, nz = 2;
    // for these values, the functor subvolume_map behaves like the identity.
    std::array<unsigned int, 3> n = {nx, ny, nz};
    std::array<unsigned int, 3> l = {0, 0, 0};
    std::array<unsigned int, 3> h = {0, 0, 0};
    unsigned int nq = 1;
    unsigned int lq = 0;
    unsigned int hq = 0;
    unsigned int Ni = (nx - l[0] - h[0]) * (ny - l[1] - h[1]) * (nz - l[2] - h[2]) * (nq - lq - hq);
    unsigned int Nt = (nx) * (ny) * (nz) * (nq);
    thrust::host_vector<double> V(Nt);
    thrust::device_vector<double> Vd(Nt);
    subvolume_map<unsigned int> SV(n, l, h, nq, lq, hq);
    while (nt < 13)
    {
        omp_set_num_threads(nt);
        std::cout << " --------------------- " << std::endl;
        std::cout << " Using thrust with " << omp_get_max_threads() << " threads on CPU " << std::endl;

        auto T = testFunctor(V.begin(), SV, Ni);

        std::cout << " BandWidth (MB/s) using subvolume functor on actual iterator " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;
        T = testFunctor(V.begin(), Ni);
        std::cout << " BandWidth (MB/s) using identity functor on actual iterator " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;
        nt += 1;
        
    }

    std::cout << " --------------------- " << std::endl;
    std::cout << " Using thrust  on GPU " << std::endl;

    auto T = testFunctor(Vd.begin(), SV, Ni);
    std::cout << " BandWidth (MB/s) using subvolume functor " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;
    T = testFunctor(Vd.begin(), Ni);
    std::cout << " BandWidth (MB/s) using identity functor " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;

    std::cout << " testing the functor " << std::endl;

    tictoc<double> Tind;
    Tind.reset();
    Tind.tic();
    unsigned int i = 0;
    
    #pragma omp parallel
    #pragma omp for
    for (i=0; i < Ni; ++i)
    {
        V[SV(i)]=0.;
    }
    T = Tind.toc();
    std::cout << V[0] << std::endl;
    std::cout << " Bandwidth (MB/s) for subvolume functor in while loop " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;

    Tind.reset();
    Tind.tic();
    i = 0;
    
     #pragma omp parallel
    #pragma omp for
    for (i = 0; i <Ni; ++i)
    {
        V[thrust::identity<unsigned int>()(i)]=0.;
    }
    T = Tind.toc();
    std::cout << V[0] << std::endl; // to avoid the compiler optimizing out the loop
    std::cout << " Bandwidth (MB/s) for identity functor in while loop " << int(Ni * sizeof(double) / T / 1.e6) << std::endl;
    return 0;
}

The difference between using the remainder vs backmultiply is actually pretty small. This is what we get using backmultiply


Using thrust with 1 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 650
BandWidth (MB/s) using identity functor on actual iterator 8704

Using thrust with 2 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 1290
BandWidth (MB/s) using identity functor on actual iterator 13111

Using thrust with 3 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 1943
BandWidth (MB/s) using identity functor on actual iterator 13310

Using thrust with 4 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 2586
BandWidth (MB/s) using identity functor on actual iterator 13248

Using thrust with 5 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3214
BandWidth (MB/s) using identity functor on actual iterator 13177

Using thrust with 6 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3841
BandWidth (MB/s) using identity functor on actual iterator 13131

Using thrust with 7 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 2812
BandWidth (MB/s) using identity functor on actual iterator 12853

Using thrust with 8 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3170
BandWidth (MB/s) using identity functor on actual iterator 13035

Using thrust with 9 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3374
BandWidth (MB/s) using identity functor on actual iterator 12938

Using thrust with 10 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3695
BandWidth (MB/s) using identity functor on actual iterator 13014

Using thrust with 11 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 4040
BandWidth (MB/s) using identity functor on actual iterator 13011

Using thrust with 12 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 4310
BandWidth (MB/s) using identity functor on actual iterator 13000

Using thrust on GPU
BandWidth (MB/s) using subvolume functor 94887
BandWidth (MB/s) using identity functor 157701
testing the functor
0
Bandwidth (MB/s) for subvolume functor in while loop 4449
0
Bandwidth (MB/s) for identity functor in while loop 12907

Whereas this is what we get using the remainder


Using thrust with 1 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 696
BandWidth (MB/s) using identity functor on actual iterator 8632

Using thrust with 2 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 1378
BandWidth (MB/s) using identity functor on actual iterator 13153

Using thrust with 3 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 2077
BandWidth (MB/s) using identity functor on actual iterator 13142

Using thrust with 4 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 2754
BandWidth (MB/s) using identity functor on actual iterator 13256

Using thrust with 5 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3450
BandWidth (MB/s) using identity functor on actual iterator 13178

Using thrust with 6 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 4116
BandWidth (MB/s) using identity functor on actual iterator 13202

Using thrust with 7 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 2898
BandWidth (MB/s) using identity functor on actual iterator 12563

Using thrust with 8 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3008
BandWidth (MB/s) using identity functor on actual iterator 13008

Using thrust with 9 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3382
BandWidth (MB/s) using identity functor on actual iterator 12903

Using thrust with 10 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 3757
BandWidth (MB/s) using identity functor on actual iterator 13074

Using thrust with 11 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 4107
BandWidth (MB/s) using identity functor on actual iterator 13013

Using thrust with 12 threads on CPU
BandWidth (MB/s) using subvolume functor on actual iterator 4393
BandWidth (MB/s) using identity functor on actual iterator 13045

Using thrust on GPU
BandWidth (MB/s) using subvolume functor 105597
BandWidth (MB/s) using identity functor 156678
testing the functor
0
Bandwidth (MB/s) for subvolume functor in while loop 4478
0
Bandwidth (MB/s) for identity functor in while loop 13073

The difference is minimal, compared to the difference between functor and identity. These tests were performed on a (oldish) machine with a Intel® Core™ i7 CPU X 990 @ 3.47GHz (6 physical cores, 12 virtual cores). Memory is DDR2. The GPU is a vintage GeForce GTX 680, 2GB ot total memory. nvcc V9.0, thrust V1.9.0, gcc V7.3.1, and compiled with -O3 optimization.

If I read the data correctly, on CPU the process seems CPU bound using the functor, since bandwidth saturates when we hit 6 threads, whereas using the identity functor it saturates at 2 threads. I guess the next step would be to inspect the assembly for the functor…

A bit older than “oldish”, I would claim :-) Intel’s ARK database says the memory subsystem of the CPU is based on DDR3-1066, providing a theoretical bandwidth of 25.6 GB/sec. The GTX 680 is the same vintage as the CPU (around 2012), with theoretical bandwidth of 192 GB/sec.

In light of this your measured performance numbers make a lot more sense than they did before. The small difference between using the modulo operation (optimized away by the compiler) and explicit backmultiply also makes perfect sense.

Still, enough to explain the factor 3 to 10 difference on CPU between functor and identity? On the GPU side, the identity approaches the theoretical limit, the functor is still almost half as slow…
And yes, I will use your comment to make a case for getting new machines with my boss :-)

I haven’t done any serious optimization work on x86 machines for 15+ years, so I cannot really comment on why the difference between functor and identity is so large for the CPU version. Generally speaking, I would say “the profiler is your friend”. VTUNE should be able to tell you quite a bit about bottlenecks in your code.

I am well aware that the universal lot of all software developers is to have ancient hardware furnished to them by their employers :-)

Mostly my focus has been on the GPU side of things. I believe that the observations here are basically correct, that the functor with the four division operations is cutting the achieved bandwidth approximately in half (on some GPUs - see below). I attribute this to the amount of integer pressure it is creating.

On a tesla P100, for example, I witness this:

Using thrust  on GPU
 BandWidth (MB/s) using subvolume functor 268889
 BandWidth (MB/s) using identity functor 575854

Tesla V100 has improved integer handling, and I observe this:

Using thrust  on GPU
 BandWidth (MB/s) using subvolume functor 848096
 BandWidth (MB/s) using identity functor 859069

I’ve done some GPU profiling to confirm that global store efficiency is 100% in both cases (division and identity functor) and that the total number of transactions is the same. The GPU profiler also reports approximately half dram bandwidth achieved (on the P100) for the division functor case.

So I would conclude that this 1D->4D->1D mapping functor is “expensive” in terms of integer pressure, and a possible workaround would be to use a GPU that has improved integer throughput (volta/turing). If you had an expensive enough operation to perform, the cost of the functor might be more readily amortized over more reads and writes.

If you could ensure that your subvolume always had dimensions that were powers of 2, you might be able to improve the functor by converting divisions to shift operations.

Since CUDA has the ability to launch 3 dimensional grids, much of the 1D->4D->1D index generation associated with thrust could be dispensed with, by converting to ordinary CUDA. It should still be possible to create basic templated operations that get passed to a “generic” kernel, such as with thrust. A simple transform kernel would be trivial to demonstrate in CUDA.

Thanks Robert. The numbers on the Tesla are impressive!

Mostly for the benefit of those who may land on this thread, I can confirm that Robert’s suggestion of using bitwise shift operators when the sublattice dimensions are a power of 2 produces a speed up
in CPU of a factor 2. On GPU, it depends: on our old GTX 680, the speedup is a factor 1.5.
We recently got a Titan V (thank you boss!), and there the speedup is negligible. However,
on the Titan V the original code was already very close to the identity functor, so not much room to
improve.