Thrust functor returning an array; code optimization

Hello,
I’m working on a C++ project using Thrust, and I have some code that can perhaps be optimized; I’m not sure. Below is a minimal working example. The input vector can be thought of as a matrix in row-major format. In the real application, the matrix is on the scale of rows = 10, columns = 10^7. The functor operates on the columns of the matrix and writes the result back into the same column (number of threads = number of columns). The solution I have found uses the functor’s input pointer (see “arr” below) to read and write the data. My questions are: would the code perform faster if one used iterators (in combination with thrust::transform) instead of the functor input? If so, what combination of iterators should one use, and how would one implement a functor that returns an array? Or is there perhaps a better approach altogether?

Thank you.

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <iostream>

struct functor 
{
  const int R;
  const int C;
  float *arr;
  float *tempVec = NULL;
  float *condVec = NULL;
  
  functor(int _R, int _C, float *_arr) : R(_R), C(_C), arr(_arr) {};

  __host__ __device__
  void operator()(int i) {

    tempVec = new float[R];
    condVec = new float[R];
    
    // Collect the column entry into a temporary array
    int wi1 = 0;
    for(int j = i; j <= i+(R-1)*C; j = j + C){
        tempVec[wi1] = arr[j];
        condVec[wi1] = arr[j];
        wi1++;  
    }

    /* The main body of the functor starts here and
       in a real application, it is more complicated.
       In this example, I reduced it to some simple
       operation */
    thrust::sort(thrust::device, tempVec, tempVec + R);  
    float thr = tempVec[0]*tempVec[0] - tempVec[R-1];
    /* End of main body */ 


    // write the modified column back to the array
    int wi2 = 0; 
    for(int j = i; j <= i+(R-1)*C; j = j + C){
        
      if( (condVec[wi2] - thr) < 0.0){
        arr[j] = 0.0;
      }else{
        arr[j] = condVec[wi2] - thr;
      }

      wi2++;  
    }

    delete [] tempVec;
    delete [] condVec;

  }
};


int main() {

    const int R = 3; // this will be the rows
    const int C = 4; // and this the columns

    const int vecL = C*R; 

    // initialize host array
    float x[vecL] = {1.0, 2.0, 3.0, 1.5,  4.0, 6.0, 3.0, 3.0,  2.0, 5.4, 1.3, 3.2};

    // transfer to device
    thrust::device_vector<float> d_x(x, x + vecL);

    // call functor
    thrust::for_each_n(thrust::device, thrust::counting_iterator<size_t>(0), C, functor(R, C, thrust::raw_pointer_cast(d_x.data())));
    cudaDeviceSynchronize();
    
    // copy result to host
    thrust::host_vector<float> result_host = d_x;

    // print out result
    for(int i = 0; i<vecL; i++)
      std::cout << result_host[i] << "  ";

    std::cout<< "\n";   

    // result: 4  4  4.31  2.45  7  8  4.31  3.95  5  7.4  2.61  4.15

    return 0;
}

What you have might be a fairly efficient approach. At first glance it appears that your global loads and stores will be coalesced, the amount of data operated on per thread is small (~10 elements), and you should easily be able to launch enough threads (10^7) to saturate any GPU.

The only place I would look for an obvious optimization would be to get the new/delete out of the functor and use a pre-allocated scratchpad instead. This would require O(N) storage, which is typical for an operation like sort anyway. If your GPU is on the smaller side memory-wise, you may run out of memory trying to preallocate the scratchpad, which would be ~400MB. Since in-kernel new/delete is not terribly fast, I would expect this to make a noticeable performance improvement, especially if you are doing this operation more than once.

Converting this operation to something that is more “thrust-like”, with iterators and algorithms, would most likely require you to:

  1. Identify a sequence of operations (thrust algorithms) that does what you want
  2. Devise a segmentation scheme for these algorithms
  3. Launch the algorithms with a large thread array. So instead of launching across a workspace of 10^7, you would use a workspace of 10^8, for example.

If we were to pretend that the sort+thresholding is what you were actually doing, the first task would be to understand how to do a segmented sort in thrust. There are many writeups on this. It would actually probably require 2 sorting operations, one to bring like segments together (currently they occupy separate columns) and then one to do the actual sorting. Both sorts would probably require custom functors. After that, you would have all of your sorts done, and then it becomes an indexing exercise to get each thread to make the proper comparison for the thresholding operation. The final sequence might be something like 2 sorts, followed by a thrust::transform, followed by a thrust::copy_if, for example.
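For illustration only, here is a rough sketch of the two-sort (segmented sort) idea on the same 3x4 example; the column_of functor and the d_vals/d_keys names are made up for this sketch. Note that after the two stable sorts the data end up grouped column by column (each column sorted), so an extra gather/scatter step would still be needed to put the result back into the row-major layout:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

// maps a flat row-major index to its column number
struct column_of {
  int C;
  __host__ __device__ int operator()(int i) const { return i % C; }
};

int main() {
  const int R = 3, C = 4;
  float x[R*C] = {1.0, 2.0, 3.0, 1.5,  4.0, 6.0, 3.0, 3.0,  2.0, 5.4, 1.3, 3.2};
  thrust::device_vector<float> d_vals(x, x + R*C);

  // key = column index of each element
  thrust::device_vector<int> d_keys(R*C);
  thrust::transform(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(R*C),
                    d_keys.begin(), column_of{C});

  // sort 1: order by value, carrying the column keys along
  thrust::stable_sort_by_key(d_vals.begin(), d_vals.end(), d_keys.begin());
  // sort 2: stable sort by column key; values within each column stay sorted
  thrust::stable_sort_by_key(d_keys.begin(), d_keys.end(), d_vals.begin());

  // d_vals now holds column 0 sorted, then column 1 sorted, and so on
  return 0;
}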

It’s a fair amount of work to recast what you have shown that way, and it’s not at all obvious to me that it would be any faster, so I leave that as an exercise for the reader. In addition, you’ve already indicated that sort is not what you are doing, so I wouldn’t want to waste my time writing up an example that way.

Thanks for your reply, Robert.
I’m not sure I understand “use a pre-allocated scratchpad” correctly; see the code below. Indeed, moving new/delete out of the functor gives a speedup of about 2.5x (for the real application, not this example). Interestingly enough, the global memory usage on a GTX 1050 Ti didn’t increase significantly. However, the memory for the variable “tempVec” is limited to 4096 bytes (setting the size too high gives: Error: Formal parameter space overflowed (16028 bytes required, max 4096 bytes allowed)). What is the name of this memory? Some temporary buffer for each thread?

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <iostream>

struct functor 
{
  const int R;
  const int C;
  float *arr;
  float tempVec[100];
  float condVec[100];
  
  functor(int _R, int _C, float *_arr) : R(_R), C(_C), arr(_arr) {};

  __host__ __device__
  void operator()(int i) {
    
    // Collect the column entry into a temporary array
    int wi1 = 0;
    for(int j = i; j <= i+(R-1)*C; j = j + C){
        tempVec[wi1] = arr[j];
        condVec[wi1] = arr[j];
        wi1++;  
    }

    /* The main body of the functor starts here and
       in a real application, it is more complicated.
       In this example, I reduced it to some simple
       operation */
    thrust::sort(thrust::device, tempVec, tempVec + R);  
    float thr = tempVec[0]*tempVec[0] - tempVec[R-1];
    /* End of main body */ 


    // write the modified column back to the array
    int wi2 = 0; 
    for(int j = i; j <= i+(R-1)*C; j = j + C){
        
      if( (condVec[wi2] - thr) < 0.0){
        arr[j] = 0.0;
      }else{
        arr[j] = condVec[wi2] - thr;
      }

      wi2++;  
    }
  }
};


int main() {

    const int R = 3; // this will be the rows
    const int C = 4; // and this the columns

    const int vecL = C*R; 

    // initialize host array
    float x[vecL] = {1.0, 2.0, 3.0, 1.5,  4.0, 6.0, 3.0, 3.0,  2.0, 5.4, 1.3, 3.2};

    // transfer to device
    thrust::device_vector<float> d_x(x, x + vecL);

    // call functor
    thrust::for_each_n(thrust::device, thrust::counting_iterator<size_t>(0), C, functor(R, C, thrust::raw_pointer_cast(d_x.data())));
    cudaDeviceSynchronize();
    
    // copy result to host
    thrust::host_vector<float> result_host = d_x;

    // print out result
    for(int i = 0; i<vecL; i++)
      std::cout << result_host[i] << "  ";

    std::cout<< "\n";   

    // result: 4  4  4.31  2.45  7  8  4.31  3.95  5  7.4  2.61  4.15

    return 0;
}

The functor is passed by value to the thrust algorithm. This means that it is effectively an argument to a kernel call “under the hood”. As a result, like all arguments passed to CUDA kernel calls, the aggregate size limit per kernel call is 4096 bytes:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-parameters

I didn’t really have in mind the use of local memory (which is the logical space that your tempVec and condVec reside in, in your updated example), but if everything is working to your satisfaction, there is no particular problem with it.

What I had in mind was pre-allocated space provided e.g. by cudaMalloc, and the base pointer to the pre-allocated scratchpad area would be provided to the functor as an initializing parameter similar to how you pass the _arr parameter. Thereafter, the functor will determine its own slice of the scratchpad area by doing a multiplication of the size of the per-functor space by the main functor argument (i).

  float *tempVec;
  float *condVec;
  
  functor(int _R, int _C, float *_arr, float *_tV, float *_cV) : R(_R), C(_C), arr(_arr), tempVec(_tV), condVec(_cV) {};

  __host__ __device__
  void operator()(int i) {
    tempVec += i*(tempVec_SIZE);
    condVec += i*(condVec_SIZE);
    // Collect the column entry into a temporary array
    int wi1 = 0;

Again, there is no particular reason to consider this if you’re happy with what you’ve got; I don’t expect it to be noticeably faster. However, this would get around the 4096-byte limit if you were using a larger per-thread scratchpad (which would also have implications for total allocation size, so could be problematic anyway). There are methods to limit the scratchpad allocation size to the maximum number of threads resident at runtime, reusing the allocations that way, but they introduce considerable complexity:

https://stackoverflow.com/questions/59109845/memory-allocation-and-indexing-tied-to-sm-core-in-cuda

However such a method would allow you to combine the ideas while using a larger per-thread scratchpad space.