Shared Array of Custom Objects in Kernels

Hi! I want to use an array of const objects, which represent mathematical functions, in all threads of my kernel. I want to achieve something like the pseudocode below, which finds the highest y = f(x) from a list of functions funcs:

__global__ void get_highest_value(int *x_vals, function **funcs, int *i_max_out, int *y_max_out) {

    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int x = x_vals[id];

    // Assume y = f(x)
    int y_max = 0;
    int i_max = 0;

    int y_temp = 0;
    for(int i = 0; i < 10; i++) { // Assume 10 functions (temporary)

        // Evaluate for every function
        y_temp = funcs[i]->evaluate(x);

        // Store highest y
        if(y_temp > y_max) {
            i_max = i;
            y_max = y_temp;
        }
    }

    // return highest function and value
    i_max_out[id] = i_max;
    y_max_out[id] = y_max;
}

This code is a proof of concept for others, so I want to keep the code in this format and avoid making big changes (there’s no need to optimise yet). Is this possible?

I’m not sure how to use the __shared__ qualifier for custom objects. Looking forward to some feedback!

For what purpose do you want to use shared memory?

I’m worried that using pointers to the same array would lead to race conditions, and copying the array in every thread might be a bad idea (I’m fairly new to GPU programming, so there may be something I’m missing!)

Multiple accesses to const/read-only data should not present the possibility of race conditions.

Your pseudocode doesn’t copy anything to every thread, so I’m not sure why you would mention that. There are no thread-local copies of the array associated with funcs in your pseudocode.

Each of your threads loads an individual x (in a nicely coalesced fashion), and all threads read the same shared funcs data from memory.

Can you tell us what kind of member variables, and how many of them, are stored in each element of funcs and accessed by the evaluate() function?

Those accesses can be implicitly optimized by the L1 and L2 caches; the data could also be placed in constant memory or read with the __ldg intrinsic. It could also be copied to shared memory, or perhaps recalculated from scratch each time.
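
To make two of those options concrete, here is a minimal sketch. The coefficient array c_coeffs, the eval_demo kernel and its parameters are made up purely for illustration and are not taken from your code:

__constant__ double c_coeffs[10];   // option 1: constant memory, visible to all threads

__global__ void eval_demo(const double *g_coeffs, const int *x_vals, double *out) {

    int id = blockIdx.x * blockDim.x + threadIdx.x;
    double x = x_vals[id];

    double a = c_coeffs[id % 10];           // broadcast read from constant memory
    double b = __ldg(&g_coeffs[id % 10]);   // option 2: read-only load via the __ldg intrinsic

    out[id] = a * x + b;                    // stand-in for something like funcs[i]->value(x)
}

// On the host, constant memory is filled once before the launch:
// cudaMemcpyToSymbol(c_coeffs, h_coeffs, 10 * sizeof(double));

Which of these is worthwhile depends on how much data each evaluate() call touches, which is why the question above matters.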

Without knowing a bit more, it is difficult to recommend one way of doing it.

Or are you just asking for any working implementation that is as similar to this one as possible? Is function a std::function, or can you provide a proof-of-concept definition? Is evaluate a virtual member function?

Thank you all for the feedback so far! I’m attaching a much more thorough pseudocode example, which might better explain my question:

#include <climits>
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Function Class
class func {

  private:

    int is_positive_always;  // is f(x) Positive at all x?

  public:

    func(int is_positive_always) : is_positive_always(is_positive_always) {}
    virtual ~func() {}

    // Function Evaluation
    __device__ virtual double value(double x) = 0;

    // Positive
    int positive_always() { return is_positive_always; }
};



// Example of what a user might define: Here, just a sine function
class sine : public func {

  public:

    sine() : func(0) {}

    // Function Evaluation
    __device__ double value(double x) override { return sin(x); }

};



// Corrected kernel signature and logic
__global__ void get_highest_value(int *x_vals, func **funcs, int *i_max_out,
                                  int *y_max_out) {
  
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  int x = x_vals[id];  // Corrected declaration

  int y_max = INT_MIN;  // Use INT_MIN to ensure any value is larger
  int i_max = 0;

  for (int i = 0; i < 10; i++) {
    int y_temp = funcs[i]->value(x);  // Corrected method call and syntax
    if (y_temp > y_max) {
      i_max = i;
      y_max = y_temp;
    }
  }

  i_max_out[id] = i_max;
  y_max_out[id] = y_max;
}



// Might have some logic issues here!
int main() {
  // Define functions array
  func **funcs;
  funcs = new func *[10];  // Assume 10 functions

  // Set up functions - for now, only sine
  for (int i = 0; i < 10; i++) {
    funcs[i] = new sine();
  }

  // Define x values
  thrust::device_vector<int> x_vals(1000);
  thrust::device_vector<int> y_max_out(1000);
  thrust::device_vector<int> i_max_out(1000);

  // Set up x values
  for (int i = 0; i < 1000; i++) {
    x_vals[i] = i;
  }

  // Call kernel
  get_highest_value<<<100, 10>>>(thrust::raw_pointer_cast(x_vals.data()),
                                 thrust::raw_pointer_cast(funcs),
                                 thrust::raw_pointer_cast(i_max_out.data()),
                                 thrust::raw_pointer_cast(y_max_out.data()));

  // Copy back y_max_out and i_max_out
  thrust::host_vector<int> y_max_out_h = y_max_out;
  thrust::host_vector<int> i_max_out_h = i_max_out;

  // Print out results
  for (int i = 0; i < 1000; i++) {
    std::cout << "x: " << x_vals[i] << " i_max: " << i_max_out_h[i]
              << " y_max: " << y_max_out_h[i] << std::endl;
  }

  // Free memory
  for (int i = 0; i < 10; i++) {
    delete funcs[i];
  }

  delete[] funcs;

  return 0;
}

For context, this style of programming is the norm in my area of research (a very relevant example from a tutorial can be found here: ps/.shower.py · master · Stefan Hoeche / tutorials · GitLab)

My aim is to demonstrate that we don’t need to deviate too much from our original code to run on a GPU.

Using the classes in CUDA code is not a problem by itself, but bridging host and device code is:

If an object is created in host code, invoking a virtual function for that object in device code has undefined behavior.

You somehow have to recreate the objects in device code. Or make them work without virtual functions.
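
For the first option, a common pattern is roughly the sketch below. It assumes the func and sine constructors/destructors shown above are additionally marked __host__ __device__ so that the objects can be created with device-side new; the kernel names create_funcs/destroy_funcs are just placeholders:

__global__ void create_funcs(func **funcs, int n) {
  // A single thread constructs the objects in device memory, so their
  // vtable pointers are valid for virtual calls from other kernels.
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    for (int i = 0; i < n; i++) {
      funcs[i] = new sine();
    }
  }
}

__global__ void destroy_funcs(func **funcs, int n) {
  // Objects created with device-side new must also be deleted on the device.
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    for (int i = 0; i < n; i++) {
      delete funcs[i];
    }
  }
}

// Host side: allocate only the array of pointers, then build the objects on the device.
// func **d_funcs;
// cudaMalloc(&d_funcs, 10 * sizeof(func *));
// create_funcs<<<1, 1>>>(d_funcs, 10);
// cudaDeviceSynchronize();
// ... pass d_funcs to get_highest_value ...
// destroy_funcs<<<1, 1>>>(d_funcs, 10);
// cudaFree(d_funcs);

The objects then live on the device heap; if you ever create many of them, you may need to raise the heap limit with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) before the first kernel launch.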

Ok - I’ll try to construct the objects directly on the device. I will get back to you guys in a few days!