Experiment to reduce time execution with tensors

Hello everyone,

I am a master’s researcher at UFCG and I’m working on optimizing an algorithm to reduce its execution time as much as possible. The calculation is based on multiplying a matrix by constants.

Formula:

evapotranspiration_24h_d[pos] = (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6))) * (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) * net_radiation_24h_d[pos];

I’ve implemented this using both Stream Processors and cuTensor, as shown below:
Stream Processors:

global void evapotranspiration_24h_kernel(float *surface_temperature_d, float *latent_heat_flux_d, float *net_radiation_d, float *soil_heat_d, float *net_radiation_24h_d, float *evapotranspiration_24h_d) {
   unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
   if (idx < width_d * height_d) {
      unsigned int row = idx / width_d;
      unsigned int col = idx % width_d;
      unsigned int pos = row * width_d + col;
      float temperature_celcius = surface_temperature_d[pos] - 273.15;
      evapotranspiration_24h_d[pos] = (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6))) * (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) * net_radiation_24h_d[pos];
  }
}

cuTensor:

// tensor_plan_binary_div config:
HANDLE_CUTENSOR_ERROR(cutensorCreateElementwiseBinary(this->handle, &desc,
this->descA, this->axis.data(), CUTENSOR_OP_IDENTITY,
this->descB, this->axis.data(), CUTENSOR_OP_RCP,
this->descC, this->axis.data(),
CUTENSOR_OP_MUL, descCompute));

// tensor_plan_binary_add config:
HANDLE_CUTENSOR_ERROR(cutensorCreateElementwiseBinary(this->handle, &desc,
this->descA, this->axis.data(), CUTENSOR_OP_IDENTITY,
this->descB, this->axis.data(), CUTENSOR_OP_IDENTITY,
this->descC, this->axis.data(),
CUTENSOR_OP_ADD, descCompute));

// (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6)))    HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_add, (void *)&pos1, products.surface_temperature_d, (void *)&neg27315, products.only1_d, products.tensor_aux1_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_add, (void *)&pos2501, products.only1_d, (void *)&neg00236, products.tensor_aux1_d, products.tensor_aux1_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorPermute(tensors.handle, tensors.tensor_plan_permute_id, (void *)&pow10, products.tensor_aux1_d, products.tensor_aux1_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_div, (void *)&pos86400, products.only1_d, (void *)&pos1, products.tensor_aux1_d, products.tensor_aux1_d, tensors.stream));

// (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos]))    HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_add, (void *)&pos1, products.net_radiation_d, (void *)&neg1, products.soil_heat_d, products.evapotranspiration_24h_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_div, (void *)&pos1, products.latent_heat_flux_d, (void *)&pos1, products.evapotranspiration_24h_d, products.evapotranspiration_24h_d, tensors.stream));

// evapotranspiration_24h_d[pos] = (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6))) * (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) * net_radiation_24h_d[pos];
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_mult, (void *)&pos1, products.tensor_aux1_d, (void *)&pos1, products.evapotranspiration_24h_d, products.evapotranspiration_24h_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_mult, (void *)&pos1, products.evapotranspiration_24h_d, (void *)&pos1, products.net_radiation_24h_d, products.evapotranspiration_24h_d, tensors.stream));

Experiment details:

  • GPU: [specific model - e.g., RTX 3090/A100]
  • Architecture: Ampere
  • Precision: FP32
  • Operation: Hadamard product (element-wise)
  • Matrix dimensions: [dimensions]
  • Libraries compared: cuTENSOR vs [implementation used with CUDA Cores]
  • Metrics: [execution time/throughput observed in both cases]

In this optimization, the Stream Processors implementation took 19.06 ms while the cuTensor implementation took only 5.32 ms.

I would like help to better understand several aspects:

  1. Despite significantly reducing the execution time, this code seems somewhat hacky? Is there a more appropriate way to implement it?

  2. Have you explored any similar approaches? Has the NVIDIA team encountered other researchers observing similar performance gains for element-wise operations using cuTENSOR?

  3. Is there any documentation or specific applicability of tensor operations for element-wise operations?

  4. How would you explain this time difference?

  5. Does this approach make sense? Is this performance gain reasonable?

[1] Your variables are of type float, but your literal floating-point constants are of type double, turning the entire computation into double-computation, which is very slow on consumer-grade GPUs (e.g. RTX 3090).

[2] In general, pow() is one of the slowest math functions. While the CUDA math library may include optimizations for pow() with integer arguments, and the compiler may optimize to compile-time computation, this is not something that can be assumed and relied on. Most likely this maps to a slow double implementation of pow(). Best to just use a floating-point literal: 10e6f or 1000000.0f.

Unless width_d is a compile-time constant, this will result in comparatively slow integer divisions. Whether the CUDA compiler can unify the / and % operators into a single division (with back multiply to compute the remainder) is not certain.

On second thought, this entire row, col computation seems redundant. Isn’t it the case that pos == idx, and row and col are never used elsewhere?

__global__ void evapotranspiration_24h_kernel(float *surface_temperature_d, float *latent_heat_flux_d, float *net_radiation_d, float *soil_heat_d, float *net_radiation_24h_d, float *evapotranspiration_24h_d) {

It probably does not matter in a kernel this simple as the compiler can probably figure this out by itself, but in general I would suggest using const and __restrict__ qualifiers as applicable:

__ global__ void evapotranspiration_24h_kernel (const float * __restrict__ surface_temperature_d,
                                                const float * __restrict__ latent_heat_flux_d, 
                                                const float * __restrict__ net_radiation_d, 
                                                const float * __restrict__ soil_heat_d, 
                                                const float * __restrict__ net_radiation_24h_d, 
                                                float * __restrict__ evapotranspiration_24h_d) 

Please note that the CUDA compiler treats floating-point expressions conservatively and will not perform floating-point re-association with the exception of contracting FMUL with dependent FADD into FMA. You would therefore want to apply your own algebraic simplifications where numerical properties allow it. For example:

__global__ void evapotranspiration_24h_kernel (const float * __restrict__ surface_temperature_d, 
                                               const float * __restrict__ latent_heat_flux_d, 
                                               const float * __restrict__ net_radiation_d, 
                                               const float * __restrict__ soil_heat_d, 
                                               const float * __restrict__ net_radiation_24h_d, 
                                               float * __restrict__ evapotranspiration_24h_d) 
{
   unsigned int pos = threadIdx.x + blockIdx.x * blockDim.x;
   if (pos < width_d * height_d) {
      evapotranspiration_24h_d[pos]=(-3.66102f * latent_heat_flux_d[pos] * net_radiation_24h_d[pos]) / 
                                    ((surface_temperature_d[pos] - 379.125f) * (net_radiation_d[pos] - soil_heat_d[pos]));
  }
}
1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.