Hello everyone,
I am a master’s researcher at UFCG working on optimizing an algorithm to reduce its execution time as much as possible. The computation combines several matrices element-wise with scalar constants.
Formula:
evapotranspiration_24h_d[pos] = (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6))) * (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) * net_radiation_24h_d[pos];
I’ve implemented this using both Stream Processors and cuTensor, as shown below:
Stream Processors:
__global__ void evapotranspiration_24h_kernel(float *surface_temperature_d, float *latent_heat_flux_d, float *net_radiation_d, float *soil_heat_d, float *net_radiation_24h_d, float *evapotranspiration_24h_d) {
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < width_d * height_d) {
        unsigned int row = idx / width_d;
        unsigned int col = idx % width_d;
        unsigned int pos = row * width_d + col;
        float temperature_celcius = surface_temperature_d[pos] - 273.15;
        evapotranspiration_24h_d[pos] = (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6))) * (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) * net_radiation_24h_d[pos];
    }
}
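Two things stand out in the kernel above, independent of any library choice: pos always equals idx (row * width_d + col collapses back to idx), and the double-precision literals (86400, 2.501, 0.0236, 273.15) plus pow(10, 6) promote the arithmetic to FP64, which GeForce-class Ampere GPUs execute at 1/64 of the FP32 rate. A sketch of an all-FP32 variant (assuming, as in the original, that width_d and height_d are __device__ globals):

```cuda
__global__ void evapotranspiration_24h_kernel(const float *surface_temperature_d,
                                              const float *latent_heat_flux_d,
                                              const float *net_radiation_d,
                                              const float *soil_heat_d,
                                              const float *net_radiation_24h_d,
                                              float *evapotranspiration_24h_d) {
    unsigned int pos = threadIdx.x + blockIdx.x * blockDim.x;  // pos == idx
    if (pos < width_d * height_d) {
        float temperature_celcius = surface_temperature_d[pos] - 273.15f;
        // All-float literals and 1.0e6f keep the arithmetic in FP32
        // (no FP64 promotion, no pow() call for a compile-time constant).
        evapotranspiration_24h_d[pos] =
            (86400.0f / ((2.501f - 0.0236f * temperature_celcius) * 1.0e6f)) *
            (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) *
            net_radiation_24h_d[pos];
    }
}
```

It would be worth re-timing this variant before drawing conclusions about CUDA cores vs cuTENSOR, since the FP64 promotion alone could plausibly account for a large share of the 19.06 ms.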
cuTensor:
// tensor_plan_binary_div config:
HANDLE_CUTENSOR_ERROR(cutensorCreateElementwiseBinary(this->handle, &desc,
this->descA, this->axis.data(), CUTENSOR_OP_IDENTITY,
this->descB, this->axis.data(), CUTENSOR_OP_RCP,
this->descC, this->axis.data(),
CUTENSOR_OP_MUL, descCompute));
// tensor_plan_binary_add config:
HANDLE_CUTENSOR_ERROR(cutensorCreateElementwiseBinary(this->handle, &desc,
this->descA, this->axis.data(), CUTENSOR_OP_IDENTITY,
this->descB, this->axis.data(), CUTENSOR_OP_IDENTITY,
this->descC, this->axis.data(),
CUTENSOR_OP_ADD, descCompute));
// (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6)))
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_add, (void *)&pos1, products.surface_temperature_d, (void *)&neg27315, products.only1_d, products.tensor_aux1_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_add, (void *)&pos2501, products.only1_d, (void *)&neg00236, products.tensor_aux1_d, products.tensor_aux1_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorPermute(tensors.handle, tensors.tensor_plan_permute_id, (void *)&pow10, products.tensor_aux1_d, products.tensor_aux1_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_div, (void *)&pos86400, products.only1_d, (void *)&pos1, products.tensor_aux1_d, products.tensor_aux1_d, tensors.stream));
// (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos]))
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_add, (void *)&pos1, products.net_radiation_d, (void *)&neg1, products.soil_heat_d, products.evapotranspiration_24h_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_div, (void *)&pos1, products.latent_heat_flux_d, (void *)&pos1, products.evapotranspiration_24h_d, products.evapotranspiration_24h_d, tensors.stream));
// evapotranspiration_24h_d[pos] = (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6))) * (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) * net_radiation_24h_d[pos];
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_mult, (void *)&pos1, products.tensor_aux1_d, (void *)&pos1, products.evapotranspiration_24h_d, products.evapotranspiration_24h_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_mult, (void *)&pos1, products.evapotranspiration_24h_d, (void *)&pos1, products.net_radiation_24h_d, products.evapotranspiration_24h_d, tensors.stream));
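For readability, and assuming the scalar names mean what they suggest (pos1 = 1, neg27315 = -273.15, pos2501 = 2.501, neg00236 = -0.0236, pow10 = 1e6, pos86400 = 86400, neg1 = -1), the eight cuTENSOR calls above compute, step by step:

```
aux1 = 1 * surface_temperature + (-273.15) * only1   // temperature in Celsius
aux1 = 2.501 * only1 + (-0.0236) * aux1              // 2.501 - 0.0236 * T
aux1 = 1e6 * aux1                                    // scale via permute
aux1 = (86400 * only1) * rcp(1 * aux1)               // 86400 / (...)
et   = 1 * net_radiation + (-1) * soil_heat          // Rn - G
et   = (1 * latent_heat_flux) * rcp(1 * et)          // LE / (Rn - G)
et   = (1 * aux1) * (1 * et)
et   = (1 * et) * (1 * net_radiation_24h)            // final result
```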
Experiment details:
- GPU: [specific model - e.g., RTX 3090/A100]
- Architecture: Ampere
- Precision: FP32
- Operation: Hadamard product (element-wise)
- Matrix dimensions: [dimensions]
- Libraries compared: cuTENSOR vs [implementation used with CUDA Cores]
- Metrics: [execution time/throughput observed in both cases]
In this experiment, the Stream Processors (plain CUDA kernel) implementation took 19.06 ms, while the cuTENSOR implementation took only 5.32 ms.
I would like help to better understand several aspects:

- Although it significantly reduces execution time, this code feels somewhat hacky. Is there a more appropriate way to implement it?
- Have you explored similar approaches? Has the NVIDIA team seen other researchers report similar performance gains for element-wise operations using cuTENSOR?
- Is there any documentation on the applicability of tensor operations to element-wise computations?
- How would you explain this time difference?
- Does this approach make sense? Is this performance gain reasonable?