Hello everyone,
I am a master’s researcher at UFCG working on optimizing an algorithm to reduce its execution time as much as possible. The computation combines several matrices element-wise with scalar constants.
Formula:
evapotranspiration_24h_d[pos] = (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6))) * (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) * net_radiation_24h_d[pos];
I’ve implemented this using both Stream Processors and cuTensor, as shown below:
Stream Processors:
__global__ void evapotranspiration_24h_kernel(float *surface_temperature_d, float *latent_heat_flux_d, float *net_radiation_d, float *soil_heat_d, float *net_radiation_24h_d, float *evapotranspiration_24h_d) {
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < width_d * height_d) {
        unsigned int row = idx / width_d;
        unsigned int col = idx % width_d;
        unsigned int pos = row * width_d + col;
        float temperature_celcius = surface_temperature_d[pos] - 273.15;
        evapotranspiration_24h_d[pos] = (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6))) * (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) * net_radiation_24h_d[pos];
    }
}
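Two things stand out in the kernel above, independent of any library choice: pos always equals idx (row * width_d + col collapses back to idx), and the double-precision literals (86400, 2.501, 0.0236, 273.15) plus pow(10, 6) promote the arithmetic to FP64, which GeForce-class Ampere GPUs execute at 1/64 of the FP32 rate. A sketch of an all-FP32 variant (assuming, as in the original, that width_d and height_d are __device__ globals):

```cuda
__global__ void evapotranspiration_24h_kernel(const float *surface_temperature_d,
                                              const float *latent_heat_flux_d,
                                              const float *net_radiation_d,
                                              const float *soil_heat_d,
                                              const float *net_radiation_24h_d,
                                              float *evapotranspiration_24h_d) {
    unsigned int pos = threadIdx.x + blockIdx.x * blockDim.x;  // pos == idx
    if (pos < width_d * height_d) {
        float temperature_celcius = surface_temperature_d[pos] - 273.15f;
        // All-float literals and 1.0e6f keep the arithmetic in FP32
        // (no FP64 promotion, no pow() call for a compile-time constant).
        evapotranspiration_24h_d[pos] =
            (86400.0f / ((2.501f - 0.0236f * temperature_celcius) * 1.0e6f)) *
            (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) *
            net_radiation_24h_d[pos];
    }
}
```

It would be worth re-timing this variant before drawing conclusions about CUDA cores vs cuTENSOR, since the FP64 promotion alone could plausibly account for a large share of the 19.06 ms.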
cuTensor:
// tensor_plan_binary_div config:
HANDLE_CUTENSOR_ERROR(cutensorCreateElementwiseBinary(this->handle, &desc,
this->descA, this->axis.data(), CUTENSOR_OP_IDENTITY,
this->descB, this->axis.data(), CUTENSOR_OP_RCP,
this->descC, this->axis.data(),
CUTENSOR_OP_MUL, descCompute));
// tensor_plan_binary_add config:
HANDLE_CUTENSOR_ERROR(cutensorCreateElementwiseBinary(this->handle, &desc,
this->descA, this->axis.data(), CUTENSOR_OP_IDENTITY,
this->descB, this->axis.data(), CUTENSOR_OP_IDENTITY,
this->descC, this->axis.data(),
CUTENSOR_OP_ADD, descCompute));
// (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6)))
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_add, (void *)&pos1, products.surface_temperature_d, (void *)&neg27315, products.only1_d, products.tensor_aux1_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_add, (void *)&pos2501, products.only1_d, (void *)&neg00236, products.tensor_aux1_d, products.tensor_aux1_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorPermute(tensors.handle, tensors.tensor_plan_permute_id, (void *)&pow10, products.tensor_aux1_d, products.tensor_aux1_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_div, (void *)&pos86400, products.only1_d, (void *)&pos1, products.tensor_aux1_d, products.tensor_aux1_d, tensors.stream));
// (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos]))
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_add, (void *)&pos1, products.net_radiation_d, (void *)&neg1, products.soil_heat_d, products.evapotranspiration_24h_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_div, (void *)&pos1, products.latent_heat_flux_d, (void *)&pos1, products.evapotranspiration_24h_d, products.evapotranspiration_24h_d, tensors.stream));
// evapotranspiration_24h_d[pos] = (86400 / ((2.501 - 0.0236 * temperature_celcius) * pow(10, 6))) * (latent_heat_flux_d[pos] / (net_radiation_d[pos] - soil_heat_d[pos])) * net_radiation_24h_d[pos];
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_mult, (void *)&pos1, products.tensor_aux1_d, (void *)&pos1, products.evapotranspiration_24h_d, products.evapotranspiration_24h_d, tensors.stream));
HANDLE_CUTENSOR_ERROR(cutensorElementwiseBinaryExecute(tensors.handle, tensors.tensor_plan_binary_mult, (void *)&pos1, products.evapotranspiration_24h_d, (void *)&pos1, products.net_radiation_24h_d, products.evapotranspiration_24h_d, tensors.stream));
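For readability, and assuming the scalar names mean what they suggest (pos1 = 1, neg27315 = -273.15, pos2501 = 2.501, neg00236 = -0.0236, pow10 = 1e6, pos86400 = 86400, neg1 = -1), the eight cuTENSOR calls above compute, step by step:

```
aux1 = 1 * surface_temperature + (-273.15) * only1   // temperature in Celsius
aux1 = 2.501 * only1 + (-0.0236) * aux1              // 2.501 - 0.0236 * T
aux1 = 1e6 * aux1                                    // scale via permute
aux1 = (86400 * only1) * rcp(1 * aux1)               // 86400 / (...)
et   = 1 * net_radiation + (-1) * soil_heat          // Rn - G
et   = (1 * latent_heat_flux) * rcp(1 * et)          // LE / (Rn - G)
et   = (1 * aux1) * (1 * et)
et   = (1 * et) * (1 * net_radiation_24h)            // final result
```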
Experiment details:
- GPU: [specific model - e.g., RTX 3090/A100]
- Architecture: Ampere
- Precision: FP32
- Operation: Hadamard product (element-wise)
- Matrix dimensions: [dimensions]
- Libraries compared: cuTENSOR vs [implementation used with CUDA Cores]
- Metrics: [execution time/throughput observed in both cases]
In this experiment, the Stream Processors (plain CUDA kernel) implementation took 19.06 ms, while the cuTENSOR implementation took only 5.32 ms.
I would like help to better understand several aspects:

- Although it significantly reduces execution time, this code feels somewhat hacky. Is there a more appropriate way to implement it?
- Have you explored similar approaches? Has the NVIDIA team seen other researchers report similar performance gains for element-wise operations using cuTENSOR?
- Is there any documentation on the applicability of tensor operations to element-wise computations?
- How would you explain this time difference?
- Does this approach make sense? Is this performance gain reasonable?