Performance difference in large two-tensor contraction between cuQuantum and CuPy

Hello,

I am testing the performance of cuquantum.tensornet.contract against cupy.einsum for a simple two-tensor contraction. I noticed that cuquantum.tensornet.contract is much slower in this case.

Here is the minimal code I used:

import time, cupy as cp
from cuquantum import tensornet

print("CuPy version:", cp.__version__)

d = 1000
t1 = cp.random.standard_normal([500, d, 30])
t2 = cp.random.standard_normal([d, 500, 30])

start = time.time()
for i in range(10):
    A = cp.einsum("axb,xij->aibj", t1, t2)  # cupy
    # A = tensornet.contract("axb,xij->aibj", t1, t2)  # cuquantum
end = time.time()

print("elapsed:", end - start, "s")

Results:

  • cupy.einsum: ~0.01s

  • tensornet.contract: ~18s

  • For PyTorch tensors, torch.einsum and tensornet.contract show similar timings.

Question:

  • Is this large performance gap expected for two-tensor contractions?

  • Or am I missing something (e.g., options, path planning, autotuning, etc.) to make tensornet.contract more efficient in this simple case?

The environment is the T4 GPU runtime on Google Colab.

Hello,

There are a couple of things coupled here.

  1. GPU work is often asynchronous, so a CPU timer does not provide accurate timings. The recommendation is to switch to event-based timers, or simply use cupyx.profiler.benchmark(). You may also play around with the snippet attached below.
  2. cuquantum.tensornet.contract is generally meant for handling large tensor network contractions with optimal path finding. Performance is not guaranteed to be optimal for small networks, e.g., a binary tensor contraction. This is especially true in your case, where the problem essentially reduces to a single GEMM: CuPy likely dispatches directly to cuBLAS, which is optimal by nature, whereas cuTensorNet generally dispatches to cuTENSOR, with the ability to autotune for generic einsum problems. Please check the snippet below for the recommended non-blocking, class-based usage.
  3. For future inquiries, we would also encourage you to post on our GitHub repository under either the Issues or Discussions page for a quicker response.
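As a side note on point 2, you can see why this contraction is "just a GEMM": flattening the free indices and keeping the contracted index x inner reproduces the einsum result exactly. A minimal sketch with NumPy and toy sizes (the same transpose/reshape steps apply unchanged to CuPy arrays):

```python
import numpy as np

# Toy sizes standing in for the original (500, 1000, 30) shapes.
a, x, b, i, j = 4, 8, 3, 5, 2
t1 = np.random.standard_normal((a, x, b))
t2 = np.random.standard_normal((x, i, j))

# The einsum form of the contraction.
ref = np.einsum("axb,xij->aibj", t1, t2)

# The same contraction as one matrix product: move the contracted
# index x to the inner position, flatten the free indices on each
# side, multiply, then restore the aibj output layout.
lhs = t1.transpose(0, 2, 1).reshape(a * b, x)                 # (a*b, x)
rhs = t2.reshape(x, i * j)                                    # (x, i*j)
out = (lhs @ rhs).reshape(a, b, i, j).transpose(0, 2, 1, 3)   # (a, i, b, j)

print(np.allclose(ref, out))  # True
```

Since the whole problem is one (a*b, x) @ (x, i*j) matrix multiply, a library that hands it straight to a tuned GEMM has little room left for path optimization to help.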
import cupy as cp
from cuquantum import tensornet

from cupyx.profiler import benchmark

d = 1000
t1 = cp.random.standard_normal([500, d, 30])
t2 = cp.random.standard_normal([d, 500, 30])

# 1. cupy
print(benchmark(cp.einsum, ("axb,xij->aibj", t1, t2), n_repeat=20))  
# 2. function version
print(benchmark(tensornet.contract, ("axb,xij->aibj", t1, t2), kwargs={'options': {'blocking': "auto"}}, n_repeat=20))  

# 3. class version
tn = tensornet.Network("axb,xij->aibj", t1, t2, options={'blocking': "auto"})

# find the best path, only once
tn.contract_path()

# optional, autotune to find the optimal kernel
tn.autotune(iterations=5)

print(benchmark(tn.contract, n_repeat=20)) 
# release resource
tn.free()