Performance difference in large two-tensor contraction between cuQuantum and CuPy

Hello,

I am testing the performance of cuquantum.tensornet.contract against cupy.einsum for a simple two-tensor contraction. I noticed that cuquantum.tensornet.contract is much slower in this case.

Here is the minimal code I used:

import time, cupy as cp
from cuquantum import tensornet

print("CuPy version:", cp.__version__)

d = 1000
t1 = cp.random.standard_normal([500, d, 30])
t2 = cp.random.standard_normal([d, 500, 30])

start = time.time()
for i in range(10):
    A = cp.einsum("axb,xij->aibj", t1, t2)  # cupy
    # A = tensornet.contract("axb,xij->aibj", t1, t2)  # cuquantum
end = time.time()

print("elapsed:", end - start, "s")

Results:

  • cupy.einsum: ~0.01s

  • tensornet.contract: ~18s

  • For PyTorch tensors, torch.einsum and tensornet.contract show similar timings.

Question:

  • Is this large performance gap expected for two-tensor contractions?

  • Or am I missing something (e.g., options, path planning, autotuning, etc.) to make tensornet.contract more efficient in this simple case?

The environment is the T4 GPU runtime on Google Colab.

Hello,

There are a couple of things coupled here.

  1. GPU work is often asynchronous, so a CPU timer does not provide accurate timings. The recommendation is to switch to event-based timers, or simply use cupyx.profiler.benchmark(). You may also play around with the snippet attached below.
  2. cuquantum.tensornet.contract is generally meant for handling large tensor network contractions with optimal path finding. Performance is not guaranteed to be optimal for small networks, e.g., a binary tensor contraction. This is especially true in your case, where the problem essentially reduces to a single GEMM: CuPy likely dispatches directly to cuBLAS, which is optimal by nature, whereas cuTensorNet generally dispatches to cuTENSOR, with the ability to autotune for generic einsum problems. Please check the snippet below for the recommended non-blocking, class-based usage.
  3. For future inquiries, we would also encourage you to post on our GitHub repository under either the Issues or Discussions page for a quicker response.
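As a side note on point 2, you can see why this contraction is "just a GEMM": flattening the free indices and keeping the contracted index x inner reproduces the einsum result exactly. A minimal sketch with NumPy and toy sizes (the same transpose/reshape steps apply unchanged to CuPy arrays):

```python
import numpy as np

# Toy sizes standing in for the original (500, 1000, 30) shapes.
a, x, b, i, j = 4, 8, 3, 5, 2
t1 = np.random.standard_normal((a, x, b))
t2 = np.random.standard_normal((x, i, j))

# The einsum form of the contraction.
ref = np.einsum("axb,xij->aibj", t1, t2)

# The same contraction as one matrix product: move the contracted
# index x to the inner position, flatten the free indices on each
# side, multiply, then restore the aibj output layout.
lhs = t1.transpose(0, 2, 1).reshape(a * b, x)                 # (a*b, x)
rhs = t2.reshape(x, i * j)                                    # (x, i*j)
out = (lhs @ rhs).reshape(a, b, i, j).transpose(0, 2, 1, 3)   # (a, i, b, j)

print(np.allclose(ref, out))  # True
```

Since the whole problem is one (a*b, x) @ (x, i*j) matrix multiply, a library that hands it straight to a tuned GEMM has little room left for path optimization to help.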
import cupy as cp
from cuquantum import tensornet

from cupyx.profiler import benchmark

d = 1000
t1 = cp.random.standard_normal([500, d, 30])
t2 = cp.random.standard_normal([d, 500, 30])

# 1. cupy
print(benchmark(cp.einsum, ("axb,xij->aibj", t1, t2), n_repeat=20))  
# 2. function version
print(benchmark(tensornet.contract, ("axb,xij->aibj", t1, t2), kwargs={'options': {'blocking': "auto"}}, n_repeat=20))  

# 3. class version
tn = tensornet.Network("axb,xij->aibj", t1, t2, options={'blocking': "auto"})

# find the best path, only once
tn.contract_path()

# optional, autotune to find the optimal kernel
tn.autotune(iterations=5)

print(benchmark(tn.contract, n_repeat=20)) 
# release resource
tn.free()