Hi! I am new to cuTensorNet. When I optimize the contraction path, I see that when the number of samples is increased, the CPU utilization goes up, and the total run time increases as well. I have the following questions:

Do these samples have dependencies on each other?

Is it possible to utilize a supercomputer to accelerate the optimization on multiple nodes when the number of samples is high?

Is the sec/sample optimization time on the SDK webpage (cuQuantum SDK | NVIDIA) obtained by dividing the total optimization time by the number of samples?

I am asking because a striking 2.6-second optimization time is claimed for Sycamore m20. However, it is not clear to me whether this 2.6 seconds 'per sample' can translate into a similarly small total optimization time via parallelization (assuming a powerful supercomputer is available).

Thank you so much for the reply; it's very helpful! Gray et al. report that the hyper-optimizer uses Bayesian optimization. Would it be true to say that if I choose samples=1, then within this single sample cuTensorNet uses Bayesian optimization? Or does cuTensorNet not do that?

No, each sample is just a contraction path candidate, and they are all currently evaluated independently, so the entire hyper-sampling procedure should scale very well across many nodes/threads. You just need to activate distributed execution via cutensornetDistributedResetConfiguration() and request a sufficiently large number of samples, based on the number of nodes you plan to run on.
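The independence of samples is what makes distribution easy. The toy Python sketch below is not the cuTensorNet API (distributed runs go through cutensornetDistributedResetConfiguration() with an MPI communicator); the path representation and cost model here are invented purely to illustrate why independent samples parallelize:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def sample_path(seed):
    """Toy stand-in for one hyper-sampler draw: build a random
    candidate contraction order and score it with a made-up cost."""
    rng = random.Random(seed)
    path = rng.sample(range(8), 8)                       # random order of 8 contractions
    cost = sum((i + 1) * p for i, p in enumerate(path))  # fake cost model
    return cost, path

def find_best_path(num_samples, max_workers=4):
    """Evaluate every sample independently (embarrassingly parallel)
    and keep the cheapest candidate. A real distributed run would
    spread the samples over processes or MPI ranks, not threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(sample_path, range(num_samples)))
    return min(results)

best_cost, best_path = find_best_path(64)
print(best_cost, best_path)
```

Because no sample reads another sample's result, the final reduction is a single `min` over all candidates, which is exactly why the procedure scales across nodes.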

Let me reiterate the answers above with more details.
cuTensorNet has two components: 1) the pathfinder (the phase that finds a path optimizing the contraction sequence in order to minimize the computational cost), and 2) the execution of the contractions on the GPU.

The pathfinder generates a path for each sample and keeps the best one. This phase is CPU-only, and the samples are completely independent from each other, so they can be generated in parallel (multiple threads and/or multiple nodes). Individual samples can take different amounts of time: taking the 2.6 sec figure as an example, you may find some samples taking 0.8 sec and others 4 sec; the 2.6 sec is the average over 1000 samples. If you run more samples, the average time will likely be smaller.
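The arithmetic behind the per-sample figure and the ideal parallel speed-up can be sketched as follows (the individual sample times and worker count below are illustrative, not measured):

```python
# Illustrative numbers only: per-sample pathfinder times vary around the mean.
sample_times = [0.8, 4.0, 2.0, 3.6]           # seconds, made-up examples
avg = sum(sample_times) / len(sample_times)   # the reported "sec/sample" is such an average

# With independent samples, total time scales as samples / workers.
num_samples = 1000
avg_reported = 2.6                            # sec/sample, from the thread
total_serial = num_samples * avg_reported     # one worker runs all samples back to back
workers = 100                                 # hypothetical node/thread count
ideal_wall_time = total_serial / workers      # embarrassingly parallel ideal

print(avg, total_serial, ideal_wall_time)
```

So a 2.6 sec/sample average does not mean 2.6 sec total: with 1000 samples on one worker the total is 2600 sec, but since the samples are independent, 100 workers would ideally bring the wall time down to about 26 sec.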
Once a path is defined, the execution phase occurs, and here you also have two additional optimization features: autotuning for kernels and autotuning for mode ordering. If you have many slices, or you want to perform many executions, it is better to run with autotuning enabled; you will get the best performance.
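The idea behind autotuning can be shown with a toy sketch. This is not cuTensorNet's autotuner (which times GPU contraction kernels); the candidate functions below are made-up stand-ins for kernel variants:

```python
import time

def autotune(candidates, warmups=1, iters=3):
    """Toy autotuner: run each candidate a few times, average the
    timings, and keep the fastest. The warm-up run is excluded so
    one-time setup costs do not skew the comparison."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        for _ in range(warmups):
            fn()                                  # warm-up, not timed
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        elapsed = (time.perf_counter() - start) / iters
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

# Two made-up 'kernel variants' with very different costs.
def fast_kernel():
    return sum(range(1_000))

def slow_kernel():
    return sum(range(500_000))

print(autotune({"fast": fast_kernel, "slow": slow_kernel}))
```

The one-time tuning cost is amortized when the chosen variant is then executed many times, which is why autotuning pays off with many slices or repeated executions.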

The hyper-sampling procedure should be embarrassingly parallel (no dependencies between samples).

Yes, the automatic distributed parallelization feature of cuTensorNet will parallelize the pathfinder's hyper-sampling procedure across many nodes.